Archive

Archive for the ‘watson explorer’ Category

Implementing Natural Language Query with IBM Watson Explorer

If you have a Watson Explorer (WEX) collection and want to be able to handle with Natural Query Language, you need to know that since WEX release 11.0.1, it have a native component to handle with this – its the query-modifier service.

Basically, this service parse the queries and apply some strategies, transforming the query in Keywords that WEX can understand and apply in the queries. Lets suppose that user search is:

“I’m looking for a Java Developer that know Struts and Spring and work from Brazil.”

The service will extract the keywords, based on configurations, and will search for:

Java Developer + Struts + spring + Brazil

We need to keep in mind that NLQ is different from Cognitive. This service will not understand questions, it will just extract terms. If you are looking for cognitive, you are looking for Watson (https://www.ibm.com/watson/developercloud/). With Watson we can understand the text and apply filter using location, range, etc. This also can be done using Machine Learning Models created at Watson Knowledge studio. But, Ill talk about this soon.

Backing to Query-Modifier, if you look at the folder nlq, inside Engine folder from your WEX installation, you will find the configuration stuff. Query Modifier work this way:

You make a request to WEX telling that you will use QM, the request pass through QM that apply the strategies, then, it forward the request to WEX Engine, who respond to you.

Here is a simple REST call that is using query-modifier:

http://MY_SERVER:9080/vivisimo/cgi-bin/velocity?v.app=api-rest&v.username=MY_USER&v.password=MY_PASSWORD&v.indent=true&v.function=query-search&fetch-timeout=30000&output-display-mode=limited&arena=MY_ARENA&output-contents-mode=list&syntax-operators=and+or+%28%29+CONTAINING+CONTENT+%25field%25%3A+%2B+NEAR+-+NOT+NOTCONTAINING+NOTWITHIN+OR0+quotes+regex+stem+THRU+BEFORE+FOLLOWEDBY+weight+wildcard+wildchar+WITHIN+WORDS+site+less-than+less-than-or-equal+greater-than+greater-than-or-equal+equal+range&sources=MY_COLLECTION+&output-contents=FIELD1+FIELD2&output-bold-contents=FIELD1&query=java+developer&query-condition-xpath=%24CONDITION_EXAMPLE=%27true%27&query-object=&num-per-source=20&start=0&num=20&query-modification-macros=enhance-query-with-querymodifier

See that the following make WEX use Query Modifier:

&query-modification-macros=enhance-query-with-querymodifier

In order to configure, go to <your WEX install folder>/Engine/nlq , in my case /opt/IBM/dataexplorer/WEX-11_0_1/Engine/nlq

Run “chmod +x querymodifier-install.sh”

Then “./querymodifier-install.sh” (as root)

You will see this kind of output:

Copying /opt/IBM/dataexplorer/WEX-11_0_1/Engine/examples/nlq/querymodifier/querymodifier-production.yml.defaults to /opt/IBM/dataexplorer/WEX-11_0_1/Engine/nlq/querymodifier-production.yml…

Configuring port to 9080…

Configuring path to vivisimo/cgi-bin/velocity…

Configuring PEARs path to /opt/IBM/dataexplorer/WEX-11_0_1/Engine/data/pears…

Copying querymodifier-2.1.9.jar to /opt/IBM/dataexplorer/WEX-11_0_1/Engine/nlq/querymodifier.jar…

Giving executable permissions to /opt/IBM/dataexplorer/WEX-11_0_1/Engine/nlq/querymodifier.jar…

Removing any existing /etc/init.d/querymodifier…

Linking /etc/init.d/querymodifier to …

Done.

Its important to change owner of the created files to WEX instance owner, in my case dataexp, so, as root: chown -R dataexp: <your WEX install folder>/Engine/nlq/

The configuration file is called querymodifier-production.yaml

In the first part of the file, you will see the WEX server setting, like IP, port and user.

After this you can setup the strategies, in my case I have this setup:

#The strategies to apply, by default, to each query. Can also be customized on a per-request basis (“workplan” GET parameter):

strategies:

default: PhraseWhitelistStrategy POSBasedNoiseWordRemoverStrategy DictionaryBasedNoiseWordRemoverStrategy DisjunctifyStrategy

The first strategy it the Disjunctify. It converts AND operators into OR operators, if the operator has more terms than a threshold. For example, if you set minimumRequiredTerms = 4, if user search for less terms than 4, query will be (A AND B AND C AND D), if you search for more than 4 terms, query will be (A OR B OR C OR D OR X OR …..).

The Dictionary-Based Noiseword Removal strategy, basically remove words from the query. For example, if you add BANANA to the list, then if user search for BANANA, it will be ignored. Usually we add to this section the common STOPWORDS, you can find several lists, I recommend use the google one. Another good list is here.

The Phrase Whitelist Strategy its interesting, you can have some external config files for some keyphrases, for example, lets suppose that you want that “Project Manager” be searched and “Project Manager”, and not “Project” and “Manager”, so, you need to add this word in the config file.

We have a secret here: you need to separate the words with <TAB> instead of space, else it will not work.

After configure your strategies, you just need to start the service (usually /etc/init.d/query-modifier start) and perform the REST Calls to test. You can follow the log at /var/log/querymodifier.log.

Every time that you change this setting, you need to recycle query modifier.

Your best friend to help with development and test, its the Api Runner interface from WEX engine. You can access this at:

http://YOUR_SERVER:9080/vivisimo/cgi-bin/velocity?v.app=api-run&v.function=query-parse-querymodifier

See the parameters there and ENJOY!

For more references: http://www.ibm.com/support/knowledgecenter/SS8NLW_11.0.1/com.ibm.watson.wex.fc.nlq.doc/c_wex_adding_nlq.html

Introdução a Big Data e Apache Solr

Para quem está interessado em Big Data e além disso quer algo prático utilizando Apache Solr, disponibilizo um conjunto de slides que podem ser utilizados por Estudantes, Professores e profissionais. Usem e distribuam a vontade!

Field x Facet

Many people who start to work with Apache Solr or Watson Explorer have the first primary doubt about What is the difference from Field against Facet.

We can simple define that the field represent the indexed data, and the facet its such a GROUP, combined usually by volume. As described at Solr documentation “faceting is the arrangement of search results into categories based on indexed terms”.

Its possible to configure FACETS with many options like sort, searchable, etc. Each product have its possibilities.

Think that when you perform a query, you will use FACETs in the “Where”, something like this:

Select FIELDS from COLLECTION where FACET MY_FACET = ‘dummy’;

This will make you understand better!

Enjoy!

Enabling Wildcard in a collection at Watson Explorer

Eventually we need to enable search using wildcards like * for a collection at Watson Explorer. For sure this can make our queries consume more CPU and Memory, you can think comparing a query that perform a “select … where field = ‘XXX'” against a query that perform a “select …. where field like ‘*XXX'” (pseudo code). What will be faster? So, think carefully before enable this!

To enable, go to your collection configuration -> Indexing -> Term expansion support (4)  , and check Generate Dictionaries.

Selection_356

For more information, check here and here.

Enjoy!

 

Finding bottlenecks at Watson Explorer queries

If you are having problem with some Watson Explorer query, an excellent way to find bottlenecks is to perform the query with Debug and Profile options enabled, it will help you to find where exactly you have problems.

Usually, when you perform a query at WEX, you call some URL like the following (in my case port is 7205, MY_COLLECTION can be a shard, for example MY_COLLECTION_1_1):

<SERVER>:<PORT>/search?collection=MY_COLLECTION&query-xml=<%3fxml version%3d”1.0″ encoding%3d”UTF-8″%3f><operator logic%3d”and”%2f>&num=1&max=1&binning-mode=normal&start=0&show-duplicates=1&doc-axl=<%3fxml version%3d”1.0″ encoding%3d”UTF-8″%3f><document key-hash%3d”{vse%3adoc-hash()}”%2f>&binning-config=<%3fxml version%3d”1.0″ encoding%3d”UTF-8″%3f><binning-sets><binning-set bs-id%3d”VENDOR” logic%3d”or” max-bins%3d”8″ select%3d”%24VENDOR”%2f><binning-set bs-id%3d”REVENUE_USD_FACET” logic%3d”or” max-bins%3d”11″ select%3d”%24REVENUE_USD_FACET”%2f>……………field%3d”SERVICE_AREA”><field-to name%3d”SERVICE_AREA”%2f><%2ffield-map><field-map field%3d”MAX_IGS_REV_OM_BRAND_CD”><field-to name%3d”MAX_IGS_REV_OM_BRAND_CD”%2f><%2ffield-map><field-map field%3d”EMAIL_SENT”><field-to name%3d”EMAIL_SENT”%2f><%2ffield-map><field-map field%3d”REVENUE_USD_FACET”><field-to name%3d”REVENUE_USD_FACET”%2f><%2ffield-map><field-map field%3d”REVENUE”><field-to name%3d”REVENUE”%2f><%2ffield-map><field-map field%3d”CLIENT_NAME”><field-to name%3d”CLIENT_NAME”%2f><%2ffield-map><%2ffield-mapping>&sort-keys=1&score=1&shingles=0&summarize=0&gen-key=0&cache-data=0&force-binning=1&output-acls=1

If you don’t have IDEA about HOW to get the query that your Application is doing, you can enable Debug at your collection. Go to WEX console, under Configuration -> Searching -> Debugging and enable Query Logging.

Selection_355

When saved, it will start to generate log in a file called queries.log, under you collection folder, some place like:

/opt/IBM/dataexplorer/WEX-11/Engine/data/search-collections/YYY/MY_COLLECTION/crawl1/

You can check it at WEX console, under your collection configuration, tab META, field Filebase.

Ok, now, if you call this URL from your browser, appending “&debug=1&profile=1″ to the URL, you will got a XML file. Download it and lets analyze. For our case, see this:

<xpath-performances>
<xpath-performance xpath=”($FIELD_X) = ‘GBS – No’ or ($FIELD_X) = ‘GBS – Yes'” slow-ms=”10295″ n-slow=”192000″ n-fast=”0″ n-direct=”0″ n-hashes=”1″ />
</xpath-performances>

THIS tell me that JUST in order to get the field FIELD_X, I’m having slow! (I’m my case it is because my Field its an Array)

So, probably I have a problem with this field, that can be a lot, for example:
1- Null values (see my other posts)
2- Its an array to index
3- Its a long text field
4- You have a lot of possible statements using it (OR, AND, WHERE, etc)

With this information, you can go to next step, that is find a way to change the field and make it work better.

Important: I tested this with Watson Explorer 9, 10 and 11. Running at Linux Machines.

Enjoy!

Watson Explorer performance decrease with null values

Working with Watson Explorer (WEX) we saw that the search performance decrease a lot when you have null values for some field/facet. (Our WEX release at this moment is 11, we run at Linux machines and our application was written in Java, using BigIndex to index and search. (Also have pure REST version of our application in test and the problem still happen)).

For example: lets suppose that you have a facet called VENDOR in an entity called Product. Suppose that you have 5 millions Products indexed and for some of then you have nulls, in my case 2 Millions have NULL values for VENDOR field.

In this case (and similar ones), we notice a performance decrease in searches. We start to see problems when the relation of null are greater than 20%.

In order to solve the problem we have 2 options:

1- One technique we’ve used for certain dimensions is to always ensure non-null values in the index — so at index time, we either coalesce in our SQL pulls from DB2 or do transformation after ingestion to replace nulls with some predefined value. In our case we use the literal string “(no value available)”.  It: a) ensures non-null values, b) is fairly meaningful to users, and c) gives users a way to actually filter on those records if needed.

2- For some FIELDS, we can not add another values, must leave null (Business reasons) and must not show null option to user select in the facet. In this case, in the moment of search we append boolean($FIELD) to the query. For example:

General example of a slow facet query:
$VENDOR=’IBM or $VENDOR=’PEPSICO’
Rewritten query that solve the slowness:
boolean($VENDOR) and ($VENDOR=’IBM or $VENDOR=’PEPSICO’)
This way, it ignore the null values when searching and it will be very fast.
Maybe this is not the final solution and for newer WEX versions it will handle better with null values, but, this was the solution for us.
Enjoy!