Crawling a database to feed a collection is a very common task, and IBM Watson Explorer makes it easy. In this example, I'll create a collection and run a simple query against an IBM DB2 database. The steps are very similar for other databases; just keep in mind that you will need the correct driver.
1- Put the driver in place:
Get the JDBC driver for your database and put it in the correct folder, usually something like /opt/IBM/dataexplorer/WEX-11_0_2/Engine/lib/java/database/.
2- Create the collection, copying defaults from default:
3- Add a new seed; this is where your collection will get its data from:
4- Choose Database:
5- Enter your database settings and the query that will be performed:
6- It's done; now you can test:
7- This can take a while depending on your query and connection, but when it finishes, it will show some of the rows that the query returned in the following format. To see the row data, click Crawler XML:
8- Here is your data:
9- Now that we can see it's working, you can start your crawl. This step feeds your collection and can take a long time depending on the amount of data:
10- You should see crawl activity:
11- You can now query your collection to test it: just enter your term and click Search in the options on the left:
12- You will see something like this:
That's it, you have created a collection that gets its data from a database!
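Step 1 (putting the driver in place) is the one part of this walkthrough that happens outside the admin tool, so it can be scripted. The following is only a minimal sketch: the function name install_driver is hypothetical, the jar file name is a placeholder, and the target folder is the example path from step 1, which may differ in your installation.

```python
import shutil
from pathlib import Path


def install_driver(jar_path, driver_dir):
    """Copy a JDBC driver jar into the folder where the engine looks for drivers.

    Both arguments are supplied by the caller; nothing here is specific to
    Watson Explorer beyond the folder you choose to pass in.
    """
    jar = Path(jar_path)
    target_dir = Path(driver_dir)

    if jar.suffix.lower() != ".jar":
        raise ValueError(f"not a jar file: {jar}")
    if not jar.is_file():
        raise FileNotFoundError(f"driver not found: {jar}")
    if not target_dir.is_dir():
        raise FileNotFoundError(f"driver folder not found: {target_dir}")

    destination = target_dir / jar.name
    shutil.copy2(jar, destination)  # copy2 also preserves timestamps
    return destination


# Example call (placeholder jar name; folder taken from step 1):
# install_driver("my-jdbc-driver.jar",
#                "/opt/IBM/dataexplorer/WEX-11_0_2/Engine/lib/java/database/")
```

The checks fail early with a clear message, which is usually easier to debug than a crawler error about a missing driver class.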
Ontolection Trainer is a handy utility that anyone using ontolections to improve queries in Watson Explorer should know. It helps us analyze a body of text and create thesaurus files, which can then be used to create ontolections. You can also extract key phrases or acronyms to use with the query modifier and in some ontolections.
If you don't know the NLQ capabilities of Watson Explorer (WEX), or don't know what an ontolection is, I recommend that you read my two posts:
Back to Ontolection Trainer: in the NLQ folder of your WEX installation (/opt/IBM/dataexplorer/WEX-11_0_2/Engine/nlq in my case), since release 11.0.1, you can find the JAR file ontolectiontrainer.jar. You will need Java to run it, so make sure the Java from the WEX installation is configured in your path.
The utility has several arguments, but the basic ones are:
- the type of extraction
- the corpus that you will use: The corpus is your text file. In my case, I have a file with 1000 resumes that I'll use to train WEX (RESUME_TEXT_1000.TXT).
- the pear file: The pear file contains the dictionary that the trainer will use to extract terms.
- the output path: Where the output file will be created.
I also used a file called blacklist, containing the words that I want to be ignored.
You may run into problems with CPU and memory utilization; for these cases, there are parameters to set the number of iterations that the trainer will perform.
To be concrete, here are my commands:
- To extract the ontolection:
java -jar ontolectiontrainer.jar --trainOntolection --corpus RESUME_TEXT_1000.TXT --pear /opt/IBM/dataexplorer/WEX-11_0_2/Engine/data/pears/en.pear --blacklist blacklist --outputPath generatedOntolection_1000
- To extract Acronyms:
java -jar ontolectiontrainer.jar --extractAcronyms --corpus RESUME_TEXT_1000.TXT --pear /opt/IBM/dataexplorer/WEX-11_0_2/Engine/data/pears/en.pear --blacklist blacklist --outputPath generatedOntolectionAcronyms_1000
- To extract Phrases:
java -jar ontolectiontrainer.jar --learnPhrases --corpus RESUME_TEXT_1000.TXT --pear /opt/IBM/dataexplorer/WEX-11_0_2/Engine/data/pears/en.pear --blacklist blacklist --outputPath generatedOntolectionPhrases_1000
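The three commands above differ only in the extraction type and the output path, so they can be generated from one small helper. This is just a sketch: the function build_trainer_command is hypothetical, it only assembles the argument list and does not run the trainer, and the double-hyphen option prefix is an assumption (the dashes in the commands as published appear to have been altered by the blog's typography).

```python
# Extraction types taken from the three commands shown in this post.
EXTRACTION_TYPES = ("trainOntolection", "extractAcronyms", "learnPhrases")


def build_trainer_command(extraction, corpus, pear, output_path,
                          blacklist=None, jar="ontolectiontrainer.jar"):
    """Return the argument list for one Ontolection Trainer run.

    Assumption: options are prefixed with "--". Adjust the prefix if your
    version of the utility expects something else.
    """
    if extraction not in EXTRACTION_TYPES:
        raise ValueError(f"unknown extraction type: {extraction}")

    command = ["java", "-jar", jar, f"--{extraction}",
               "--corpus", corpus,
               "--pear", pear]
    if blacklist is not None:
        command += ["--blacklist", blacklist]
    command += ["--outputPath", output_path]
    return command


if __name__ == "__main__":
    pear = "/opt/IBM/dataexplorer/WEX-11_0_2/Engine/data/pears/en.pear"
    for extraction, output in [
        ("trainOntolection", "generatedOntolection_1000"),
        ("extractAcronyms", "generatedOntolectionAcronyms_1000"),
        ("learnPhrases", "generatedOntolectionPhrases_1000"),
    ]:
        args = build_trainer_command(extraction, "RESUME_TEXT_1000.TXT",
                                     pear, output, blacklist="blacklist")
        print(" ".join(args))
```

The list can be passed to subprocess.run as-is, which avoids shell quoting problems if a corpus or output path contains spaces.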
For more reference:
For anyone interested in Big Data who also wants something practical using Apache Solr, I'm sharing a set of slides that can be used by students, teachers, and professionals. Feel free to use and distribute them!
Sometimes we need to enable wildcard searches, such as *, for a collection in Watson Explorer. This can certainly make queries consume more CPU and memory: compare a query that performs “select … where field = ‘XXX’” against one that performs “select … where field like ‘*XXX’” (pseudocode). Which will be faster? So think carefully before enabling this!
To enable it, go to your collection configuration -> Indexing -> Term expansion support (4), and check Generate Dictionaries.
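To make the cost comparison above more concrete, here is a toy sketch of exact versus wildcard lookup. It does not show how Watson Explorer implements term expansion; the index, its contents, and the function names are all invented for illustration. The point is only that an exact term is a single lookup, while a wildcard pattern has to be tested against every term in the vocabulary.

```python
from fnmatch import fnmatchcase

# Toy inverted index: term -> ids of the documents containing it.
# The data is made up for this example.
INDEX = {
    "watson": {1, 2},
    "explorer": {1},
    "database": {2, 3},
    "datastore": {3},
}


def exact_search(term):
    """A single dictionary lookup, whatever the vocabulary size."""
    return set(INDEX.get(term, set()))


def wildcard_search(pattern):
    """Test every term against the pattern and merge the matching postings.

    Returns the matching document ids and the number of terms examined.
    """
    docs = set()
    terms_checked = 0
    for term, doc_ids in INDEX.items():
        terms_checked += 1
        if fnmatchcase(term, pattern):
            docs |= doc_ids
    return docs, terms_checked


if __name__ == "__main__":
    print(exact_search("database"))
    print(wildcard_search("data*"))
    print(wildcard_search("*store"))
```

With four terms the difference is invisible, but the wildcard path grows with the size of the vocabulary, which is why it is worth thinking before enabling it on a large collection.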
Watson Analytics is a tool that lets us analyze large volumes of data (big data). In general terms: you define the data sources, the tool scans them and performs a contextual analysis, and it prepares your data to be studied. It is worth noting that you can have any number of data sources, of the most varied kinds (spreadsheets, databases, URLs, etc.).
Anyone can play with the tool, which is available at http://watsonanalytics.com/
I made a very simple video, in Portuguese, showing how to upload a spreadsheet and carry out a simple study. You can watch it just below. Watson has extensive documentation and countless videos on the Internet. They are worth a look.
It had been a long time since I last recompiled a kernel and, to tell the truth, I had even forgotten some of the steps (on Ubuntu)…
I did it today and refreshed my memory by reading a great post by Alexandro Silva, which can be found here: http://penguim.wordpress.com/2006/11/14/compilando-o-kernel-no-ubuntu-linux/