Natural Language Query In The Biomedical Domain Based On The Cognition Search

Department of Biochemistry, UT Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX-75390
Cognition Technologies Inc., 6133 Bristol Parkway, Suite 350, Culver City, CA 90230
http://medline.cognition.com/
The Goldsmith lab is developing an improved search of Medline biomedical content in collaboration with Cognition Technologies, Inc. of Culver City, CA, using Cognition’s natural language technology. Several algorithms in this software improve user access to Medline by raising both precision and recall.
Cognition’s Semantic Natural Language Processing (NLP) “understands” word and phrase meanings in modern computer applications. The software’s architecture lets it determine the intended sense of an ambiguous word. It also retrieves synonyms and daughters (more specific terms) of the query words. Several methods are used to improve precision; two of the most powerful are concept clustering and ranking of output by relevancy. Encoded ontological (downward reasoning) and synonym relationships, in turn, improve recall.
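To illustrate how synonym and downward ontological relationships can raise recall, here is a minimal Python sketch of query-term expansion. It is not Cognition’s implementation: the `SYNONYMS` and `DAUGHTERS` tables, their entries, and the `expand_term` function are hypothetical and chosen only for the example.

```python
# Toy illustration only: these tables are assumptions made for this sketch
# and are not taken from Cognition's dictionary.
SYNONYMS = {
    "alcoholism": {"alcohol dependence"},
    "alcohol dependence": {"alcoholism"},
}

# Parent term -> more specific "daughter" terms.
DAUGHTERS = {
    "neoplasm": {"carcinoma", "sarcoma"},
    "carcinoma": {"adenocarcinoma"},
}


def expand_term(term):
    """Return the term plus its synonyms and all daughters, transitively."""
    expanded = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited; also guards against synonym cycles
        expanded.add(current)
        stack.extend(SYNONYMS.get(current, ()))
        stack.extend(DAUGHTERS.get(current, ()))
    return expanded


broad = expand_term("neoplasm")         # includes every daughter term
narrow = expand_term("adenocarcinoma")  # stays narrow: reasoning is downward only
```

In this sketch a search on the broad term also matches documents that mention only a daughter term, while a search on the narrow term is not widened to its parents.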
At the most basic level, Cognition software performs pattern matching when the query is a single word. However, when the query contains a phrase or a question, linguistic reasoning and other algorithms come into play. For example, the query “Genetic correlates of alcoholism” contains three concepts, and it retrieves ONLY documents related to genetic correlates of alcoholism. The high precision in this case comes from relevancy ranking.
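One simple way to picture relevancy ranking over a multi-concept query is to order documents by how many of the query’s concepts they mention. The Python sketch below does only that. The way Cognition actually ranks results is not described here, so the scoring rule, the substring matching, the choice of the three concepts, and the sample documents are all assumptions made for illustration.

```python
def rank_by_concept_coverage(concepts, documents):
    """Order documents by how many query concepts they mention (most first)."""
    scored = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Crude stand-in for concept matching: case-insensitive substring test.
        hits = sum(1 for concept in concepts if concept in lowered)
        if hits:
            scored.append((hits, doc_id))
    # Sort by descending hit count, then by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical concepts for the query "Genetic correlates of alcoholism".
concepts = ["genetic", "correlate", "alcoholism"]

# Hypothetical document titles, not real Medline records.
documents = {
    "a": "Genetic correlates of alcoholism in twin studies",
    "b": "Alcoholism treatment outcomes",
    "c": "Genetic markers in plants",
    "d": "Oxidative stress in plants",
}

ranking = rank_by_concept_coverage(concepts, documents)
```

Document "a" covers all three concepts and is ranked first, the two documents that cover one concept each follow, and the document that covers none is dropped.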
One of the most daunting tasks in building a natural language understanding system is building a semantic map and a dictionary that detail the syntactic behavior of words (i.e., how words behave in context). Cognition’s search technology and dictionary have been under development for several years, and the dictionary now contains over 500,000 words and phrases of English and medicine. We are providing Cognition with information about web-based sources of knowledge on gene and protein names, focusing on well-curated websites, and we are also checking language entries and synonymy and debugging the dictionary. So far we have added 40,000 synonym classes and 2,500 ontological relationships. The most frequently used words in Medline that were unknown to the software were added by hand.
Future work will focus on improving coverage of words in Medline, a more complete tree of life, and drug and chemical names. The present augmentation already provides excellent precision in retrievals, even when one or more of the query terms are not in the dictionary (“trypanosoma brucei drug targets”, for example).
The work is presently being carried out by Mr. Saurabh Mendiratta, who has a master’s degree in Cell and Molecular Biology from UT Dallas. Mr. Mendiratta is helping us with curation and with setting up full-text search at UT Southwestern. The natural language query implemented by Cognition Search is based on a "meaning" algorithm patented by Kathleen Dahlgren, the founder and Chief Technical Officer of Cognition Technologies Inc.
A manuscript entitled “Natural Language Query in the Biochemistry and Molecular Biology Domains Based on Cognition Search™” is available at http://hhmi.swmed.edu/Labs/bg/cognition_manuscript. It has been accepted for presentation at the AMIA conference to be held in Washington, DC, in November 2008.
Head to Head Comparison of Cognition and PubMed

Cognition vs Medline Search

| Query | Cognition good/20 | Cognition bad/20 | Cognition total | PubMed good/20 | PubMed bad/20 | PubMed total |
|---|---|---|---|---|---|---|
| genetic correlates of alcoholism | 16 | 4 | 1436 | 6 | 14 | 44 |
| DNA repair and aging | 13 | 7 | 1220 | 11 | 9 | 1265 |
| drugs for fibromyalgia | 15 | 5 | 1484 | 9 | 11 | 220 |
| genetic correlates of prostate cancer | 15 | 5 | 2301 | 13 | 7 | 60 |
| genetic interactions of BCL2 | 14 | 6 | 876 | 8 | 11 | 19 |
| oxidative stress in plants | 15 | 5 | 3122 | 9 | 11 | 3197 |
| spectroscopy of amidohydrolases | 15 | 5 | 861 | 7 | 13 | 1142 |
| enzyme activities of virulence factors | 15 | 5 | 42 | 6 | 14 | 651 |
| benzene induced neuropathy | 14 | 6 | 220 | 6 | 1 | 7 |
| insulin secretion induced by PACAP | 15 | 5 | 50 | 15 | 5 | 39 |
| birth defects from glycol ether | 14 | 6 | 20 | 13 | 7 | 61 |
| depression in aging | 17 | 3 | 13381 | 7 | 13 | 3658 |

| | Cognition | Medline |
|---|---|---|
| Precision | 0.74 | 0.45 |
| Recall* | 0.99 | 0.46 |

* assuming the total is the sum of good retrievals in top 20 or top 10 of Cognition and PubMed.
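As a rough check on the precision figures, the good/20 counts in the comparison table can be pooled across the twelve queries. The Python sketch below assumes that precision means good retrievals divided by 20 judged results per query, summed over all queries. The page does not state how its figures were averaged, so this is one plausible reading, and the function and variable names are illustrative. The recall figures are not reproduced here because the footnote does not fully specify that calculation.

```python
# "good/20" counts per query, in the row order of the comparison table.
COGNITION_GOOD = [16, 13, 15, 15, 14, 15, 15, 15, 14, 15, 14, 17]
PUBMED_GOOD = [6, 11, 9, 13, 8, 9, 7, 6, 6, 15, 13, 7]


def pooled_precision(good_counts, judged_per_query=20):
    """Good retrievals divided by all judged results, pooled over queries.

    Assumes every query had `judged_per_query` results judged, which is a
    simplification: two PubMed queries returned fewer than 20 results.
    """
    return sum(good_counts) / (judged_per_query * len(good_counts))


cognition_precision = pooled_precision(COGNITION_GOOD)  # 178 / 240
pubmed_precision = pooled_precision(PUBMED_GOOD)        # 110 / 240

print(f"Cognition: {cognition_precision:.3f}")
print(f"PubMed:    {pubmed_precision:.3f}")
```

Under this assumption the pooled value for Cognition is 178/240, which rounds to the 0.74 in the table, and the pooled value for PubMed is 110/240, close to the 0.45 in the table.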