Natural Language Query In The Biomedical Domain Based On The Cognition Search

 
                                                                                                                                                                                                                         

Department of Biochemistry
UT Southwestern Medical Center,
5323 Harry Hines Blvd.,
Dallas, TX 75390

Cognition Technologies Inc.
6133 Bristol Parkway, Suite 350,
Culver City, CA 90230
http://medline.cognition.com/

 

The Goldsmith lab is developing an improved search of Medline biomedical content in collaboration with Cognition Technologies, Inc. of Culver City, CA, using the company's natural language technology. Several algorithms in this software improve user access to Medline by raising both precision and recall.

 

Cognition’s Semantic Natural Language Processing (NLP) “understands” word and phrase meanings in modern computer applications. The software's architecture lets it determine the intended sense of an ambiguous word. It also retrieves synonyms and daughters (more specific terms) of words. Several methods are used to improve precision; two of the most powerful are concept clustering and ranking of output by relevancy. Encoded ontological relationships (downward reasoning) and synonym relationships improve recall.
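The recall side of this description, retrieving synonyms and daughters of a query word, can be illustrated with a minimal sketch. It is not Cognition's patented algorithm: the two lookup tables, their contents, and the function name `expand_term` are hypothetical and chosen only to show what "downward reasoning" means, namely that a term expands to its synonyms and to everything below it in the ontology, but never to its parents.

```python
from collections import deque

# Hypothetical toy data for illustration only; not Cognition's dictionary.
SYNONYMS = {
    "neoplasm": {"tumor", "tumour"},
}
DAUGHTERS = {
    "neoplasm": {"carcinoma", "sarcoma"},
    "carcinoma": {"adenocarcinoma"},
}


def expand_term(term):
    """Return the term plus its synonyms and all of its descendants."""
    expanded = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in expanded:
            continue  # already visited
        expanded.add(current)
        # Follow synonym links and downward (parent-to-daughter) links only.
        queue.extend(SYNONYMS.get(current, ()))
        queue.extend(DAUGHTERS.get(current, ()))
    return expanded


print(sorted(expand_term("neoplasm")))
print(sorted(expand_term("carcinoma")))
```

In this sketch a query on "neoplasm" also matches documents that mention only "adenocarcinoma", while a query on "carcinoma" does not widen to every neoplasm, which is how expansion can raise recall without giving up precision.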

 

At the most basic level, Cognition software performs pattern matching when the query is a single word. When the query contains a phrase or a question, however, linguistic reasoning and other algorithms come into play. For example, the query “Genetic correlates of alcoholism” contains three concepts, and it retrieves ONLY documents related to genetic correlates of alcoholism. The high precision in this case comes from relevancy ranking.

 

One of the most daunting tasks in building a natural language understanding system is constructing a semantic map and a dictionary that detail the syntactic behavior of words (i.e., how words behave in context). Cognition’s search technology and dictionary have been under development for several years, and the dictionary now contains over 500,000 words and phrases of English and medicine. We are providing Cognition with information on web-based sources of knowledge about gene and protein names, focusing on well-curated websites, and are also checking language entries and synonymy and debugging the dictionary. So far we have added 40,000 synonym classes and 2,500 ontological relationships. The words most frequently used in Medline but unknown to the software were added by hand.

 

Future work will focus on improving coverage of words in Medline, building a more complete tree of life, and adding drug and chemical names. The present augmentation already provides excellent precision in retrievals, even when one or more of the query terms is not in the dictionary (“trypanosoma brucei drug targets”, for example).

 

The work is presently being carried out by Mr. Saurabh Mendiratta, who has a master’s degree in Cell and Molecular Biology from UT Dallas. Mr. Mendiratta is helping us with curation and with setting up full-text search at UT Southwestern. The natural language query implemented by Cognition Search is based on a "meaning" algorithm patented by Kathleen Dahlgren, the founder and Chief Technical Officer of Cognition Technologies Inc.

 

A manuscript entitled “Natural Language Query in the Biochemistry and Molecular Biology Domains Based on Cognition Search™” is available at http://hhmi.swmed.edu/Labs/bg/cognition_manuscript. It has been accepted for presentation at the AMIA conference to be held in Washington, DC in November 2008.

 

 

Head to Head Comparison of Cognition and PubMed

Cognition vs Medline Search

                                        Cognition                 PubMed
Query                                   good/20  bad/20  total    good/20  bad/20  total
genetic correlates of alcoholism             16       4   1436          6      14     44
DNA repair and aging                         13       7   1220         11       9   1265
drugs for fibromyalgia                       15       5   1484          9      11    220
genetic correlates of prostate cancer        15       5   2301         13       7     60
genetic interactions of BCL2                 14       6    876          8      11     19
oxidative stress in plants                   15       5   3122          9      11   3197
spectroscopy of amidohydrolases              15       5    861          7      13   1142
enzyme activities of virulence factors       15       5     42          6      14    651
benzene induced neuropathy                   14       6    220          6       1      7
insulin secretion induced by PACAP           15       5     50         15       5     39
birth defects from glycol ether              14       6     20         13       7     61
depression in aging                          17       3  13381          7      13   3658

 

             Cognition   Medline
Precision    0.74        0.45
Recall*      0.99        0.46

* assuming the total is the sum of good retrievals in top 20 or top 10 of Cognition and PubMed.
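As a rough illustration of how a precision figure relates to the per-query good/bad counts in the head-to-head comparison, the sketch below pools the judged results across all twelve queries. Pooling is an assumption made here: the page does not state how its summary figures were averaged, so values computed this way need not match them exactly. The function name `pooled_precision` is hypothetical; the counts are copied from the comparison table.

```python
# (good, bad) counts among the judged top results for each of the
# twelve queries, in table order, copied from the comparison table.
COGNITION = [(16, 4), (13, 7), (15, 5), (15, 5), (14, 6), (15, 5),
             (15, 5), (15, 5), (14, 6), (15, 5), (14, 6), (17, 3)]
PUBMED = [(6, 14), (11, 9), (9, 11), (13, 7), (8, 11), (9, 11),
          (7, 13), (6, 14), (6, 1), (15, 5), (13, 7), (7, 13)]


def pooled_precision(counts):
    """Fraction of judged results rated good, pooled over all queries.

    Assumption: pooling (rather than averaging per-query precision)
    is one plausible reading, not necessarily the one used on this page.
    """
    good = sum(g for g, _ in counts)
    judged = sum(g + b for g, b in counts)
    return good / judged


print(f"Cognition: {pooled_precision(COGNITION):.2f}")
print(f"PubMed:    {pooled_precision(PUBMED):.2f}")
```

Note that two PubMed queries returned fewer than 20 documents, so the number of judged results differs between the two systems; dividing by the judged count rather than a fixed 20 keeps such queries from being penalized for short result lists.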