Extracting Knowledge From Genomic Experiments By Incorporating the Biomedical Literature
James P. Sluka, Ph.D.
InPharmix Incorporated
[email protected]
(317)-422-1464
InPharmix Inc.

Value and Size of the Biomedical Literature
Cost to sequence the Human Genome: ~$3B
Yearly expenditures for biomedical research: >$100B
Size of GENBANK: ~13GB
Size of literature abstracts (MEDLINE): ~12GB
70 Bioinformatics Companies†
4 Scientific Text Mining Companies†
†From: http://www.phrma.org/ and http://123genomics.homestead.com/files/companies.html

What is needed: Tools to integrate Genomics, Bioinformatics and the Literature
[Diagram labels: Sequence Data (e.g., GENBANK); Genomics Data (Gene Lists); Scientific Literature (Abstracts and Papers); User Guidance, Heuristics or NLP; Relationships between entities; Rational framework for the understanding of biological process or disease]

Dataset
We have analyzed a subset of the NCI-60 cancer gene expression database.
The initial set consisted of the expression data for the full 9,703 genes for the three leukemia cell lines in the NCI database: CCRF-CEM, MOLT-4 and K-562.
CCRF-CEM and MOLT-4 are from acute lymphoblastic leukemias (ALL), whereas K-562 represents acute myelogenous leukemia (AML). The K-562/AML data was divided by the average for the two ALL lines in order to reduce the influence of genes characteristic of leukocytic cell lines. The resulting data is similar to the data set of Golub et al. used for CAMDA-2000. The modified expression values were sorted and the 250 most highly expressed genes were used as the initial data set. From these 250 genes we removed unnamed genes, including ESTs, KIAAs and genes annotated as "similar to" another gene, resulting in a final list of 160 named genes.
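The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the function name `top_named_genes`, the input layout (a dictionary of per-gene expression triples) and the regular expression used to spot ESTs, KIAA entries and "similar to" annotations are all assumptions made for the example.

```python
import re

# Hypothetical pattern for "unnamed" entries: ESTs, KIAA clones,
# and genes annotated as "similar to" another gene.
UNNAMED = re.compile(r"^ESTs?\b|^KIAA\d+|\bsimilar to\b", re.IGNORECASE)


def top_named_genes(expr, top_n=250):
    """Rank genes by K-562 expression relative to the mean of the two ALL lines.

    expr maps a gene name to a (ccrf_cem, molt4, k562) tuple of expression
    values (an assumed layout). Returns the named genes among the top_n.
    """
    ratios = {}
    for gene, (ccrf_cem, molt4, k562) in expr.items():
        all_mean = (ccrf_cem + molt4) / 2.0
        if all_mean > 0:  # skip genes with no usable ALL baseline
            ratios[gene] = k562 / all_mean
    # Sort by ratio, highest first, and keep the top_n genes.
    ranked = sorted(ratios, key=ratios.get, reverse=True)[:top_n]
    # Drop unnamed entries from the ranked list.
    return [gene for gene in ranked if not UNNAMED.search(gene)]
```

With `top_n=250` on the full 9,703-gene table, the returned list would correspond to the named subset of the 250 highest-ranked genes; the exact count depends on how names are annotated in the input.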
In addition, we included a term for the disease, Acute Myelogenous Leukemia (AML). As our literature database, we used MEDLINE accessed via Entrez.

Scherf, U. et al., "A gene expression database for the molecular pharmacology of cancer", Nature Genetics, 24:3 (2000), 236-44.
Golub, T.R. et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, 286 (1999), 531-537.

Assigning Names
The first step in the analysis is to assign names for each gene that are suitable for searching in MEDLINE.
In this case, the original names are those that appear in the NCI-60 database. Since these names tend to be brief, cryptic or outdated, some work was needed to verify or correct them. To assign the best possible name to each gene we used keyword and/or BLAST searches across several databases (GENBANK, OMIM, GDB and GeneCards).

PDQ_MED Algorithm I
The basic input to PDQ_MED is a list of query terms encompassing the genes, proteins, diseases or other concepts under investigation.
An individual query term can consist of more than one version of a particular name, for example: Interleukin-1b, IL-1b, IL1beta… In addition, the user may explicitly join phrases with any of the boolean operators, or use any of the field or date operators supported by MEDLINE, for example: ZAG BUTNOT ZIG. Searches are carried out by constructing suitable Entrez URLs for all possible pairwise combinations of the query terms. The URLs are then submitted via the web and the search results captured and analyzed by PDQ_MED.
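As a rough illustration of the pairwise URL construction, the sketch below builds one search URL per pair of query terms, OR-ing the name variants within a term and AND-ing the two terms. It is not PDQ_MED's code: the helper names are hypothetical, and it targets the current NCBI E-utilities `esearch.fcgi` endpoint, which is an assumption and may differ from the Entrez URLs the tool originally used. It only builds URLs and does not submit them.

```python
from itertools import combinations
from urllib.parse import urlencode

# Assumed endpoint: the present-day E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_group(synonyms):
    """Join the name variants of one query term into a quoted OR group."""
    return "(" + " OR ".join('"%s"' % s for s in synonyms) + ")"


def pairwise_urls(query_terms):
    """Build one search URL for every unordered pair of query terms.

    query_terms maps a label to its list of name variants (assumed layout).
    Returns a dict keyed by (label_a, label_b).
    """
    urls = {}
    for (label_a, syn_a), (label_b, syn_b) in combinations(query_terms.items(), 2):
        term = or_group(syn_a) + " AND " + or_group(syn_b)
        urls[(label_a, label_b)] = ESEARCH + "?" + urlencode(
            {"db": "pubmed", "term": term}
        )
    return urls
```

For n query terms this yields n(n-1)/2 URLs, so a list of 160 genes plus the disease term would produce 12,880 pairwise searches.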

[Figure: PDQ_MED Input Page (partial)]

PDQ_MED Algorithm II: Proximity Searching and Local Acronyms
A refinement to the basic search strategy is to require a higher degree of "dependence" (closer proximity within the document) between two query terms. In "Proximity" searching, PDQ_MED examines all abstracts containing two terms and determines whether the terms co-occur in the same sentence. Sentence-level proximity searching is not supported by MEDLINE.
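A sentence-level co-occurrence check of this kind might look like the sketch below. It is only an approximation under stated assumptions: the sentence splitter is a naive regular expression (it will mis-split on abbreviations such as "et al."), the function names are hypothetical, and nothing here reflects how PDQ_MED actually segments abstracts.

```python
import re


def split_sentences(abstract):
    """Naive splitter: break after ., ! or ? when followed by a capital letter."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", abstract) if s]


def mentions(sentence, synonyms):
    """True if any name variant appears in the sentence as a whole token."""
    for name in synonyms:
        # Lookarounds stop "AML" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, sentence, re.IGNORECASE):
            return True
    return False


def co_occur_in_sentence(abstract, syn_a, syn_b):
    """True if some single sentence mentions both query terms."""
    return any(
        mentions(sentence, syn_a) and mentions(sentence, syn_b)
        for sentence in split_sentences(abstract)
    )
```

An abstract that passes the plain pairwise search but fails this check mentions both terms only in different sentences, which is the weaker form of "dependence" that proximity searching is meant to filter out.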
Acronyms make proximity