A basic task in biological research is to uncover or validate the functions of genes, such as candidate genes from a genetic screen and differentially expressed genes from microarray or RNA-seq experiments. A quick way of inferring functions is to use gene function enrichment analysis tools, such as DAVID and BiNGO, which infer overrepresented functions in a gene set from Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) curated terms and pathways. However, the drawback of GO/KEGG-related functional association analyses is that both GO and KEGG only maintain a controlled vocabulary of terms, which prevents them from analyzing genes that do not have GO/KEGG annotations.

PubMed is the largest biomedical knowledgebase; it comprises over 21 million abstracts and is growing rapidly. The PubMed abstracts contain all the essential information of the papers and are therefore an important resource for text mining. In addition to ranking articles or predicting Protein-Protein Interactions (PPIs), the scientific literature is also widely used to interpret the functions of a gene set. Here, we have developed an application program called “CoCiter” that is able to evaluate the significance of co-citation for any gene set from the 8,077,952 genes in the National Center for Biotechnology Information (NCBI) Entrez gene database, by using a text mining approach against the up-to-date Medical Literature Analysis and Retrieval System Online (MEDLINE) literature database. CoCiter can evaluate the significance of co-citation for two types of queries: 1) a query gene set with any pre-defined/manually-curated gene set, e.g. gene sets from GO/KEGG; 2) a query gene set with any user-defined free term set. We demonstrate that CoCiter provides a flexible and more precise approach to analyzing gene set functions, compared with traditional function enrichment analysis.

HomoloGenes for Homo sapiens, Mus musculus, Drosophila melanogaster and Caenorhabditis elegans were downloaded from the NCBI FTP site in October 2011. To find the co-citation literature related to a gene, Caipirini and Martini use the “gene2pubmed” dataset (a manually curated gene to PubMed literature relationship dataset provided by NCBI), while CoPub uses regular expressions to search against Medline abstracts. Although the “gene2pubmed” dataset contains manually curated information for gene co-citation, its coverage is small: most genes have fewer than 10 co-citations (Figure S1). Additionally, using full text or regular expressions to search for the gene symbol will result in a large number of reports, most of them false positives unrelated to the gene.

CoCiter checks against both the NCBI “gene2pubmed” dataset (downloaded from NCBI FTP in October 2011) and an expanded “gene2pubmed” dataset based on our own mapping. The expanded dataset is generated by using the NCBI E-utilities to search for PubMed abstracts that contain an Entrez gene name and the word “gene”, e.g. “AKT1 gene”, to assure the accuracy of the query. We manually examined this rule on 50 randomly selected human genes (Table S1 in File S1). If a query finds too many reports, the top 500 best-matched records are retrieved (in the original “gene2pubmed” dataset, only 0.007% of total genes have >500 co-citations each). The original gene2pubmed dataset contains 7,565,397 records. With this expansion, the dataset now contains 38,261,321 records.

Assessing the significance of co-citation

We define a log-transformed paper count, which we call co-citation impact (CI), to penalize study biases for star-like genes or terms. The CI of a gene/term with another gene/term is defined as a log transform of N, where N is the number of abstracts in which both genes/terms appear. The CI of a gene/term set A with another gene/term set B is defined in the same way from the abstract sets A i and B j, where A i is the set of PubMed abstracts within which the i-th gene/term in set A is cited, and B j is the set of abstracts within which the j-th gene/term in set B is cited. Thus, N represents the total count of abstracts in which at least one gene/term in set A and one gene/term in set B are co-cited.

To assess the significance of co-citation of gene set A with another gene set or term set B, a Monte Carlo approach is used to evaluate random expectations. CoCiter randomly selects 1000 gene sets with the same size as A within the same species, and the CI (random, B) is calculated for each of the 1000 random gene sets with B.
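The co-citation impact (CI) and the Monte Carlo comparison against 1000 random gene sets can be illustrated with a small sketch. This is not CoCiter's code: the function names, the toy `citations` mapping and the gene pool are hypothetical, the log transform is assumed to be log2(N + 1) because the exact formula is not given in this text, and the empirical p-value (the fraction of random sets whose CI is at least the observed CI) is likewise an assumption about how the random CIs might be summarised.

```python
import math
import random


def co_citation_count(set_a, set_b, citations):
    """N: abstracts citing at least one item of set_a and at least one of set_b."""
    abstracts_a = set().union(*(citations.get(g, set()) for g in set_a))
    abstracts_b = set().union(*(citations.get(t, set()) for t in set_b))
    return len(abstracts_a & abstracts_b)


def co_citation_impact(n):
    """Log-transformed paper count. ASSUMPTION: log2(n + 1), not stated in the text."""
    return math.log2(n + 1)


def monte_carlo_p_value(set_a, set_b, citations, gene_pool, n_random=1000, seed=0):
    """Compare CI(A, B) with CI(random, B) for random sets of the same size as A.

    ASSUMPTION: the p-value is the fraction of random sets with CI >= observed CI.
    """
    rng = random.Random(seed)
    observed = co_citation_impact(co_citation_count(set_a, set_b, citations))
    pool = sorted(gene_pool)  # stable order so the seed gives reproducible draws
    hits = 0
    for _ in range(n_random):
        random_set = rng.sample(pool, len(set_a))
        ci_random = co_citation_impact(
            co_citation_count(random_set, set_b, citations)
        )
        if ci_random >= observed:
            hits += 1
    return observed, hits / n_random


# Toy data (hypothetical): gene or term -> set of abstract identifiers citing it.
citations = {
    "g1": {1, 2, 3},
    "g2": {3, 4},
    "g3": {5},
    "g4": {6},
    "g5": set(),
    "stem cell": {1, 3, 4, 7},
}
gene_pool = ["g1", "g2", "g3", "g4", "g5"]

ci, p_value = monte_carlo_p_value(["g1", "g2"], ["stem cell"], citations, gene_pool)
print(f"CI = {ci:.2f}, empirical p = {p_value:.3f}")
```

In the toy data, abstracts 1, 3 and 4 cite both a gene from the query set and the term, so N = 3. Taking the union of abstracts on each side before intersecting counts each co-citing abstract once, which matches the description of N as a count of abstracts rather than of gene/term pairs.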