Searching NCBI databases using Entrez
Using GQuery to search NCBI databases
You can also search GenBank through NCBI's search portal, GQuery, formerly known as Entrez.
GQuery is the search portal that searches all NCBI databases (non-sequence databases will be covered in more detail later). GQuery reports how many hits it has found in every database. More importantly, it allows very fine-tuned searching on the annotations of sequences.
The major advantage of GQuery is the common interface to the different databases: once you know how to set up a query in one NCBI database, you can do it for any other NCBI database. Not only are the searches similar, the results pages of all NCBI databases also look the same in Entrez. The disadvantage is that, with all these resources linked together, you can get lost in GQuery...
Exercise 1: HIV sequences
Go to http://www.ncbi.nlm.nih.gov. The search bar at the top is the GQuery search tool.
By default it is set to All Databases (of NCBI). You can run a query from there, but if you just click Search (with an empty search box), you arrive at the GQuery page.
On the GQuery page you see a list of all the databases that are searched by GQuery, grouped according to topic.
Type the search term hiv in the box and click Search. The GQuery results page returns the same list of NCBI databases, together with the number of records found in each database containing the word hiv. To see the results of a specific database, click its name or the number of results it generates in the GQuery results table.
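The per-database hit counts that GQuery shows can also be requested programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch service is used with `rettype=count`; the helper names `count_url` and `count_urls` and the three-database subset are illustrative choices, not part of the exercise. The code only builds the request URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# E-utilities identifiers for a few NCBI databases
# (an illustrative subset, not the full list shown on the GQuery page).
DATABASES = {"Nucleotide": "nuccore", "Protein": "protein", "PubMed": "pubmed"}


def count_url(db: str, term: str) -> str:
    """Return an ESearch URL that asks only for the number of hits."""
    params = {"db": db, "term": term, "rettype": "count"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def count_urls(term: str) -> dict:
    """Build one count URL per database, keyed by its display name."""
    return {name: count_url(db, term) for name, db in DATABASES.items()}


# Print the URLs for the search term used in this exercise.
for name, url in count_urls("hiv").items():
    print(f"{name}: {url}")
```

Opening one of these URLs in a browser should return a small XML document with the count for that database, which you can compare with the number on the GQuery results page.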
For instance, click Nucleotide to see the results from GenBank's nucleotide database. You already know the look of the results page from the GenBank searches.
|Refine the search in GenBank by looking for "complete genomes" from hiv viruses|
|Do an advanced search
look at the results page
|Download a list of accession numbers of these sequences that can be used to search ENA at EMBL, keeping in mind that ENA cannot process RefSeq accessions.|
When you want to use the accession numbers of the results to search in ENA, you cannot include RefSeq accession numbers, since ENA neither uses nor recognizes them. Therefore, you first have to select INSDC (GenBank) in the Source database filter in the filter menu.
If the filter is not visible, you can make it visible by clicking Customize in the Source database section and selecting INSDC (GenBank).
This filter selects only primary records from GenBank, so no RefSeq records will be present in the results summary. The difference between the GenBank and the INSDC (GenBank) filter is the following:
To download the Accession list:
Don't actually do the download because it would take a lot of time.
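If you already have a mixed accession list, the RefSeq entries can also be removed afterwards. The sketch below assumes the usual accession formats: RefSeq accessions begin with two capital letters followed by an underscore (such as NC_ or NM_), whereas INSDC accessions contain no underscore. The function names and the example accessions are only illustrative.

```python
import re

# RefSeq accessions start with two capital letters and an underscore
# (for example NC_ or NM_); INSDC accessions contain no underscore.
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def is_refseq(accession: str) -> bool:
    """Return True if the accession looks like a RefSeq accession."""
    return bool(REFSEQ_PREFIX.match(accession.strip()))


def insdc_only(accessions):
    """Keep only non-empty accessions that are not RefSeq accessions."""
    return [a.strip() for a in accessions if a.strip() and not is_refseq(a)]


# Illustrative list mixing one RefSeq accession with two INSDC accessions.
mixed = ["NC_001802", "K03455", "AF033819"]
print(insdc_only(mixed))  # ['K03455', 'AF033819']
```

Selecting the INSDC (GenBank) filter before downloading remains the simpler route; a check like this is mainly useful for lists obtained elsewhere.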
|In Nucleotide retrieve all mRNA sequences submitted in 2015 from hiv viruses|
We will search only the Organism field of the GenBank records for the search term (hiv). The filters will retain records that contain mRNA sequences and that were submitted in 2015. Check how the query was translated in the Search Details box.
|Obtain the records of hiv virus mRNA sequences in all NCBI databases that were released during 2014.|
You only see results for the Nucleotide database. That's because the mRNA filter is a field that is specific to this database.
|Obtain the records of hiv virus sequences in all NCBI databases that were released during 2014.|
Now you see results for multiple databases.
In GQuery retrieve all records of proteases except those from HIV viruses.
|Which database returns the highest number of hits?|
|You'll have to use fields to solve this one.
So you search for protease NOT hiv [Organism].
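Queries like this one are plain strings built from terms, field tags in square brackets, and Boolean operators in capitals. As a minimal sketch of that structure, the hypothetical helpers `tagged` and `exclude` below assemble the query used in this exercise; they are not part of any NCBI tool.

```python
def tagged(term: str, field: str) -> str:
    """Restrict a search term to one field, e.g. hiv[Organism]."""
    return f"{term}[{field}]"


def exclude(query: str, unwanted: str) -> str:
    """Combine two query parts with the Boolean operator NOT."""
    return f"{query} NOT {unwanted}"


# All proteases, except records whose Organism field matches hiv.
query = exclude("protease", tagged("hiv", "Organism"))
print(query)  # protease NOT hiv[Organism]
```

Without the field tag, the term hiv would be matched against all fields, so records that merely mention HIV somewhere in their annotation would be excluded as well.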
Via GQuery, retrieve all records from proteases with a length between 1000 and 2000 amino acids except those from HIV viruses.
|Compare the number of hits in the Protein and the Nucleotide databases.|
|Go to the results summary page of the Protein database by clicking its name or the number of results.|
Activate the Sequence length filter and set the range to 1000 to 2000.
You can use the same syntax in the GQuery search if you want to search all databases for sequences in this length range. In the GQuery search box, type the following query:
protease AND 1000[SLEN]:2000[SLEN] NOT hiv [Organism]
Now you see on the results page that the Nucleotide database returns the highest number of hits.
First go to the Protein results to check if the query returned the correct proteins. The search did indeed return proteins with a length between 1000 and 2000 amino acids.
Now check the results for the Nucleotide database. You see that the search on the Nucleotide database returned sequences with a length between 1000 and 2000 base pairs.
So you can use the SLEN field in both the Protein and the Nucleotide database, but it has a different meaning in each. In the Nucleotide database it is a sequence length in base pairs; in the Protein database it is a sequence length in amino acids. The results returned by the Nucleotide database are therefore not the results that you wanted.
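The point about units can be made explicit with a small sketch. It builds the SLEN range term in the syntax used above and states how each of the two databases interprets it; the helper names `length_range` and `explain` and the `SLEN_UNIT` table are illustrative and only encode what this exercise describes.

```python
# Unit in which each database measures the SLEN (sequence length) field,
# as observed in this exercise.
SLEN_UNIT = {"Protein": "amino acids", "Nucleotide": "base pairs"}


def length_range(minimum: int, maximum: int) -> str:
    """Build a sequence-length range term such as 1000[SLEN]:2000[SLEN]."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("expected 0 <= minimum <= maximum")
    return f"{minimum}[SLEN]:{maximum}[SLEN]"


def explain(database: str, minimum: int, maximum: int) -> str:
    """Describe what the range term selects in the given database."""
    unit = SLEN_UNIT[database]
    term = length_range(minimum, maximum)
    return f"{database}: {term} selects sequences of {minimum} to {maximum} {unit}"


# The same term means something different in each database.
for database in SLEN_UNIT:
    print(explain(database, 1000, 2000))
```

Because the term itself carries no unit, a length-restricted protein search is best judged on the Protein results only.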
Exercise 2: mouse proteases
Now use GQuery to find the mRNA sequences of all mouse proteases.
|Which databases return hits for this search?|
|You'll have to use a field for specifying the organism and for specifying that you want to retrieve mRNA sequences. When you go to the Search details of the last query of the previous exercise, you see how this filter is translated:
So you search for protease AND biomol_mRNA [PROP] AND mouse [Organism]
It seems very strange that the Genomic Survey Sequence database contains mRNA sequences, since by definition it is supposed to include only genomic sequences. Remember: GSSs are defined as the genomic counterparts of ESTs. However, as you can see on the GSS database page, two classes of GSS (exon trapped products and gene trapped products) may be derived via a cDNA intermediate, so the GSS database does contain a small number of mRNA sequence records.