Searching NCBI databases using Entrez


Warning: database records change all the time as information is added and removed. Therefore screenshots may not be up to date.

Using GQuery to search NCBI databases

You can also search GenBank through NCBI's search portal, GQuery, formerly known as Entrez.

GQuery is the search portal that searches all NCBI databases (non-sequence databases will be covered in more detail later). GQuery reports how many hits it has found in each database. More importantly, it allows very fine-tuned searching on the annotations of sequences.

The major advantage of GQuery is the common interface to the different databases: once you know how to set up a query in one NCBI database, you can do it for any other NCBI database. Not only are the searches similar, the result pages of all NCBI databases also look the same. The disadvantage is that, because all these resources are linked together, you can get lost in GQuery.

Exercise 1: HIV sequences

Go to http://www.ncbi.nlm.nih.gov. The search bar at the top is the GQuery search tool.

By default it is set to All Databases (of NCBI). You can run a query from there, but if you click Search with an empty search box, you arrive at the GQuery page.

On the GQuery page you see a list of all the databases that are searched by GQuery, grouped according to topic.

Type the search term hiv in the box and click Search. The GQuery results page returns the same list of NCBI databases, together with the number of records in each database that contain the word hiv. To see the results of a specific database, click its name or its number of results in the GQuery results table.

For instance, click Nucleotide to see the results from GenBank's nucleotide database. You already know the look of the results page from the GenBank searches.

Refine the search in GenBank by looking for "complete genomes" from HIV viruses.
Do an advanced search: enter hiv in the Organism field, add AND "complete genome", and look at the results page.

Download a list of accession numbers of these sequences that can be used for searching ENA at EMBL, knowing that ENA cannot process RefSeq accessions.
When you want to use the accession numbers of the results to search in ENA, you cannot include RefSeq accession numbers, since ENA neither uses nor recognizes them. Therefore, you first have to select INSDC (GenBank) in the Source database filter in the filter menu. If the filter is not visible, you can make it visible by clicking Customize in the Source database section and selecting INSDC (GenBank). This filter selects only primary records from GenBank, so no RefSeq records will be present in the results summary.
The difference between the GenBank and the INSDC (GenBank) filter is the following:
INSDC (GenBank): all primary records that are now present in GenBank, regardless of the database in which they were deposited. A returned record can originally come from EMBL or DDBJ, but because the three databases exchange records it is now also present in GenBank.
GenBank: only records that were deposited in GenBank.
To download the accession list:
Click the Send to link.
Select File as Destination.
Select Accession List as Format.
Click the Create File button.
Don't actually do the download, because it would take a lot of time.
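
If an accession list was downloaded without the INSDC filter, the RefSeq entries can still be removed afterwards. The sketch below assumes the RefSeq naming convention of a two-letter prefix followed by an underscore (NC_, NM_, NP_ and so on). The function name drop_refseq and the sample accessions are illustrative only and are not part of the exercise.

```python
import re

# Assumption: RefSeq accessions start with two capital letters and an
# underscore (for example NC_ or NM_); INSDC accessions have no underscore.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_")


def drop_refseq(accessions):
    """Return the accessions that do not look like RefSeq identifiers."""
    cleaned = (acc.strip() for acc in accessions)
    # Skip empty lines and anything matching the RefSeq prefix pattern.
    return [acc for acc in cleaned if acc and not REFSEQ_PATTERN.match(acc)]


if __name__ == "__main__":
    # Sample values used only to exercise the pattern.
    sample = ["NC_001802.1", "K03455.1", "AF033819.3", ""]
    print(drop_refseq(sample))
```

The same function could be applied to the lines of a downloaded accession list file before pasting them into an ENA search.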

In Nucleotide, retrieve all mRNA sequences submitted in 2015 from HIV viruses.
Remove all filters set in the previous exercise.
Do an advanced search: search for hiv in the Organism field only.
Activate the mRNA filter.
Activate the Release date filter and enter the range 2015/01/01 to 2015/12/31.
We only search the Organism field of the GenBank records for the search term (hiv). The filters retain records that contain mRNA sequences and that were submitted in 2015.
Check how the query was translated in the Search Details box.
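
The same search can also be expressed as a request to NCBI's E-utilities esearch service. The sketch below only builds the request URL and does not contact NCBI. It assumes that the Release date filter corresponds to the publication date (datetype=pdat) and that the Nucleotide database is addressed as nuccore. The function name hiv_mrna_search_url is made up for this example.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def hiv_mrna_search_url(year):
    """Build an esearch URL for HIV mRNA records released in the given year."""
    params = {
        "db": "nuccore",  # assumption: E-utilities name of the Nucleotide database
        "term": "hiv[Organism] AND biomol_mrna[PROP]",
        "datetype": "pdat",  # assumption: Release date maps to publication date
        "mindate": f"{year}/01/01",
        "maxdate": f"{year}/12/31",
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(hiv_mrna_search_url(2015))
```

Changing the year argument gives the equivalent of editing the date range in the query translation.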

Obtain the records of HIV virus mRNA sequences in all NCBI databases that were released during 2014.
Copy the query translation of the previous search.
Go to the GQuery start page.
Paste the query into the GQuery search box.
Change 2015 into 2014.
Click Search.
You only see results for the Nucleotide database. That's because the mRNA filter is a field that is specific to this database.

Obtain the records of HIV virus sequences in all NCBI databases that were released during 2014.
Remove biomol_mrna[PROP] from the query.
Click Search.
Now you see results for multiple databases.

In GQuery retrieve all records of proteases except those from HIV viruses.

Which database returns the highest number of hits?
You'll have to use fields to solve this one.
To find all proteases you need to search for protease.
To exclude those from HIV viruses you need the NOT operator.
To specify that you want to exclude proteases originating from HIV viruses you need to use the Organism field.
So you search for protease NOT hiv[Organism].
On the results page you see that the Protein database returns the highest number of hits.

Via GQuery, retrieve all records from proteases with a length between 1000 and 2000 amino acids except those from HIV viruses.

Compare the number of hits in the Protein and the Nucleotide databases.
Go to the results summary page of the Protein database by clicking its name or the number of results.
Activate the Sequence length filter and set the range to 1000 to 2000.
Look in the Search details box to see how this filter is translated in the actual query that is used to search the database. You can use the same syntax in the GQuery search if you want to search all databases for sequences in this length range. The query for the GQuery search box is built up as follows:
To find all proteases you need to search for protease.
To select proteases of a certain length you need the AND operator.
To specify the length you need the SLEN field. Remember that a range is specified using a : (see Entrez help).
To exclude all proteases from HIV viruses you need the NOT operator.
To specify that you want to exclude proteases from HIV viruses you need to use the Organism field.
So you search for protease AND 1000[SLEN]:2000[SLEN] NOT hiv[Organism].
Now you see on the results page that the Nucleotide database returns the highest number of hits. First go to the Protein results to check whether the query returned the correct proteins: the search did indeed return proteins with a length between 1000 and 2000 amino acids. Now check the results for the Nucleotide database: there the search returned sequences with a length between 1000 and 2000 base pairs. So you can use the SLEN field in both the Protein and the Nucleotide database, but it has a different meaning in each: in the Nucleotide database it is a sequence length in base pairs, in the Protein database a sequence length in amino acids. The results returned by the Nucleotide database are therefore not the results that you wanted.
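
To make the point about SLEN concrete, the sketch below assembles the query from its parts and pairs the identical term with two databases, which is the situation where the same number is read as amino acids in one and base pairs in the other. It only builds E-utilities esearch URLs and sends nothing. The function names are made up, and the database names protein and nuccore are assumed to be the E-utilities names for Protein and Nucleotide.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def protease_length_term(min_len, max_len):
    """Compose the protease query with a length range, excluding HIV."""
    return f"protease AND {min_len}[SLEN]:{max_len}[SLEN] NOT hiv[Organism]"


def per_database_urls(term, databases=("protein", "nuccore")):
    """Pair the same term with each database; SLEN is interpreted per database."""
    return {
        db: f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmode': 'json'})}"
        for db in databases
    }


if __name__ == "__main__":
    term = protease_length_term(1000, 2000)
    for db, url in per_database_urls(term).items():
        # protein: length in amino acids; nuccore: length in base pairs
        print(db, url)
```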

Exercise 2: mouse proteases

Now use GQuery to find the mRNA sequences of all mouse proteases.

Which databases return hits for this search?
You'll have to use a field for specifying the organism and one for specifying that you want to retrieve mRNA sequences. The Search details of the earlier mRNA query show how the mRNA filter is translated.
To find all proteases you need to search for protease.
To select mRNA sequences of proteases you need the AND operator.
To specify that you only want mRNA sequences, you need the PROP (Properties) field and biomol_mRNA as a search term.
To select mRNA sequences of proteases from mouse you need the AND operator.
To specify that you want to select proteases from mouse you need to use the Organism field.
So you search for protease AND biomol_mRNA[PROP] AND mouse[Organism].
On the results page, you see that the search returns records from the Nucleotide, EST and GSS databases.

It seems very strange that the Genomic Survey Sequence database contains mRNA sequences, since by definition it is supposed to include only genomic sequences. Remember: GSSs are defined as the genomic counterparts of ESTs. However, as you can see on the GSS database page, two classes of GSS (exon trapped products and gene trapped products) may be derived via a cDNA intermediate, so the GSS database does contain a small number of mRNA sequence records.
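
The question "which databases return hits" can also be answered from per-database hit counts. The sketch below assumes the JSON layout returned by the E-utilities esearch service with retmode=json, where the hit count is a string under esearchresult.count. The function names and the placeholder responses are invented for illustration and are not real search results.

```python
import json


def hit_count(esearch_json):
    """Extract the hit count from an esearch JSON response body."""
    # Assumption: the count is stored as a string under esearchresult.count.
    return int(json.loads(esearch_json)["esearchresult"]["count"])


def databases_with_hits(responses):
    """Return the database names whose response reports at least one hit."""
    return [db for db, body in responses.items() if hit_count(body) > 0]


if __name__ == "__main__":
    # Placeholder responses with invented counts, not real search results.
    placeholder = {
        "nuccore": '{"esearchresult": {"count": "3"}}',
        "protein": '{"esearchresult": {"count": "0"}}',
    }
    print(databases_with_hits(placeholder))
```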
