Searching NCBI databases using Entrez

From BITS wiki

Warning: Information is liquid. Records change all the time: information is removed and added. Therefore screenshots may not be up to date.

Using GQuery to search NCBI databases

You can also search GenBank through NCBI's search portal, GQuery, the global query interface of the Entrez search system.

GQuery is the name of the search portal that searches all NCBI databases (non-sequence databases will be covered in more detail later). GQuery reports how many hits it has found in every database. More importantly, it allows very fine-tuned searching on the annotations of sequences.

The major advantage of GQuery is the common interface to the different databases: once you know how to set up a query in one NCBI database, you can do it for any other NCBI database. Not only are the searches similar, the result pages of all NCBI databases also look the same. The disadvantage is that, because all these resources are linked together, you can get lost in GQuery...

Exercise 1: HIV sequences

Go to http://www.ncbi.nlm.nih.gov. The search bar at the top of the page is the GQuery search tool.

By default it is set to All Databases (of NCBI). You can run a query from there, but if you just click Search with an empty search box, you arrive at the GQuery page.

On the GQuery page you see a list of all the databases that are searched by GQuery, grouped according to topic.

Type the search term hiv in the box and click Search. The GQuery results page returns the same list of NCBI databases, together with the number of records found in each database that contain the word hiv. To see the results of a specific database, click its name or the number of results next to it in the GQuery results table.

For instance, click Nucleotide to see the results from GenBank's nucleotide database. You already know the layout of the results page from the GenBank searches.
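The per-database hit counts that GQuery shows can also be requested programmatically. The following is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint and its `db`, `term` and `rettype=count` parameters; the helper name `esearch_count_url` and the choice of three databases are illustrative and not part of the exercise. The script only builds the URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service (assumed entry point).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(db: str, term: str) -> str:
    """Build an ESearch URL that asks only for the number of hits in one database."""
    params = {"db": db, "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


# Illustrative subset of databases, using their E-utilities identifiers
# (nuccore corresponds to the Nucleotide database).
for db in ("nuccore", "protein", "pubmed"):
    print(db, esearch_count_url(db, "hiv"))
```

Opening one of the printed URLs in a browser should return a small XML document with the hit count for that database; the numbers will differ from any screenshot because records change over time.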

In GQuery, retrieve all protease records except those from HIV.

Via GQuery, retrieve all protease records with a length between 1000 and 2000 amino acids, except those from HIV.
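One possible way to phrase these two searches is with the Boolean operators AND and NOT (written in capitals) and a sequence length range in the `[SLEN]` field. The sketch below only assembles the query strings to paste into the search box; the helper names `length_range` and `all_except` are hypothetical, and using the free-text term hiv to exclude HIV records is an assumption that may be too broad or too narrow for your purpose.

```python
def length_range(low: int, high: int) -> str:
    """Entrez range syntax on the sequence length field, e.g. 1000:2000[SLEN]."""
    return f"{low}:{high}[SLEN]"


def all_except(include: str, exclude: str, *filters: str) -> str:
    """Combine a search term with optional filters (AND) and one exclusion (NOT)."""
    parts = [include, *filters]
    return " AND ".join(parts) + f" NOT {exclude}"


# First question: all proteases except those matching hiv.
print(all_except("protease", "hiv"))

# Second question: the same, restricted to sequences of 1000 to 2000 residues.
print(all_except("protease", "hiv", length_range(1000, 2000)))
```

Entrez evaluates Boolean operators from left to right, so the NOT clause is placed last and removes the HIV hits from everything selected before it.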


Exercise 2: mouse proteases

Now use GQuery to find the mRNA sequences of all mouse proteases.
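A query for this exercise can combine a search term with an organism restriction and a molecule-type filter. The sketch below is one possible formulation, assuming the `[Organism]` field and the `biomol_mrna[PROP]` property filter of the Entrez sequence databases; the helper `and_query` is hypothetical and the script only prints the query string.

```python
def and_query(*terms: str) -> str:
    """Join search terms with the Entrez AND operator."""
    return " AND ".join(terms)


# Assumed formulation: free-text term, organism restriction, mRNA filter.
query = and_query("protease", '"Mus musculus"[Organism]', "biomol_mrna[PROP]")
print(query)
```

Pasting the printed string into the search box and comparing the counts with and without the last term shows how much the mRNA filter narrows the result.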

It may seem strange that the Genomic Survey Sequence (GSS) database contains mRNA sequences, since by definition it is supposed to include only genomic sequences. Remember: GSSs are defined as the genomic counterparts of ESTs. However, as you can see on the GSS database page, two classes of GSS (exon trapped products and gene trapped products) may be derived via a cDNA intermediate, so the GSS database does contain a small number of mRNA sequence records.