NCBI offers extensive collections of sequences through its BLAST services (http://blast.ncbi.nlm.nih.gov) for comparing and identifying DNA, RNA and protein sequences. NCBI now deposits descriptions of these sequence collections, known as BLAST databases, in a special database called blastdbinfo that you can access through the Entrez Programming Utilities (E-Utilities). Using blastdbinfo, you can enable a program to find an appropriate database and then send BLAST searches to that database using either the BLAST URL API or standalone BLAST (installed locally).
If you’re unfamiliar with the E-Utilities, please see the E-Utilities documentation for a full description of these tools.
Procedure
1. Use esearch.fcgi to find desired BLAST databases (see Table 1 below for a listing of several useful query fields).
esearch.fcgi?db=blastdbinfo&term=<database query>
[Parse out database ID from XML output]
2. Use esummary.fcgi to retrieve metadata about the matching databases.
esummary.fcgi?db=blastdbinfo&id=<database ID>
[Parse out database path from XML output]
3. Run a BLAST search with the desired database.
Blast.cgi?CMD=Put&DATABASE=<database path>&PROGRAM=<program>&QUERY=<query>
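As a rough sketch, the three request URLs in this procedure could be assembled with Python's standard library. The base URLs and the helper names (`esearch_url`, `esummary_url`, `blast_put_url`) are assumptions made for illustration and are not part of any NCBI client library. This sketch passes the database ID to esummary in the `id` parameter and leaves fetching and XML parsing to the caller.

```python
from urllib.parse import urlencode

# Assumed base URLs for the E-Utilities and the BLAST URL API.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def esearch_url(query):
    """Step 1: search blastdbinfo for databases matching an Entrez query."""
    params = {"db": "blastdbinfo", "term": query}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def esummary_url(db_id):
    """Step 2: request metadata for one blastdbinfo record by its ID."""
    params = {"db": "blastdbinfo", "id": db_id}
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(params)


def blast_put_url(db_path, program, query):
    """Step 3: submit a search against the database path from esummary."""
    params = {"CMD": "Put", "DATABASE": db_path,
              "PROGRAM": program, "QUERY": query}
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(esearch_url("refseq[blast database source] AND human[title]"))
    print(esummary_url("1023214"))
```

`urlencode` percent-encodes the square brackets and slashes, so field-qualified queries and database paths can be passed in unchanged.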
Example
For this example, we will look for human BLAST databases containing sequences from the NCBI Reference Sequence (RefSeq) Project.
1. Use esearch with the following query (see Table 1):
refseq[blast database source] AND human[title]
The first few lines of the returned XML result appear below.
<eSearchResult>
<Count>13</Count>
<RetMax>13</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>1023214</Id>
<Id>1001294</Id>
<Id>998664</Id>
…
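One way to pull the database IDs out of an esearch result is sketched below with Python's `xml.etree.ElementTree`. The helper name `parse_esearch_ids` is hypothetical, and the sample string is only the excerpt shown above, cut to its first three IDs and closed off so that it parses as a complete document.

```python
import xml.etree.ElementTree as ET


def parse_esearch_ids(xml_text):
    """Return the database IDs listed under <IdList> in an esearch result."""
    root = ET.fromstring(xml_text)
    return [id_el.text for id_el in root.findall("./IdList/Id")]


# Excerpt of the result above, with closing tags added for illustration.
SAMPLE_ESEARCH = """<eSearchResult>
<Count>13</Count>
<RetMax>13</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>1023214</Id>
<Id>1001294</Id>
<Id>998664</Id>
</IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(parse_esearch_ids(SAMPLE_ESEARCH))
    # prints ['1023214', '1001294', '998664']
```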
2. Use esummary to retrieve the names and paths of the databases. In this case, we will use ID 1023214.
The first few lines of the esummary XML appear below.
<eSummaryResult>
<DocumentSummarySet status="OK"><DocumentSummary uid="1023214">
<Name>Human build 37 RNA, reference, and alternate assemblies</Name>
<Path>DBINDEX/9606/allcontig_and_rna</Path>
<Title>human build 37 RNA, alternate and reference assemblies.</Title>
<LastUpdated>2010/11/01 00:00</LastUpdated>
<Description/>
<TotalLength>5886906670</TotalLength>
<MaxLength>115591998</MaxLength>
<NumSequences>50354</NumSequences>
…
The BLAST database name and its path prefix are in the <Path> field. We can use the complete string in this field to compose a search request using the BLAST URL API or standalone BLAST+.
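Extracting the path could look like the following sketch, which maps each record's `uid` to the text of its `<Path>` element. The helper name `parse_esummary_paths` is hypothetical, and the sample is a shortened copy of the excerpt above with closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET


def parse_esummary_paths(xml_text):
    """Map each DocumentSummary uid to the text of its <Path> element."""
    root = ET.fromstring(xml_text)
    return {doc.get("uid"): doc.findtext("Path")
            for doc in root.iter("DocumentSummary")}


# Shortened excerpt of the esummary result, closed off for illustration.
SAMPLE_ESUMMARY = """<eSummaryResult>
<DocumentSummarySet status="OK"><DocumentSummary uid="1023214">
<Name>Human build 37 RNA, reference, and alternate assemblies</Name>
<Path>DBINDEX/9606/allcontig_and_rna</Path>
</DocumentSummary></DocumentSummarySet>
</eSummaryResult>"""

if __name__ == "__main__":
    print(parse_esummary_paths(SAMPLE_ESUMMARY))
    # prints {'1023214': 'DBINDEX/9606/allcontig_and_rna'}
```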
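3. Use the BLAST URL API to invoke the database by passing the <Path> value in the DATABASE parameter:
Blast.cgi?CMD=Put&DATABASE=DBINDEX/9606/allcontig_and_rna&PROGRAM=blastn&QUERY=<query>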
For standalone BLAST, you can invoke the database on the command line:
blastn -db DBINDEX/9606/allcontig_and_rna -remote -query <query_file> …
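If the search is launched from a script, the same command line can be built as an argument list, as in this sketch. It uses only the options shown above (`-db`, `-remote`, `-query`). The helper name `remote_blastn_command` and the file name `query.fasta` are placeholders chosen for illustration.

```python
import shlex


def remote_blastn_command(db_path, query_file):
    """Return the argument list for a remote blastn search against db_path."""
    return ["blastn", "-db", db_path, "-remote", "-query", query_file]


if __name__ == "__main__":
    cmd = remote_blastn_command("DBINDEX/9606/allcontig_and_rna", "query.fasta")
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # prints blastn -db DBINDEX/9606/allcontig_and_rna -remote -query query.fasta
```

Passing the list to `subprocess.run(cmd, check=True)` would run the search, provided BLAST+ is installed and `blastn` is on the PATH.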
Table 1 – Some useful query fields in blastdbinfo
Query Field | Sample Values | Example | Function
[blast sequence strategy] (nucleotide databases only) | est, gss, htgs012, htgs0123, wgs | wgs[blast sequence strategy] | Retrieves all databases containing wgs sequences
[blast database source] | genbank, gnomon, pdb, refseq, sra, swissprot | refseq[blast database source] | Retrieves all databases containing RefSeq sequences
[blast sequence type] | cdna, genomic, otherdna, protein | Protein[blast sequence type] | Retrieves all databases containing protein sequences
[title] | Text words within the database title | Non-redundant[title] | Retrieves databases with “non-redundant” in their title
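Field-qualified terms like those in Table 1 can be combined programmatically. The sketch below joins them with the Entrez `AND` operator, reproducing the query used in the example. The helper names `field_term` and `and_query` are hypothetical.

```python
def field_term(value, field):
    """Format one search term as value[field], e.g. refseq[blast database source]."""
    return f"{value}[{field}]"


def and_query(*terms):
    """Join terms with AND so that every condition must match."""
    return " AND ".join(terms)


if __name__ == "__main__":
    query = and_query(
        field_term("refseq", "blast database source"),
        field_term("human", "title"),
    )
    print(query)
    # prints refseq[blast database source] AND human[title]
```

The resulting string still needs URL encoding before it is placed in the `term` parameter of an esearch request.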
For more information
For a complete list of the search fields available in the blastdbinfo database, see http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=blastdbinfo
For technical assistance on BLAST, write to blast-help@ncbi.nlm.nih.gov.