Running BLAST via CLSD

BLAST is used to search a database of nucleotide or amino acid sequences for entries similar to a known (query) sequence of nucleotides or amino acids. The query sequence may match at one or more positions within a database entry, and multiple segments within the query sequence may match segments within a single database entry.

Each match receives a similarity score, and pairs of matching fragments whose similarities exceed some threshold value are called high scoring segment pairs (HSPs). A database entry containing at least one HSP is considered a BLAST "hit".

NCBI provides a program, blastall, that performs BLAST searches on data sources such as GenBank and SWISS-PROT. DB2 provides an interface to blastall called the BLAST "wrapper." CLSD uses the DB2 BLAST wrapper to send BLAST requests to blastall and to present BLAST results as (virtual) relational tables, each identified by its own "nickname".

BLAST can be run with SQL commands using a syntax as simple as

select [display_variables] from ncbi.[BLAST_table_nickname_within_CLSD] where blastseq='[some_query_sequence]'

where the bracketed expressions must be replaced by specific information, as follows:

[display_variables]
a list of variables to display in the output table.

[BLAST_table_nickname_within_CLSD]
composed of a [BLAST_search_type] and a [BLAST_database] name separated by an underscore (_), where:

[BLAST_search_type]: (currently) one of BLASTN, BLASTP, or BLASTX
indicates whether to search for nucleotide or peptide sequences and whether to convert the query or the database entries from nucleotide to amino acid sequences, and, if so, whether to convert using multiple frames.

[BLAST_database]: one of NT, NR, or SP
specifies which database to use. (Note that not all databases can be used with all search types.)

[some_query_sequence]
a nucleotide or amino acid search (query) sequence.

Here is an example query conforming to the simple syntax:

select GB_ACC_NUM, description from ncbi.BLASTN_NT where BlastSeq = 'AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGCA'
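
The simple syntax can also be assembled programmatically. The following Python sketch only builds the SQL text; it does not connect to CLSD. The function names blast_nickname and simple_blast_query are hypothetical helpers introduced for illustration, and the letters-only check on the sequence is an assumption, not a rule stated by CLSD.

```python
def blast_nickname(search_type: str, database: str) -> str:
    """Join a search type and a database name with an underscore."""
    return f"{search_type.upper()}_{database.upper()}"


def simple_blast_query(columns, search_type, database, sequence):
    """Compose the simple SELECT form (hypothetical helper, text only)."""
    # Assumption: a query sequence consists of letters only; this also
    # keeps quote characters out of the SQL string literal.
    if not sequence.isalpha():
        raise ValueError("query sequence should contain letters only")
    nickname = blast_nickname(search_type, database)
    return (
        f"select {', '.join(columns)} "
        f"from ncbi.{nickname} "
        f"where BlastSeq = '{sequence}'"
    )


query = simple_blast_query(
    ["GB_ACC_NUM", "description"],
    "blastn",
    "nt",
    "AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGCA",
)
print(query)
```

With these arguments the helper reproduces the example query above; the resulting string would then be submitted through whatever SQL client is used to reach CLSD.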

More complex queries can be constructed with SQL commands using a more detailed syntax:

select [display_variables] from NCBI.[BLAST_table_nickname_within_CLSD] where blastseq='[some_query_sequence]' and [expressions_controlling_blastall_search_behavior] and [expressions_controlling_display_of_results]

where the new bracketed expressions are replaced as follows:

[expressions_controlling_blastall_search_behavior]
any non-default settings for variables that control the BLAST search parameters sent to blastall, and

[expressions_controlling_display_of_results]
comparisons with variables that control which BLAST results to return.

Here is a more complex example:

select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq, HSP_Midline from ncbi.BLASTN_NT where BlastSeq = 'gagttgtcaatggcgagg' and gapcost=8 and E_Value < .0005

This query specifies a value for the gap cost setting used during the BLAST search and displays only resulting hits with an E value less than 0.0005.
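
The detailed syntax is a list of predicates joined with "and": the query sequence, then any search settings, then any result filters. As a rough sketch, the Python function below assembles such a statement as text only. The name blast_query and its parameters are hypothetical, and the quoting of string-valued settings is an assumption about how such values would be written in SQL.

```python
def blast_query(columns, nickname, sequence,
                search_settings=None, result_filters=None):
    """Compose a BLAST SELECT statement as text (hypothetical helper)."""
    predicates = [f"BlastSeq = '{sequence}'"]
    # Search settings are sent to blastall as equality predicates.
    for name, value in (search_settings or {}).items():
        if isinstance(value, str):
            value = f"'{value}'"  # assumption: string settings are quoted
        predicates.append(f"{name} = {value}")
    # Result filters are ordinary comparisons, passed through unchanged.
    predicates.extend(result_filters or [])
    return (
        f"select {', '.join(columns)} from ncbi.{nickname} where "
        + " and ".join(predicates)
    )


query = blast_query(
    ["Score", "E_Value", "HSP_Info", "HSP_Q_Seq", "HSP_H_Seq", "HSP_Midline"],
    "BLASTN_NT",
    "gagttgtcaatggcgagg",
    search_settings={"GapCost": 8},
    result_filters=["E_Value < 0.0005"],
)
print(query)
```

Keeping search settings and result filters in separate arguments mirrors the two bracketed groups in the detailed syntax, even though both end up in the same WHERE clause.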

Note that users may create their own searchable sequence databases by applying the NCBI-supplied formatdb utility to a file containing a collection of sequences in FASTA format. See the IBM documentation for details.

Here is a table showing which search types are supported by the DB2 BLAST wrapper. Only the first three types are enabled within CLSD.

BLAST search type | Data sources | Description
BLASTN | NT | A nucleotide sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence.
BLASTP | NR, SP | An amino acid sequence is compared with the contents of an amino acid sequence database to find sequences with regions similar to regions of the original sequence.
BLASTX | NR, SP | A nucleotide sequence is compared with the contents of an amino acid sequence database to find sequences with regions similar to regions of the original sequence. The query sequence is translated in all six reading frames, and each of the resulting sequences is used to search the sequence database.
TBLASTN | NT | An amino acid sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence. The sequences in the sequence database are translated in all six reading frames, and the resulting sequences are searched for regions similar to regions of the query sequence.
TBLASTX | NT | A nucleotide sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence. In a TBLASTX search, both the query sequence and the sequence database are translated in all six reading frames, and the resulting sequences are compared to discover similar regions.
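
Because not every database can be used with every search type, it can help to check a combination before composing a nickname. The Python sketch below encodes the table above as a lookup; the names WRAPPER_DATA_SOURCES, CLSD_ENABLED, and clsd_nickname are hypothetical and exist only for this illustration.

```python
# Search types and data sources, copied from the table above.
WRAPPER_DATA_SOURCES = {
    "BLASTN": {"NT"},
    "BLASTP": {"NR", "SP"},
    "BLASTX": {"NR", "SP"},
    "TBLASTN": {"NT"},
    "TBLASTX": {"NT"},
}
# Only the first three search types are enabled within CLSD.
CLSD_ENABLED = {"BLASTN", "BLASTP", "BLASTX"}


def clsd_nickname(search_type: str, database: str) -> str:
    """Return a nickname such as 'BLASTX_SP', or raise ValueError."""
    search_type, database = search_type.upper(), database.upper()
    if search_type not in WRAPPER_DATA_SOURCES:
        raise ValueError(f"unknown search type: {search_type}")
    if search_type not in CLSD_ENABLED:
        raise ValueError(f"{search_type} is not enabled within CLSD")
    if database not in WRAPPER_DATA_SOURCES[search_type]:
        raise ValueError(f"{search_type} cannot search {database}")
    return f"{search_type}_{database}"


print(clsd_nickname("blastx", "sp"))
```

A call such as clsd_nickname("blastn", "nr") raises ValueError, since the table lists only NT as a data source for BLASTN.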

 

Specify values for the following variables to control how the BLAST search will be performed

Name | Default | Description
BlastSeq | no default | The sequence for which to search. It must be at least 15 bytes long.
E_Value | 10 | Both an input and an output parameter. As an input parameter, this column indicates to the BLAST wrapper the upper limit of expect values that should be returned from blastall.
QueryStrands | 3 | Specifies which strands should be compared when performing a BLASTN search. A value of 1 indicates that the top strand should be used, 2 indicates the bottom strand, and 3 indicates that both strands should be compared.
GapAlign | 1 | Indicates to the wrapper whether gapped alignments are permitted in the BLAST output.
Matrix | BLOSUM62 | Determines which substitution matrix is used by blastall to determine the degree of similarity between pairings of amino acids. Only those BLAST search types that compare amino acids to amino acids use this predicate. The choices are:
  • BLOSUM80, PAM1 (for less divergent sequences)
  • BLOSUM62, PAM120
  • BLOSUM45, PAM250 (for more divergent sequences)
NMisMatchPenalty | 3 | Specifies the value that blastall deducts from the score of an alignment if one of the pairs of nucleotides in the similar region does not match. Only those BLAST search types that compare nucleotides to nucleotides use this predicate.
NMatchReward | 1 | Specifies the value that blastall adds to the score of an alignment for each of the pairs of nucleotides in the similar region that do match. Only those BLAST search types that compare nucleotides to nucleotides use this predicate.
FilterSequence | T | Indicates to blastall whether to perform filtering to remove biologically uninteresting segments from the query sequence. If the search type is BLASTN, the filter used is DUST. Otherwise, filtering is performed by SEG.
NumberOfAlignments | 250 | Specifies how many HSP alignments to include in the BLAST output.
GapCost | 11 | Specifies the value that blastall deducts from the score of an alignment if a gap must be introduced in either the query sequence or the hit sequence to allow the length of the alignment to grow.
ExtendedGapCost | 1 | Specifies the value that blastall deducts from the score of an alignment if a gap that was already introduced in either the query sequence or the hit sequence must be extended by one nucleotide or amino acid to allow the length of the alignment to grow.
WordSize | 11 for BLASTN; 3 for BLASTP | Indicates to blastall the length of the initial word matches that blastall searches for in the database.
ThresholdEx | 0 | Indicates the score threshold below which BLAST does not attempt to extend a hit any further.

     

Use these variables to control which BLAST results are displayed

You can compare the following variables with specific values within WHERE clauses to control which BLAST results are included in the output tables.

Name | Data type | Description
Score | DOUBLE | The computed score for an HSP as reported in the BLAST results.
E_Value | DOUBLE | Both an input and an output parameter. As an output parameter, this column provides the expect value computed for an HSP as reported in the BLAST results.
Length | INTEGER | The length of the hit sequence as reported in the BLAST results.
HIT_NUM | INTEGER | The hit number as reported in the BLAST results, starting with 1.
HSP_NUM | INTEGER | The HSP number as reported in the BLAST results, starting with 1.
HSP_Info | VARCHAR(100) | The information string for the given HSP, as reported by BLAST. This string contains information about the number of nucleotides or amino acids that matched between the query sequence and the hit sequence.
HSP_ALIGNMENT_LENGTH | INTEGER | The length of the HSP alignment.
HSP_IDENTITY | INTEGER | The percent identity of the alignment, defined as the number of identities divided by the alignment length.
HSP_GAPS | INTEGER | The percent gaps in the alignment, defined as the number of gaps divided by the alignment length.
HSP_POSITIVE | INTEGER | The percent positives of the alignment, defined as the number of positives divided by the alignment length.
HSP_QUERY_FRAME | INTEGER | The reading frame of the alignment in the query sequence; only available for BLASTX, TBLASTN, and TBLASTX type servers.
HSP_HIT_FRAME | INTEGER | The reading frame of the alignment in the hit sequence; only available for BLASTX, TBLASTN, and TBLASTX type servers.
HSP_Q_Start | INTEGER | The numeric position of the first nucleotide or amino acid in the region of similarity within the query sequence.
HSP_Q_End | INTEGER | The numeric position of the last nucleotide or amino acid in the region of similarity within the query sequence.
HSP_Q_Seq | VARCHAR(32000) | The segment of the query sequence beginning at HSP_Q_Start and ending at HSP_Q_End. You can override the default data type for this column and specify CLOB, with a maximum length of 5 megabytes.
HSP_H_Start | INTEGER | The numeric position of the first nucleotide or amino acid in the region of similarity within the hit sequence.
HSP_H_End | INTEGER | The numeric position of the last nucleotide or amino acid in the region of similarity within the hit sequence.
HSP_H_Seq | VARCHAR(32000) | The segment of the hit sequence beginning at HSP_H_Start and ending at HSP_H_End.
HSP_Midline | VARCHAR(32000) | The string output by BLAST that indicates the degree of similarity between the amino acids or nucleotides at each position in the similar regions of the query and hit sequences.
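
The three percentage columns (HSP_IDENTITY, HSP_GAPS, HSP_POSITIVE) share one formula: a count divided by the alignment length. As a small worked example, the Python sketch below computes that ratio as a percentage. The function name and the sample counts are hypothetical, the multiplication by 100 is an assumption implied by the word "percent", and the source does not say how the wrapper rounds the result to an INTEGER, so the value is left unrounded here.

```python
def percent_of_alignment(count: int, alignment_length: int) -> float:
    """Return count / alignment_length as a percentage (illustrative)."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * count / alignment_length


# Hypothetical HSP: 17 identical positions and no gaps in an
# alignment of length 18.
identity = percent_of_alignment(17, 18)
gaps = percent_of_alignment(0, 18)
print(f"identity: {identity:.1f}%  gaps: {gaps:.1f}%")
```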

     

Additional information that can be displayed

SQL statements that invoke BLAST may also produce tables containing any of the fields listed below:

GI_CONST VARCHAR
GI_NUM VARCHAR
DB_NAME VARCHAR
GB_ACC_NUM VARCHAR
GB_ACC_VER VARCHAR
GB_ACC_NUM2 VARCHAR

The NCBI schema also provides a simple table to map GenBank accession numbers from the NR databank to GI numbers:

Table | Field | Type | Description
GI_ACC_NR | GB_ACC_NO | VARCHAR(15) | GenBank Accession
GI_ACC_NR | GI_ID | INTEGER | GenInfo Identifier

You can find more information about using BLAST within CLSD in the IBM documentation describing the BLAST wrapper or in "Efficient Access to BLAST using IBM DB2 Information Integrator".

This material was taken from relevant IBM documentation.