Running BLAST via CLSD
BLAST is used to search a database of nucleotide or amino acid sequences for entries similar to a known (query) sequence of nucleotides or amino acids. The query sequence may match at one or more positions within a database entry, and multiple segments within the query sequence may match segments within a single database entry.
Each match receives a similarity score, and pairs of matching fragments whose similarity exceeds some threshold value are called high-scoring segment pairs (HSPs). A database entry containing at least one HSP is considered a BLAST "hit".
NCBI provides a program, blastall, that performs BLAST searches on data sources such as GenBank and SWISS-PROT. DB2 provides an interface to blastall called the BLAST "wrapper". CLSD uses the DB2 BLAST wrapper to send BLAST requests to blastall and to present the BLAST results as (virtual) relational tables, each identified by its own "nickname".
In its simplest form, a SQL command that runs BLAST is built from three parts:
- [display_variables]
  - a list of variables to display in the output table.
- [BLAST_table_nickname_within_CLSD]
  - composed of a [BLAST_search_type] and a [BLAST_database] name separated by an underscore (_), where:
    - [BLAST_search_type]: (currently) one of BLASTN, BLASTP, or BLASTX
      - indicates whether to search for nucleotide or peptide sequences, whether to convert the query or database entries from nucleotide to amino acid sequences, and, if so, whether to convert using multiple frames.
    - [BLAST_database]: one of NT, NR, or SP
      - specifies which database to use. (Note that not all databases can be used with all search types.)
- [some_query_sequence]
  - a nucleotide or amino acid search (query) sequence.
Here is an example query conforming to the simple syntax:
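As a minimal sketch, the Python helper below assembles a query of this simple form. It assumes that the BLAST nicknames live in a schema called NCBI and that the query sequence is passed through an equality predicate on BlastSeq; the function names are hypothetical and the exact SQL accepted by CLSD may differ.

```python
# Search type / database pairs enabled within CLSD, per the table in this document.
SUPPORTED = {
    "BLASTN": {"NT"},
    "BLASTP": {"NR", "SP"},
    "BLASTX": {"NR", "SP"},
}


def blast_nickname(search_type, database):
    """Join search type and database with an underscore, e.g. BLASTN_NT."""
    search_type, database = search_type.upper(), database.upper()
    if database not in SUPPORTED.get(search_type, set()):
        raise ValueError(f"{search_type} cannot be used with {database}")
    return f"{search_type}_{database}"


def simple_blast_query(display_variables, search_type, database,
                       query_sequence, schema="NCBI"):
    """Build the SQL text; the schema name is an assumption."""
    # The search-control table requires a sequence of at least 15 bytes.
    if len(query_sequence) < 15:
        raise ValueError("query sequence must be at least 15 bytes long")
    # Letters only (a simplifying assumption); this also keeps quotes out of the SQL.
    if not query_sequence.isalpha():
        raise ValueError("query sequence must contain letters only")
    columns = ", ".join(display_variables)
    table = f"{schema}.{blast_nickname(search_type, database)}"
    return f"SELECT {columns} FROM {table} WHERE BlastSeq = '{query_sequence}'"


print(simple_blast_query(["Score", "E_Value", "GB_ACC_NUM"],
                         "BLASTN", "NT", "ACGTACGTACGTACGTACGT"))
```

Validating the nickname before building the query catches unsupported combinations, such as BLASTN with SP, before anything is sent to the database.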
More complex queries can be constructed with SQL commands using a more detailed syntax that also includes:
- [expressions_controlling_blastall_search_behavior]
  - any non-default settings for variables that control the BLAST search parameters sent to blastall, and
- [expressions_controlling_display_of_results]
  - comparisons with variables that control which BLAST results to return.
Here is a more complex example:
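The following Python sketch shows one way such a query could be assembled. It assumes that search settings are written as equality predicates and result filters as ordinary comparisons, all joined with AND in the WHERE clause; the helper names, the NCBI schema prefix, and the sample values are illustrative assumptions rather than the documented CLSD syntax.

```python
ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">="}


def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted, numbers as-is)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def complex_blast_query(display_variables, table, query_sequence,
                        search_settings=None, result_filters=None):
    """Build SQL text combining search settings and result filters."""
    predicates = [f"BlastSeq = {sql_literal(query_sequence)}"]
    # Non-default settings that control the search sent to blastall.
    for name, value in (search_settings or {}).items():
        predicates.append(f"{name} = {sql_literal(value)}")
    # Comparisons that control which BLAST results are returned.
    for column, operator, value in (result_filters or []):
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        predicates.append(f"{column} {operator} {sql_literal(value)}")
    columns = ", ".join(display_variables)
    return f"SELECT {columns} FROM {table} WHERE " + " AND ".join(predicates)


print(complex_blast_query(
    ["HIT_NUM", "Score", "GB_ACC_NUM"],
    "NCBI.BLASTP_SP",                       # assumed schema and nickname
    "MKVLAAGIVGLLLAQ",                      # made-up 15-residue query
    search_settings={"GapCost": 9, "ExtendedGapCost": 2},
    result_filters=[("E_Value", "<", 0.001), ("HSP_IDENTITY", ">=", 90)],
))
```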
Note that users may create their own searchable sequence databases by applying the NCBI-supplied formatdb utility to a file containing a collection of sequences in FASTA format. See the IBM documentation for details.
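As a rough sketch of that preparation step, the Python below writes sequences to a FASTA file and builds a formatdb command line for it. The helper names and file name are hypothetical; `-i` (input file) and `-p` (T for protein, F for nucleotide) are standard formatdb options, and making the resulting database visible to the BLAST wrapper is a separate step covered by the IBM documentation.

```python
from pathlib import Path


def write_fasta(path, records, width=60):
    """Write (identifier, sequence) pairs as FASTA, wrapping sequence lines."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    Path(path).write_text("\n".join(lines) + "\n")


def formatdb_command(fasta_path, protein):
    """Return the formatdb argument list; the command is built, not executed."""
    return ["formatdb", "-i", str(fasta_path), "-p", "T" if protein else "F"]


if __name__ == "__main__":
    write_fasta("my_sequences.fasta", [("seq1", "ACGTACGTACGTACGTACGT")])
    # Pass this list to subprocess.run(...) on a machine with formatdb installed.
    print(" ".join(formatdb_command("my_sequences.fasta", protein=False)))
```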
Here is a table showing which search types are supported by the DB2 BLAST wrapper. Only the first 3 types are enabled within CLSD.
| BLAST search type | Data sources | Description |
|---|---|---|
| BLASTN | NT | A nucleotide sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence. |
| BLASTP | NR, SP | An amino acid sequence is compared with the contents of an amino acid sequence database to find sequences with regions similar to regions of the original sequence. |
| BLASTX | NR, SP | A nucleotide sequence is compared with the contents of an amino acid sequence database to find sequences with regions similar to regions of the original sequence. The query sequence is translated in all six reading frames, and each of the resulting sequences is used to search the sequence database. |
| TBLASTN | NT | An amino acid sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence. The sequences in the sequence database are translated in all six reading frames, and the resulting sequences are searched for regions similar to regions of the query sequence. |
| TBLASTX | NT | A nucleotide sequence is compared with the contents of a nucleotide sequence database to find sequences with regions similar to regions of the original sequence. In a TBLASTX search, both the query sequence and the sequence database are translated in all six reading frames, and the resulting sequences are compared to discover similar regions. |
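
To illustrate what "all six reading frames" means, here is a small Python sketch that splits a nucleotide sequence into codons for the three offsets on each strand. The frame labels (+1 to +3 and -1 to -3) and function names are chosen for illustration; the final step of translating each codon to an amino acid with a genetic code table is left out.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def six_reading_frames(sequence):
    """Map frame label (+1..+3, -1..-3) to the list of complete codons."""
    frames = {}
    for sign, strand in ((1, sequence), (-1, reverse_complement(sequence))):
        for offset in range(3):
            usable = len(strand) - offset
            end = offset + usable - usable % 3   # drop a trailing partial codon
            codons = [strand[i:i + 3] for i in range(offset, end, 3)]
            frames[sign * (offset + 1)] = codons
    return frames


for frame, codons in sorted(six_reading_frames("ATGGCCTAA").items()):
    print(frame, codons)
```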
Specify values for the following variables to control how the BLAST search will be performed
| Name | Default | Description |
|---|---|---|
| BlastSeq | no default | The sequence for which to search. It must be at least 15 bytes long. |
| E_Value | 10 | Both an input and an output parameter. As an input parameter, this column indicates to the BLAST wrapper the upper limit of expect values that should be returned from blastall. |
| QueryStrands | 3 | Specifies which strands should be compared when performing a BLASTN search. A value of 1 indicates that the top strand should be used, 2 indicates the bottom strand, and 3 indicates that both strands should be compared. |
| GapAlign | 1 | Indicates to the wrapper whether gapped alignments are permitted in the BLAST output. |
| Matrix | BLOSUM62 | Determines which substitution matrix is used by blastall to determine the degree of similarity between pairings of amino acids. Only those BLAST search types that compare amino acids to amino acids use this predicate. |
| NMisMatchPenalty | 3 | Specifies the value that blastall deducts from the score of an alignment if one of the pairs of nucleotides in the similar region does not match. Only those BLAST search types that compare nucleotides to nucleotides use this predicate. |
| NMatchReward | 1 | Specifies the value that blastall adds to the score of an alignment for each of the pairs of nucleotides in the similar region that do match. Only those BLAST search types that compare nucleotides to nucleotides use this predicate. |
| FilterSequence | T | Indicates to blastall whether to perform filtering to remove biologically uninteresting segments from the query sequence. If the search type is BLASTN, the filter used is DUST. Otherwise, filtering is performed by SEG. |
| NumberOfAlignments | 250 | Specifies how many HSP alignments to include in the BLAST output. |
| GapCost | 11 | Specifies the value that blastall deducts from the score of an alignment if a gap must be introduced in either the query sequence or the hit sequence to allow the length of the alignment to grow. |
| ExtendedGapCost | 1 | Specifies the value that blastall deducts from the score of an alignment if a gap that was already introduced in either the query sequence or the hit sequence must be extended by one nucleotide or amino acid to allow the length of the alignment to grow. |
| WordSize | 11 for BLASTN; 3 for BLASTP | Indicates to blastall the length of the initial hits (words) that blastall first searches for in the database. |
| ThresholdEx | 0 | Indicates the score threshold below which BLAST does not attempt to extend a hit any further. |
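
To make the roles of NMatchReward and NMisMatchPenalty concrete, the Python sketch below scores an ungapped nucleotide alignment by adding the reward for each matching pair and deducting the penalty for each mismatching pair. The function name is hypothetical, the defaults mirror the table above, gap costs are deliberately left out, and the Score column reported by BLAST is not necessarily this raw value.

```python
def ungapped_nucleotide_score(query_segment, hit_segment,
                              match_reward=1, mismatch_penalty=3):
    """Raw score of two equal-length, ungapped nucleotide segments."""
    if len(query_segment) != len(hit_segment):
        raise ValueError("segments must have the same length")
    score = 0
    for q, h in zip(query_segment.upper(), hit_segment.upper()):
        if q == h:
            score += match_reward       # NMatchReward is added per match
        else:
            score -= mismatch_penalty   # NMisMatchPenalty is deducted per mismatch
    return score


# Seven matches and one mismatch: 7 * 1 - 1 * 3 = 4
print(ungapped_nucleotide_score("ACGTACGT", "ACGTTCGT"))
```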
Use these variables to control which BLAST results are displayed
You can compare the following variables with specific values within WHERE clauses to control which BLAST results are included in the output tables.
| Name | Data type | Description |
|---|---|---|
| Score | DOUBLE | The computed score for an HSP as reported in the BLAST results. |
| E_Value | DOUBLE | Both an input and an output parameter. As an output parameter, this column provides the computed expect value for an HSP as reported in the BLAST results. |
| Length | INTEGER | The length of the hit sequence as reported in the BLAST results. |
| HIT_NUM | INTEGER | The hit number as reported in the BLAST results, starting with 1. |
| HSP_NUM | INTEGER | The HSP number as reported in the BLAST results, starting with 1. |
| HSP_Info | VARCHAR(100) | The information string for the given HSP, as reported by BLAST. This string contains information about the number of nucleotides or amino acids that matched between the query sequence and the hit sequence. |
| HSP_ALIGNMENT_LENGTH | INTEGER | The length of the HSP alignment. |
| HSP_IDENTITY | INTEGER | The percent identity of the alignment, defined as the number of identities divided by the alignment length. |
| HSP_GAPS | INTEGER | The percent gaps in the alignment, defined as the number of gaps divided by the alignment length. |
| HSP_POSITIVE | INTEGER | The percent positives of the alignment, defined as the number of positives divided by the alignment length. |
| HSP_QUERY_FRAME | INTEGER | The reading frame of the alignment in the query sequence; only available for BLASTX, TBLASTN, and TBLASTX type servers. |
| HSP_HIT_FRAME | INTEGER | The reading frame of the alignment in the hit sequence; only available for BLASTX, TBLASTN, and TBLASTX type servers. |
| HSP_Q_Start | INTEGER | The numeric position of the first nucleotide or amino acid in the region of similarity within the query sequence. |
| HSP_Q_End | INTEGER | The numeric position of the last nucleotide or amino acid in the region of similarity within the query sequence. |
| HSP_Q_Seq | VARCHAR(32000) | The segment of the query sequence beginning at HSP_Q_Start and ending at HSP_Q_End. You can override the default data type for this column and specify CLOB, with a maximum length of 5 megabytes. |
| HSP_H_Start | INTEGER | The numeric position of the first nucleotide or amino acid in the region of similarity within the hit sequence. |
| HSP_H_End | INTEGER | The numeric position of the last nucleotide or amino acid in the region of similarity within the hit sequence. |
| HSP_H_Seq | VARCHAR(32000) | The segment of the hit sequence beginning at HSP_H_Start and ending at HSP_H_End. |
| HSP_Midline | VARCHAR(32000) | The string output by BLAST that indicates the degree of similarity between the amino acids or nucleotides at each position in the similar regions of the query and hit sequences. |
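
The percentage columns follow directly from the aligned segments. As a sketch, the Python below recomputes percent identity and percent gaps from a pair of aligned strings such as HSP_Q_Seq and HSP_H_Seq. It assumes gaps are written as `-` and that percentages are rounded to the nearest integer; the wrapper's own rounding rule is not stated here, and the function name is hypothetical.

```python
def hsp_percentages(query_aligned, hit_aligned):
    """Percent identity and percent gaps over the alignment length."""
    if len(query_aligned) != len(hit_aligned) or not query_aligned:
        raise ValueError("aligned segments must be non-empty and equal in length")
    length = len(query_aligned)
    pairs = list(zip(query_aligned, hit_aligned))
    # An identity is a column where both sequences carry the same residue.
    identities = sum(1 for q, h in pairs if q == h and q != "-")
    # A gap is a column where either sequence carries the gap character.
    gaps = sum(1 for q, h in pairs if q == "-" or h == "-")
    return {
        "alignment_length": length,
        "identity": round(100 * identities / length),
        "gaps": round(100 * gaps / length),
    }


# 10 columns: 8 identities, 1 gap, 1 mismatch
print(hsp_percentages("ACGT-ACGTA", "ACGTTACCTA"))
```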
Additional information that can be displayed
SQL statements that invoke BLAST may also produce tables containing any of the fields listed below:
| Name | Data type |
|---|---|
| GI_CONST | VARCHAR |
| GI_NUM | VARCHAR |
| DB_NAME | VARCHAR |
| GB_ACC_NUM | VARCHAR |
| GB_ACC_VER | VARCHAR |
| GB_ACC_NUM2 | VARCHAR |
The NCBI schema also provides a simple table to map GenBank accession numbers from the NR databank to GI numbers:
| Table | Field | Type | Description |
|---|---|---|---|
| GI_ACC_NR | GB_ACC_NO | VARCHAR(15) | GenBank Accession |
| GI_ACC_NR | GI_ID | INTEGER | GenInfo Identifier |
You can find more information about using BLAST within CLSD in the IBM documentation describing the BLAST wrapper or in "Efficient Access to BLAST using IBM DB2 Information Integrator".
This material was taken from relevant IBM documentation.



