Lab 4: BLAST#

Preparation#

FASTA format#

FASTA format is used to represent molecular sequences. The first line begins with > and describes the sequence, while the following lines are the sequence.

Example DNA sequence in FASTA format

>gi|23423|ref|NM_23542.0| Homo sapiens protein
ATGAATCGATACGATAGCTAGCTATCGATGCA
GATCAGAGAGGGGCTTTAGCTAGCTAAGCTAG

Example protein sequence in FASTA format

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Blast GenBank online#

The query sequence in FASTA format is given below

>Sequence3a

TAACCTACGGGTGGCCGCAGTGGGGAATATTGCACAATGGACACAAGTCTGATGCAGCGACGC
CGCGTGGGGGATGAAGGCTTTCGGGTTGTAAACTCCTTTCAGTACAGAAGAAGCATTTTTGTG
ACGGTATGTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCG
CGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGCCTGCTGTG

Navigate to the main BLAST page (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Select the appropriate type of BLAST for your sequence. Paste the query sequence into the box. Click the “BLAST” button and wait for the results.

Once the results are displayed, notice there are three main headings: Graphic Summary, Descriptions, and Alignments.

Blast remotely against GenBank#

Save the sequence3a in a file called lab4_test.fa or download the file lab4_test.fa. We may run the following command line to blast remotely against GenBank. Please make sure to specify the full path of blastn.

blastn -query lab4_test.fa -out lab4_result.txt -remote -db nt

We may get the IDs of the first 100 significant sequences in R.

data=scan("https://book.phylolab.net/binf8441/data/lab4_result.txt",what="character",sep="\n")
index = grep("Sequences producing",data)
ngeneid = 100
geneid = rep("",ngeneid)
for(i in 1:ngeneid){
    x=data[index+i]
    geneid[i] = unlist(strsplit(x,split=" "))[1]
}
print("The geneid of the first 100 significant sequences include")
geneid
[1] "The geneid of the first 100 significant sequences include"
  1. 'MT269339.1'
  2. 'MT269338.1'
  3. 'LC523909.1'
  4. 'LC523908.1'
  5. 'LC523907.1'
  6. 'LC523906.1'
  7. 'LC523905.1'
  8. 'LC523904.1'
  9. 'LC523903.1'
  10. 'LC523902.1'
  11. 'MN946520.1'
  12. 'MN907639.1'
  13. 'MN712476.1'
  14. 'LC500013.1'
  15. 'LC500007.1'
  16. 'LC500006.1'
  17. 'LC500005.1'
  18. 'LC500004.1'
  19. 'MN135984.1'
  20. 'MN027909.1'
  21. 'CP033904.1'
  22. 'CP033902.1'
  23. 'CP029004.1'
  24. 'CP029001.1'
  25. 'CP028833.1'
  26. 'MG461533.1'
  27. 'KX592208.1'
  28. 'KX592205.1'
  29. 'KX592204.1'
  30. 'KX592202.1'
  31. 'KX592201.1'
  32. 'KX462008.1'
  33. 'PP125831.1'
  34. 'PP125829.1'
  35. 'KX815984.1'
  36. 'KU738726.1'
  37. 'KT191139.1'
  38. 'KT191138.1'
  39. 'KT191137.1'
  40. 'KT191136.1'
  41. 'CP123393.1'
  42. 'CP123403.1'
  43. 'CP012649.1'
  44. 'OQ363300.1'
  45. 'OQ363298.1'
  46. 'OQ205177.1'
  47. 'OQ179914.1'
  48. 'KP872907.1'
  49. 'KP159746.1'
  50. 'KP159745.1'
  51. 'CP007003.1'
  52. 'CP096280.1'
  53. 'CP096279.1'
  54. 'CP081508.1'
  55. 'CP097247.1'
  56. 'KM096430.1'
  57. 'ON725016.1'
  58. 'KJ930040.1'
  59. 'KJ842125.1'
  60. 'CP007519.1'
  61. 'KF828869.1'
  62. 'HG530069.1'
  63. 'JX975440.1'
  64. 'JX975434.1'
  65. 'JQ975939.1'
  66. 'JQ975941.1'
  67. 'JQ975938.1'
  68. 'JQ975918.1'
  69. 'JQ975934.1'
  70. 'JQ975932.1'
  71. 'JQ975931.1'
  72. 'JQ975929.1'
  73. 'JQ975919.1'
  74. 'JQ975917.1'
  75. 'JQ975916.1'
  76. 'JQ975914.1'
  77. 'JQ975913.1'
  78. 'JQ975912.1'
  79. 'JQ975910.1'
  80. 'JQ975908.1'
  81. 'JQ975906.1'
  82. 'JQ726530.1'
  83. 'JQ726517.1'
  84. 'JQ726516.1'
  85. 'JQ726507.1'
  86. 'JQ726505.1'
  87. 'JQ404452.1'
  88. 'JN857612.1'
  89. 'JN579657.1'
  90. 'JN578141.1'
  91. 'JN578140.1'
  92. 'JN578139.1'
  93. 'JN578137.1'
  94. 'JN578136.1'
  95. 'JN578135.1'
  96. 'JN578134.1'
  97. 'JN578127.1'
  98. 'JN578126.1'
  99. 'JN578116.1'
  100. 'JN578114.1'

We can retrieve the DNA sequences using the gene IDs we received above.

library(ape)
ngeneid = 2 #to save time we only retrieve the first 2 sequences
seq=rep("",ngeneid)
names(seq)=geneid[1:ngeneid]
for(i in 1:ngeneid){
    seq[i] = read.GenBank(geneid[i])
    Sys.sleep(5) #if too many request, the server will identify it as a spam. here we wait 5 seconds after each request
}
print(paste("The first sequence is", seq[1]))
[1] "The first sequence is as.raw(c(0x48, 0x28, 0x28, 0x48, 0x28, 0x88, 0x18, 0x48, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x48, 0x18, 0x18, 0x48, 0x48, 0x88, 0x88, 0x88, 0x48, 0x88, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x48, 0x48, 0x88, 0x18, 0x48, 0x48, 0x48, 0x48, 0x88, 0x18, 0x48, 0x48, 0x48, 0x28, 0x18, 0x28, 0x88, 0x28, 0x48, 0x48, 0x28, 0x28, 0x18, 0x88, 0x18, 0x28, 0x88, 0x48, 0x28, 0x18, 0x18, 0x48, 0x18, 0x18, 0x48, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x18, 0x48, 0x88, 0x18, 0x48, 0x48, 0x28, 0x28, 0x18, 0x88, 0x28, \n0x28, 0x88, 0x88, 0x48, 0x48, 0x28, 0x48, 0x18, 0x28, 0x48, 0x88, 0x28, 0x48, 0x48, 0x48, 0x18, 0x88, 0x48, 0x28, 0x28, 0x48, 0x48, 0x28, 0x28, 0x18, 0x48, 0x88, 0x48, 0x88, 0x48, 0x48, 0x48, 0x18, 0x48, 0x88, 0x28, 0x28, 0x48, 0x48, 0x28, 0x28, 0x88, 0x28, 0x88, 0x18, 0x18, 0x48, 0x48, 0x48, 0x88, 0x28, 0x18, 0x48, 0x88, 0x48, 0x88, 0x18, 0x88, 0x28, 0x48, 0x48, 0x28, 0x28, 0x28, 0x88, 0x48, 0x88, 0x28, 0x18, 0x28, 0x28, 0x18, 0x88, 0x28, 0x48, 0x48, 0x48, 0x88, 0x48, 0x48, 0x28, 0x88, 0x48, 0x28, \n0x88, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x88, 0x88, 0x18, 0x88, 0x18, 0x18, 0x48, 0x28, 0x88, 0x28, 0x88, 0x88, 0x18, 0x48, 0x48, 0x88, 0x28, 0x48, 0x28, 0x88, 0x88, 0x48, 0x18, 0x28, 0x18, 0x48, 0x88, 0x18, 0x48, 0x28, 0x88, 0x48, 0x28, 0x48, 0x88, 0x28, 0x48, 0x28, 0x28, 0x48, 0x28, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x48, 0x88, 0x18, 0x48, 0x88, 0x88, 0x48, 0x48, 0x28, 0x18, 0x18, 0x18, 0x28, 0x48, 0x48, 0x48, 0x18, 0x18, 0x48, 0x18, 0x88, 0x88, 0x88, 0x28, 0x18, 0x28, 0x28, 0x18, 0x18, 0x18, \n0x28, 0x88, 0x48, 0x18, 0x88, 0x28, 0x88, 0x48, 0x88, 0x88, 0x48, 0x88, 0x88, 0x48, 0x28, 0x88, 0x18, 0x18, 0x18, 0x18, 0x18, 0x48, 0x18, 0x48, 0x88, 0x28, 0x48, 0x48, 0x18, 0x88, 0x18, 0x48, 0x18, 0x48, 0x28, 0x88, 0x48, 0x88, 0x88, 0x48, 0x88, 0x88, 0x48, 0x28, 0x48, 0x28, 0x28, 0x48, 0x48, 0x28, 0x18, 0x88, 0x88, 0x28, 0x18, 0x88, 0x28, 0x48, 0x18, 0x48, 0x28, 0x28, 0x88, 0x48, 0x28, 0x88, 0x48, 0x28, 0x28, 0x48, 0x28, 0x48, 0x48, 0x18, 0x88, 0x88, 0x18, 0x88, 0x28, 0x48, 0x18, 0x88, 0x48, 0x48, \n0x48, 0x28, 0x48, 0x28, 0x48, 0x88, 0x48, 0x28, 0x48, 0x18, 0x18, 0x48, 0x18, 0x28, 0x28, 0x48, 0x48, 0x88, 0x88, 0x18, 0x18, 0x88, 0x18, 0x18, 0x48, 0x48, 0x48, 0x28, 0x48, 0x18, 0x88, 0x88, 0x88, 0x48, 0x88, 0x48, 0x28, 0x18, 0x28, 0x48, 0x18, 0x88, 0x48, 0x48, 0x28, 0x48, 0x48, 0x18, 0x18, 0x18, 0x48, 0x18, 0x18, 0x48, 0x28, 0x48, 0x28, 0x28, 0x18, 0x48, 0x28, 0x18, 0x48, 0x18, 0x48, 0x88, 0x88, 0x88, 0x48, 0x88, 0x28, 0x28, 0x48, 0x48, 0x48, 0x48, 0x28, 0x18, 0x18, 0x88, 0x88, 0x28, 0x18, 0x18, \n0x28, 0x48, 0x48, 0x48, 0x48, 0x18, 0x18, 0x48, 0x28, 0x88, 0x48, 0x18, 0x48, 0x48, 0x48, 0x18, 0x88, 0x28, 0x48, 0x48, 0x48, 0x28, 0x88, 0x48, 0x88, 0x28, 0x18, 0x88, 0x48, 0x88, 0x48, 0x18, 0x48, 0x18, 0x48, 0x48, 0x18, 0x88, 0x48, 0x48, 0x48, 0x48, 0x18, 0x88, 0x88, 0x18, 0x18, 0x48, 0x48, 0x88, 0x88, 0x18, 0x18, 0x28, 0x28, 0x18, 0x48, 0x48, 0x18, 0x48, 0x18, 0x88, 0x48, 0x28, 0x48, 0x48, 0x18, 0x48, 0x48, 0x88, 0x88, 0x18, 0x48, 0x28, 0x48, 0x28, 0x88, 0x48, 0x88, 0x18, 0x88, 0x18, 0x28, 0x88, \n0x48, 0x48, 0x88, 0x48, 0x48, 0x88, 0x88, 0x28, 0x88, 0x28, 0x28, 0x48, 0x88, 0x18, 0x48, 0x48, 0x28, 0x48, 0x88, 0x88, 0x48, 0x48, 0x28, 0x88, 0x48, 0x48, 0x18, 0x18, 0x88, 0x28, 0x18, 0x48, 0x48, 0x48, 0x28, 0x28, 0x88, 0x18, 0x18, 0x88, 0x28, 0x18, 0x48, 0x88, 0x28, 0x48, 0x28, 0x18, 0x48, 0x88, 0x48, 0x48, 0x88, 0x48, 0x28, 0x48, 0x88, 0x88, 0x88, 0x48, 0x28, 0x48, 0x18, 0x48, 0x48, 0x48, 0x18, 0x88, 0x48, 0x28, 0x48, 0x88, 0x88, 0x28, 0x88, 0x48, 0x48, 0x88, 0x18, 0x18, 0x88, 0x48, 0x88, 0x18, \n0x88, 0x28, 0x28, 0x28, 0x18, 0x48, 0x48, 0x18, 0x88, 0x48, 0x18, 0x28, 0x28, 0x88, 0x28, 0x48, 0x28, 0x18, 0x48, 0x18, 0x88, 0x88, 0x88, 0x28, 0x48, 0x18, 0x18, 0x48, 0x48, 0x48, 0x28, 0x88, 0x28, 0x18, 0x88, 0x48, 0x48, 0x18, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x28, 0x28, 0x18, 0x18, 0x18, 0x18, 0x28, 0x28, 0x88, 0x28, 0x48, 0x48, 0x48, 0x18, 0x18, 0x28, 0x18, 0x48, 0x28, 0x48, 0x28, 0x28, 0x48, 0x18, 0x88, 0x48, 0x28, 0x18, 0x88, 0x88, 0x28, 0x48, 0x28, 0x18, 0x18, 0x18, 0x88, 0x88, 0x48, 0x18, \n0x48, 0x28, 0x28, 0x28, 0x28, 0x48, 0x28, 0x28, 0x18, 0x48, 0x48, 0x48, 0x48, 0x88, 0x48, 0x18, 0x88, 0x28, 0x48, 0x48, 0x28, 0x28, 0x48, 0x28, 0x88, 0x88, 0x48, 0x48, 0x18, 0x18, 0x88, 0x88, 0x88, 0x88, 0x28, 0x18, 0x28, 0x88, 0x88, 0x88, 0x48, 0x48, 0x88, 0x88, 0x18, 0x18, 0x48, 0x88, 0x28, 0x48, 0x48, 0x48, 0x48, 0x48, 0x28, 0x28, 0x28, 0x48, 0x28, 0x88, 0x28, 0x88, 0x88, 0x48, 0x28, 0x48, 0x48, 0x28, 0x48, 0x48, 0x88, 0x48, 0x28, 0x88, 0x18, 0x48, 0x28, 0x48, 0x48, 0x88, 0x18, 0x18, 0x88, 0x88, \n0x18, 0x18, 0x28, 0x48, 0x88, 0x18, 0x48, 0x28, 0x88, 0x88, 0x28, 0x48, 0x28, 0x48, 0x88, 0x88, 0x48, 0x88, 0x88, 0x28, 0x28, 0x18, 0x18, 0x88, 0x28, 0x28, 0x88, 0x88, 0x48, 0x48, 0x28, 0x18, 0x18, 0x48, 0x88, 0x28, 0x88, 0x18, 0x88, 0x28, 0x88, 0x28, 0x18, 0x48, 0x28, 0x48, 0x88, 0x18, 0x48, 0x18, 0x48, 0x28, 0x28, 0x88, 0x48, 0x88, 0x48, 0x88, 0x18, 0x48, 0x48, 0x18, 0x48, 0x28, 0x88, 0x48, 0x28, 0x28, 0x18, 0x18, 0x28, 0x48, 0x48, 0x48, 0x48, 0x18, 0x48, 0x48, 0x18, 0x48, 0x18, 0x88, 0x28, 0x88, \n0x48, 0x48, 0x18, 0x48, 0x48, 0x18, 0x48, 0x28, 0x88, 0x18, 0x48, 0x48, 0x18, 0x18, 0x48, 0x18, 0x28, 0x48, 0x18, 0x28, 0x88, 0x48, 0x28, 0x18, 0x28, 0x48, 0x18, 0x48, 0x18, 0x28, 0x48, 0x18, 0x48, 0x88, 0x48, 0x88, 0x18, 0x48, 0x18, 0x18, 0x48, 0x48, 0x48, 0x18, 0x18, 0x88, 0x88, 0x48, 0x18, 0x28, 0x28, 0x28, 0x48, 0x28, 0x88, 0x88, 0x28, 0x48, 0x88, 0x48, 0x28, 0x48, 0x28, 0x88, 0x88, 0x28, 0x28, 0x28, 0x18, 0x18, 0x48, 0x18, 0x28, 0x28, 0x18, 0x48, 0x18, 0x48, 0x18, 0x18, 0x48, 0x28, 0x28, 0x88, \n0x48, 0x28, 0x88, 0x18, 0x48, 0x18, 0x18, 0x48, 0x18, 0x48, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x88, 0x28, 0x18, 0x28, 0x88, 0x28, 0x48, 0x48, 0x48, 0x88, 0x48, 0x88, 0x28, 0x18, 0x48, 0x28, 0x28, 0x48, 0x48, 0x48, 0x48, 0x18, 0x18, 0x88, 0x88, 0x28, 0x18, 0x28, 0x48, 0x48, 0x88, 0x48, 0x48, 0x88, 0x88, 0x48, 0x48, 0x18, 0x48, 0x48, 0x48, 0x48, 0x88, 0x18, 0x48, 0x88, 0x28, 0x48, 0x18, 0x28, 0x88, 0x88, 0x88, 0x18, 0x28, 0x88, 0x18, 0x28, 0x88, 0x18, 0x48, 0x28, 0x28, 0x28, 0x28, 0x18, 0x18, 0x88, \n0x18, 0x48, 0x18, 0x28, 0x18, 0x18, 0x48, 0x48, 0x48, 0x28, 0x18, 0x18, 0x28, 0x88, 0x28, 0x48, 0x28, 0x88, 0x18, 0x48, 0x28, 0x18, 0x88, 0x28, 0x88, 0x88, 0x18, 0x48, 0x48, 0x28, 0x28, 0x48, 0x48, 0x18, 0x88, 0x28, 0x88, 0x48, 0x88, 0x48, 0x48, 0x48, 0x18, 0x18, 0x48, 0x28, 0x48, 0x88, 0x48, 0x28, 0x28, 0x18, 0x48, 0x18, 0x48, 0x88, 0x48, 0x48, 0x48, 0x18, 0x48, 0x88, 0x48, 0x28, 0x18, 0x88, 0x88, 0x18, 0x28, 0x28, 0x28, 0x18, 0x18, 0x88, 0x88, 0x88, 0x48, 0x28, 0x18, 0x48, 0x48, 0x18, 0x28, 0x18, \n0x28, 0x88, 0x48, 0x18, 0x18, 0x28, 0x48, 0x48, 0x88, 0x18, 0x18, 0x48, 0x48, 0x48, 0x48, 0x18, 0x28, 0x18, 0x48, 0x28, 0x88, 0x88, 0x28, 0x18, 0x28, 0x48, 0x88, 0x28, 0x28, 0x28, 0x28, 0x88, 0x18, 0x48, 0x88, 0x88, 0x48, 0x18, 0x28, 0x48, 0x48, 0x88, 0x48, 0x18, 0x28, 0x48, 0x28, 0x18, 0x88, 0x48, 0x18, 0x88, 0x88, 0x18, 0x28, 0x48, 0x28, 0x88, 0x48, 0x88, 0x18, 0x28, 0x88, 0x48, 0x28, 0x88, 0x88, 0x28, 0x48, 0x28, 0x18, 0x48, 0x28, 0x48, 0x48, 0x18, 0x48, 0x88, 0x88, 0x18, 0x88, 0x28, 0x48, 0x18, \n0x18, 0x28, 0x18, 0x28, 0x48, 0x48, 0x48, 0x28, 0x28, 0x18, 0x18, 0x48, 0x18, 0x88, 0x28, 0x88, 0x28, 0x88, 0x28, 0x28, 0x48, 0x28, 0x28, 0x28, 0x48, 0x18, 0x28, 0x88, 0x28, 0x48, 0x18, 0x28, 0x88, 0x28, 0x48, 0x88, 0x88, 0x88, 0x48, 0x18, 0x18, 0x48, 0x48))"

Save the sequences in a file

write.dna(seq,"seq.fas",format="fasta")

Generate a local blast database#

The Blast package includes a program makeblastdb. We first download and unzip the zebra finch chromosome 1 Taeniopygia_guttata.bTaeGut1_v1.p.dna.primary_assembly.1.fa.gz. A local blast database is constructed by typing the following command line in Terminal (Mac) or the command window (type CMD in Windows). Please make sure to specify the full path of makeblastdb.

~/Bin/ncbi-blast-2.10.0+/bin/makeblastdb -in Taeniopygia_guttata.bTaeGut1_v1.p.dna.primary_assembly.1.fa -dbtype nucl -out Taeniopygia_guttata

Note that we may run the third party programs such as blast in R using the R function system().

 try(system(paste("c://Users/liuliang_2/bin/blast-2.10.0+/bin/makeblastdb.exe -in Taeniopygia_guttata.bTaeGut1_v1.p.dna.primary_assembly.1.fa -dbtype nucl -out Taeniopygia_guttata")))
Warning message in system(paste("c://Users/liuliang_2/bin/blast-2.10.0+/bin/makeblastdb.exe -in Taeniopygia_guttata.bTaeGut1_v1.p.dna.primary_assembly.1.fa -dbtype nucl -out Taeniopygia_guttata")):
“error in running command”

Blast against the local database#

Download the input file lab4_query_local.txt and run the following command line in Terminal (Mac) or the command window (type CMD in Windows).

blastn -query lab4_query_local.txt -db Taeniopygia_guttata -out resultfile -max_target_seqs 5

Exercise#

For each query sequence in the file lab4_query_local, write R code to blast remotely against GenBank and save the sequences of the first 50 significant alignments to a file. Since there are 10 query sequences in the file lab4_query_local, your R code should generate 10 output files. Explain the sequences in each output file.