I came across an odd issue where BLAST was omitting hits I expected to see, and it turned out to be caused by BLAST's low-complexity filter, which is on by default. So now we know, and knowing is half the battle.

"BLAST filters (-F Flag) filter the query sequence for
low-complexity sequences. The default is ON (T). Complexity
filtering is generally a good idea, but it may break long HSPs
(high-scoring segment pairs) into several smaller HSPs due to
low-complexity segments. This can cause some alignments to fall
below the significance threshold and be lost. To prevent this,
either turn off filtering (not recommended) or use soft masking (-F
"m S"), in which the filter is used only in the word seeding phase but
not the extension phase."
    --- BLAST: Essential guide to the Basic Local Alignment Search Tool.

After trying out different combinations, I will show you this in action with a worked example. We have two EspD genes (part of the type III secretion system), one from E. coli Sakai [GenBank: BA000007] (virEspD), and another from Citrobacter rodentium ICC168 [GenBank: FN543502] (citroEspD). Here are some BLAST results.

Filter off, citro as input. Match on virEspD from base pair 287 - 746.

wlan-n114-43:~ nabil$ blastall -i citroEspD -d virEspD -pblastn -FF -m8
gi|282947233:3149926-3151068 virEspD 85.00 460 69 0 380 839 746 287 1e-104 365

Filter on, citro as input. Match on virEspD from base pair 287 - 446 and 571 - 746. The match is fragmented.

wlan-n114-43:~ nabil$ blastall -i citroEspD -d virEspD -pblastn -FT -m8
gi|282947233:3149926-3151068 virEspD 84.66 176 27 0 380 555 746 571 2e-35 135
gi|282947233:3149926-3151068 virEspD 84.38 160 25 0 680 839 446 287 1e-30 119
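
The fragmentation can also be checked programmatically. The following is a minimal Python sketch, not part of the original analysis: it parses the tabular (-m8) lines shown above and reports which stretch of the subject is left uncovered between hits. The function names `parse_m8` and `subject_gaps` are made up for illustration, and the sketch assumes all lines describe the same query/subject pair.

```python
# Column order of BLAST tabular (-m8) output.
FIELDS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score"]

def parse_m8(text):
    """Turn -m8 lines into dicts; coordinates become ints (hypothetical helper)."""
    hits = []
    for line in text.strip().splitlines():
        cols = line.split()
        if len(cols) != len(FIELDS):
            continue  # skip anything that is not a 12-column hit line
        hit = dict(zip(FIELDS, cols))
        for key in ("q_start", "q_end", "s_start", "s_end"):
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits

def subject_gaps(hits):
    """Return (start, end) stretches of the subject lying between hits.

    Assumes every hit is against the same subject sequence.
    Minus-strand hits (s_start > s_end) are normalised first.
    """
    spans = sorted((min(h["s_start"], h["s_end"]),
                    max(h["s_start"], h["s_end"])) for h in hits)
    if not spans:
        return []
    gaps = []
    covered_to = spans[0][1]
    for start, end in spans[1:]:
        if start > covered_to + 1:
            gaps.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    return gaps

FILTER_OFF = """
gi|282947233:3149926-3151068 virEspD 85.00 460 69 0 380 839 746 287 1e-104 365
"""
FILTER_ON = """
gi|282947233:3149926-3151068 virEspD 84.66 176 27 0 380 555 746 571 2e-35 135
gi|282947233:3149926-3151068 virEspD 84.38 160 25 0 680 839 446 287 1e-30 119
"""

print(subject_gaps(parse_m8(FILTER_OFF)))  # []
print(subject_gaps(parse_m8(FILTER_ON)))   # [(447, 570)]
```

With the filter on, bases 447 - 570 of virEspD drop out of the alignment, which corresponds to the masked stretch 556 - 679 of the citroEspD query.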

Filter on, Sakai as input. Match from base pair 287 - 746.

wlan-n114-43:~ nabil$ blastall -i virEspD -d citroEspD -pblastn -FT -m8
virEspD gi|282947233:3149926-3151068 85.00 460 69 0 287 746 839 380 1e-104 365
virEspD gi|282947233:3149926-3151068 78.05 164 36 0 962 1125 164 1 1e-06 40.1
virEspD gi|282947233:3149926-3151068 100.00 12 0 0 562 573 452 463 0.062 24.3

Filter off, Sakai as input. Match from base pair 287 - 746.

wlan-n114-43:~ nabil$ blastall -i virEspD -d citroEspD -pblastn -FF -m8
virEspD gi|282947233:3149926-3151068 85.00 460 69 0 287 746 839 380 1e-104 365
virEspD gi|282947233:3149926-3151068 78.05 164 36 0 962 1125 164 1 1e-06 40.1
virEspD gi|282947233:3149926-3151068 100.00 12 0 0 562 573 452 463 0.062 24.3

OK wait, with Sakai EspD as the query we don't get the same problem as with Citrobacter. Why? With Sakai as the query the filter setting makes no difference to the results, but with Citrobacter as the query it does. Let's BLAST the sequences against themselves.

Filter off, citro vs citro. Match from base pair 1 - 1143.

wlan-n114-43:~ nabil$ blastall -i citroEspD -d citroEspD -pblastn -FF -m8
gi|282947233:3149926-3151068 gi|282947233:3149926-3151068 100.00 1143 0 0 1 1143 1 1143 0.0 2266

Filter on, citro vs citro. Match from base pair 1 - 614 and 670 - 1143.

wlan-n114-43:~ nabil$ blastall -i citroEspD -d citroEspD -pblastn -FT -m8
gi|282947233:3149926-3151068 gi|282947233:3149926-3151068 100.00 614 0 0 1 614 1 614 0.0 1176
gi|282947233:3149926-3151068 gi|282947233:3149926-3151068 100.00 474 0 0 670 1143 670 1143 0.0 940

This is the EXACT same sequence as both query and database, yet we get a gap (bases 615 - 669). Why? Because that region of the query is deemed low-complexity/repetitive and is masked out BEFORE any alignment is done. We do not see this problem in the Sakai gene, even with the filter on:

wlan-n114-43:~ nabil$ blastall -i virEspD -d virEspD -pblastn -FT -m8
virEspD virEspD 100.00 1125 0 0 1 1125 1 1125 0.0 2230
virEspD virEspD 100.00 12 0 0 829 840 840 829 0.061 24.3
virEspD virEspD 100.00 12 0 0 836 847 847 836 0.061 24.3
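
To make "deemed low-complexity" a bit more concrete, here is a rough Python sketch of the general idea. It is an illustration only: as far as I know BLAST uses DUST for nucleotide queries and SEG for protein queries, and this is neither. It simply scores sliding windows by the Shannon entropy of their base composition and lower-cases anything below a cutoff, mimicking the lower-case convention used for soft masking. The window size, the threshold, and the toy sequence are all assumptions chosen for the example, not values taken from BLAST or from the EspD genes.

```python
import math
from collections import Counter

def window_entropy(seq, window=12):
    """Shannon entropy (bits) of base composition for each window start."""
    scores = []
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i + window])
        h = -sum((c / window) * math.log2(c / window)
                 for c in counts.values())
        scores.append(h)
    return scores

def soft_mask(seq, window=12, threshold=1.0):
    """Lower-case every base inside a window whose entropy is below threshold.

    A crude stand-in for a real low-complexity filter (NOT DUST or SEG).
    """
    seq = seq.upper()
    masked = list(seq)
    for i, h in enumerate(window_entropy(seq, window)):
        if h < threshold:
            for j in range(i, i + window):
                masked[j] = masked[j].lower()
    return "".join(masked)

# Hypothetical toy sequence: mixed flanks around a poly-A run.
toy = "ATGCGTACGTTAGC" + "A" * 20 + "GCTAGCATCGATGC"
print(soft_mask(toy))
```

The poly-A run (and a few bases either side of it) comes out in lower case, while a sequence with a mixed composition is left untouched. A real filter is more sophisticated, but the effect on the query is the same: the masked stretch cannot take part in the alignment, so a long HSP spanning it gets split.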

So if filtering makes such a difference, why does BLAST use it by default? The issue comes back to NCBI's large databases: searching with a low-complexity sequence against a very large database, say nr/nt or the human genome, would return a number of false positives by random chance. BLAST by default (and online) filters out these sequences, which is normally a good thing.
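
The database-size part of this argument can be put into rough numbers. The sketch below uses the standard Karlin-Altschul relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the database length. It ignores the effective-length corrections BLAST applies, and the 10-gigabase database size is an assumption picked purely for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected chance hits from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths, no effective-length correction.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# The 12-base, 24.3-bit hit from the runs above:
# virEspD (1125 bp) searched against citroEspD (1143 bp).
small = expected_hits(24.3, 1125, 1143)

# The same hit against a hypothetical 10-gigabase database.
large = expected_hits(24.3, 1125, 10_000_000_000)

print(f"{small:.3f}")  # 0.062
print(f"{large:.0f}")
```

For the two-gene search this gives about 0.062, in line with the E-value BLAST reported for that 12-base hit. Against the hypothetical 10-gigabase database the same short match would be expected hundreds of thousands of times by chance. Low-complexity stretches make this worse than the formula suggests, because they match other low-complexity stretches far more often than the random-sequence model assumes.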

