In article <852567C3.00695290.00 at 7crmta_md.ms.bd.com>,
Bill_A_Nussbaumer at ms.bd.com writes:
> I hope I'm not hi-jacking this post, but I'm somewhat unfamiliar with the topic
> and have been following along. Could someone explain to me what exactly defines
> a "low complexity" protein sequence. Is it just the short length or the
> repetitious nature of the amino acids contained?
According to http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#LCR,
whose English is much more understandable:
<<Q: What is low-complexity sequence?
Regions with low-complexity sequence have an unusual composition and
this can create problems in sequence similarity searching
(Wootton & Federhen, 1996). Low-complexity sequence can often be
recognized by visual inspection. For example, the protein sequence
PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence
AAATAAAAAAAATAAAAAAT. Filters are
used to remove low-complexity sequence because it can cause artifactual hits
(please see Q: After running a search why do I see a string
of "X"s (or "N"s) in my query sequence that I did not put there?).
In BLAST searches performed without a filter, often certain hits will be
reported with high scores only because of the presence of a
low-complexity region. Most often, this type of match cannot be thought
of as the result of homology shared by the sequences. Rather, it is
as if the low-complexity region is "sticky" and is pulling out many
sequences that are not truly related. >>
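To make the "unusual composition" idea concrete, here is a minimal sketch that scores the two example sequences from the FAQ by the Shannon entropy of their residue composition. This is only an illustration of the concept, not the measure that SEG or DUST actually use; the function name shannon_entropy is my own.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy, in bits per residue, of the composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# The two example sequences quoted from the BLAST FAQ.
protein = "PPCDPPPPPKDKKKKDDGPP"
dna = "AAATAAAAAAAATAAAAAAT"

# Upper bounds: log2(20) bits for proteins, log2(4) = 2 bits for nucleotides.
print("protein:", round(shannon_entropy(protein), 2),
      "bits, maximum", round(math.log2(20), 2))
print("dna:    ", round(shannon_entropy(dna), 2),
      "bits, maximum", round(math.log2(4), 2))
```

Both sequences score far below the maximum for their alphabet, which is what the eye picks up as "repetitious". Note that length alone does not enter into it: a short sequence with a varied composition still scores high.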
Many more details are in Methods Enzymol 1996;266:554-71, I think.
The filters used by BLAST are SEG/XNU or DUST. You can probably find the
corresponding documentation on the NCBI site. SEG is for proteins
and DUST for nucleotides. See the "Filter" section of :
Anyway, this is one more heuristic added to BLAST and, as Andrew pointed
out, it can sometimes be inappropriate. Such sequences may be
statistically unusable, or should be analyzed by other means.
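For what it is worth, the masking that produces those runs of "X" or "N" can be imitated with a toy sliding-window filter. The sketch below is NOT the SEG or DUST algorithm; the window size, the entropy threshold, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy, in bits per residue, of one window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every residue covered by a window whose entropy is below threshold.

    window and threshold are illustrative values, not the defaults of SEG
    or DUST. Sequences shorter than window are returned unchanged.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)

# Hypothetical query: a poly-K run between two more varied flanks.
query = "MKTAYIAKQRQISFVKSHFSRQ" + "KKKKKKKKKKKK" + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_complexity(query))

# For nucleotides the maximum entropy is 2 bits, so use a lower threshold.
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT", threshold=1.0, mask_char="N"))
```

The masked query is what would then be passed to the search, so that the "sticky" region can no longer seed hits. A fixed threshold like this is crude, which is one reason such filtering can be inappropriate for some sequences.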