brett at BORCIM.WUSTL.EDU writes:
>Hello. A while back someone asked about how to interpret results of sequence
>database searches and was pointed to a recent article by the staff at NCBI
>(Nature Genetics v6 p119). I read this article and it has helped a little.
>Would someone care to explain to a biologist how to interpret these numbers!
>I have found some interesting similarities, but would like to know just how
>interesting they are, in terms of statistical significance. I am not a
>mathematician and do not have the inclination to review the entire literature
>on this subject. As an example, how do I read this:
> Score = 59 Length = 36 Expect = 1.4e+01 Sum-Stat P(2) = 2.0e-27
>?? I have read on how scores are summed. Length is intuitive. How about
>"expect" and the P value????
>Other examples include alignments with P values much closer to 1. Does this
>mean they should be considered irrelevant?
Yes, P values close to 1 indicate a lack of statistical significance.
Biologically significant relationships can turn up in alignments that lack
statistical significance, but it is definitely skating on thin ice to rely on
weak alignments alone. Another excellent paper on this subject is
Henikoff, Steve. 1991. Playing with blocks: some pitfalls of forcing
multiple alignments. The New Biologist 3:1148-1154.
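
On the "expect" versus P question, here is a rough sketch of how the two
numbers relate for a single matching segment. It is not taken from the BLAST
source, and the function name is made up for illustration. It assumes the
usual Poisson approximation: if E is the number of chance matches expected
at this score or better, the probability of seeing at least one is
P = 1 - exp(-E). As far as I understand it, the Sum-Stat P(N) figure
combines N segments, so it cannot be recovered from the Expect of any one
segment this way.

```python
import math


def expect_to_p(expect):
    """Probability of at least one chance match, given the expected count.

    Assumes chance matches follow a Poisson distribution with mean
    `expect`, so P = 1 - exp(-expect). Hypothetical helper, not BLAST code.
    """
    if expect < 0:
        raise ValueError("expect must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expect)


if __name__ == "__main__":
    for e in (1.4e+01, 1.0, 1e-3):
        print(f"Expect = {e:g}  ->  P = {expect_to_p(e):.6g}")
```

When Expect is well below 1, P and Expect are nearly the same number. An
Expect of 14, as in the example line above, gives a single-segment P that is
indistinguishable from 1 for practical purposes.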
>help! As an aside, I am a graduate
>student and am amazed at how often articles show some alignment and no
>statistical argument. My fellow students see this alignment and take it as
>proof that the sequences must be related. In other words, we need to be
>educated as to how these results can be interpreted! Like I said, if this
>can be explained without delving into the minutiae of the algorithms, I think
I've been working off-and-on on a WWW document that goes through all these
things. It's unfinished and will improve in spurts, but
you're welcome to take a look at it at:
On another note, a constant annoyance for me is the fact that many
authors do not sufficiently document the methodology used. For example,
target databases are not specified explicitly ("no matches were found
by database searches") or versions are not stated ("no matches were found
in GenBank"). For BLAST & FASTA searches, the substitution matrix
used is not specified (this was a glaring flaw in a comparison of
search methods posted recently). People don't publish Southerns or
in situs without specifying the conditions; why are computational methods
>it will help a lot of people to utilize these methods to their fullest,
>without overinterpreting the results. Thanks in advance...
Department of Cellular and Developmental Biology
Department of Genetics / HHMI
krobison at nucleus.harvard.edu