In article <1993Mar2.120540.1382 at gserv1.dl.ac.uk> risler at cgmvax.cgm.cnrs-gif.fr writes:
> Hence I've tried to read the original papers about BLAST and, in particular,
> I've tried to understand how they compute the probability P(N) associated
> with a given score. ... In any
> case, I thought that P(N) was computed from the figures obtained by a very
> large number of simulations. If this was true, then this probability should
> be the same for the same hit whatever the databank used.
The P value is calculated analytically; it is not based on
simulations. Equation [5] of Karlin and Altschul, PNAS (1990) 87:2264
tells us that the probability is a function of the length of the query
sequence, the length of the database sequence, and a factor, lambda,
which is calculated from the scoring matrix and the probabilities of
the residues in the query sequence and in the library.
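
As a rough illustration of the analytic calculation, here is a minimal
Python sketch. It assumes the commonly quoted Karlin-Altschul form
P = 1 - exp(-K*m*n*exp(-lambda*S)), which may not match Equation [5]
term for term. The function names, the toy four-letter alphabet with
+1/-1 scores, and the value of K are all assumptions made for the
example (K is a second constant from the same theory and is simply
passed in here). Lambda is taken as the positive root of
sum_ij p_i*q_j*exp(lambda*s_ij) = 1.

```python
import math

def solve_lambda(scores, p, q, tol=1e-12):
    """Positive root of sum_ij p[i]*q[j]*exp(lam*s_ij) = 1, found by bisection."""
    pairs = [(p[a] * q[b], s) for (a, b), s in scores.items()]
    if sum(w * s for w, s in pairs) >= 0 or all(s <= 0 for _, s in pairs):
        raise ValueError("need a negative expected score and some positive score")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    # f(0) = 0, f < 0 just above 0, and f > 0 beyond the root: bracket it.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:
        lo /= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def karlin_altschul_p(score, m, n, lam, K):
    """P(S >= score) for lengths m and n; K is an assumed input constant."""
    expected = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected)  # 1 - exp(-expected), stable for small values

# Toy scoring scheme (hypothetical): uniform 4-letter alphabet, match +1, mismatch -1.
alphabet = "ACGT"
freq = {a: 0.25 for a in alphabet}
scores = {(a, b): (1 if a == b else -1) for a in alphabet for b in alphabet}

lam = solve_lambda(scores, freq, freq)
print(lam)  # analytically ln(3) for this toy scheme, about 1.0986
print(karlin_altschul_p(20, 100, 1000, lam, K=0.1))
```

Changing the residue frequencies or the scoring matrix changes lambda,
and changing m or n changes P directly, so none of these quantities
comes from simulation.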
> A colleague of mine recently searched a protein sequence with BLAST against
> the "non-redundant protein databank" and against Swissprot. She got in both
> cases the same hit with the same score, but with different probabilities.
> With the non-redundant database P(N) was 0.84 and with Swissprot P(N) was
> 0.51. The segment pairs were exactly the same in both cases.
In BLAST, the P value is also corrected for the length of the
database. Thus, the same alignment from two different database
searches may have different P values if the databases differ in
length or amino acid composition.
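
To put rough numbers on the example quoted above: suppose
P = 1 - exp(-E), where E is the expected number of chance hits and is
proportional to database length, with lambda and composition held
equal. Under those assumptions the two reported P values imply a ratio
of effective database sizes. This is only a sketch; the P(N) that BLAST
reports may combine several segment pairs, so the real correction need
not be exactly this. The function names are made up for the example.

```python
import math

def expected_hits(p_value):
    """Invert P = 1 - exp(-E) to get the expected number of chance hits E."""
    return -math.log1p(-p_value)

def p_from_expected(e):
    """Forward direction: P = 1 - exp(-E)."""
    return -math.expm1(-e)

e_nr = expected_hits(0.84)  # non-redundant database
e_sp = expected_hits(0.51)  # Swissprot

# If E scales linearly with database length, this is the implied size ratio.
ratio = e_nr / e_sp
print(round(ratio, 2))  # about 2.57

# Scaling the Swissprot expectation by that ratio recovers the other P value.
print(round(p_from_expected(e_sp * ratio), 2))
```

So a database roughly two and a half times larger would be enough, on
this simple model, to move the same hit from 0.51 to 0.84.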