Isn't FASTA with optimization identical to Smith-Waterman? The
optimization step in FASTDB is precisely a Smith-Waterman scoring of the top
5,000 sequences, and hence FASTDB with optimization is a Smith-Waterman
analysis on those sequences. The only difference in this case is the very
remotely related sequences that do not show up in the top 5,000 sequences in
the initial pass. BLAZE and MPSearch rectify this by essentially running
Smith-Waterman on the entire database.
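
To make the comparison concrete, here is a minimal Python sketch of the two strategies. It is not the FASTDB, BLAZE, or MPSearch code: the first-pass filter (`kmer_score`, a shared 3-mer count) is a hypothetical stand-in for the real heuristic, the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions rather than a real substitution matrix, and the function names are mine.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def kmer_score(query, subject, k=3):
    """Hypothetical cheap first pass: count subject k-mers found in the query."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(
        1
        for i in range(len(subject) - k + 1)
        if subject[i:i + k] in query_kmers
    )


def two_pass_search(query, db, top_n=5000):
    """Heuristic ranking first, then Smith-Waterman on the top_n survivors only."""
    ranked = sorted(db, key=lambda name: kmer_score(query, db[name]), reverse=True)
    return {name: sw_score(query, db[name]) for name in ranked[:top_n]}


def full_search(query, db):
    """Smith-Waterman against every database entry."""
    return {name: sw_score(query, db[name]) for name in db}
```

For every sequence that survives the first pass, the two functions return the same score; they differ only for sequences the filter drops, which is exactly where a remote relative with no shared 3-mers would be lost.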
I would be interested to know how you handle the "gold-standard" problem.
Whenever I try to test the sensitivity of a database search program using a
particular query, I attempt to find all the members of the superfamily to
which the query belongs. I try to use superfamilies in which 1) there is a
broad range of degrees of relationship, to get a maximal test of the search
sensitivity, and 2) there is some independent measure of relationship other
than just sequence similarity. False negatives (family members missing from
the search results at some threshold) are relatively easy to handle, but
false positives (high-scoring sequences not known to be in the superfamily)
give me problems. Although not necessarily members of the superfamily, the
high-scoring "false positives" are often proteins that share at least a
structural or functional motif with part of the query sequence, and hence I
cannot truly classify them as false positives. The only ways I have found
around this problem are to 1) not report ROC curves or 2) use short queries
which themselves are a single motif or protein domain. This seems to me to be
a common problem of using a "long" query with a local alignment algorithm (SW,
FASTA, FASTDB, BLAST, etc.).
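
As a rough illustration of why these ambiguous hits matter for a ROC-style evaluation, here is a small Python sketch. The hit names and scores are invented for the example, and the function names and the `count_ambiguous_as_fp` switch are my own assumptions, not part of any search package. It walks down a ranked hit list and counts true and false positives, either treating motif-only hits as false positives or leaving them out of the tally.

```python
def roc_points(hits, members, ambiguous=frozenset(), count_ambiguous_as_fp=True):
    """Walk down the ranked hit list and return cumulative (fp, tp) counts.

    hits: list of (name, score) pairs
    members: names known to belong to the superfamily
    ambiguous: names that share only a motif or domain with the query
    """
    tp = fp = 0
    points = [(0, 0)]
    for name, _score in sorted(hits, key=lambda h: h[1], reverse=True):
        if name in members:
            tp += 1
        elif name in ambiguous and not count_ambiguous_as_fp:
            continue  # leave the unclassifiable hit out of the tally
        else:
            fp += 1
        points.append((fp, tp))
    return points


def tp_before_first_fp(points):
    """Number of family members found before the first false positive."""
    return max(tp for fp, tp in points if fp == 0)


# Invented example data: three known members, one motif-only hit, one unrelated hit.
hits = [("m1", 90), ("motif_only", 80), ("m2", 70), ("unrelated", 60), ("m3", 50)]
members = {"m1", "m2", "m3"}
ambiguous = {"motif_only"}

strict = roc_points(hits, members, ambiguous, count_ambiguous_as_fp=True)
lenient = roc_points(hits, members, ambiguous, count_ambiguous_as_fp=False)
```

With this made-up list, the number of members found before the first false positive is 1 under the strict labeling and 2 under the lenient one, so the apparent sensitivity depends on a classification I cannot make with confidence.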