Hey Gordon, do you think the following algorithm will work for me?
1. Obtain the proteins corresponding to the full-length cDNAs. I don't
know of a tool for this, but if I can't find one, I think I can write a
short program that extracts the longest possible ORF from each cDNA and
translates it to protein.
2. Run blastx with the ESTs against the protein file from step 1.
3. Filter out every hit that has a p-value greater than 1e-6.
4. There are three possibilities for the output alignments: a) matches the
beginning of a protein; b) matches the end of a protein; c) matches the
middle of a protein. If an alignment entry doesn't fall into one of these
three cases, it is discarded.
5. To determine if an alignment entry is in case a) or b), we first
check whether the alignment reaches an end of the EST (allowing a 2%
deviation from the end). Then we check whether the protein side of the
alignment also reaches an end of the protein. If both are true, we
decide between case a) and b) based on the alignment orientation. If
all of these checks pass, we can declare that there is a homology match.
6. If the alignment covers the whole EST and is contained in the
corresponding protein, this gives us case c). This is also a homology
match.
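
To make steps 3 to 6 concrete, here is a rough Python sketch of how I
imagine classifying a single hit. It is only my reading of the steps, not
tested on real data. The function name classify_hit and its arguments are
made up for illustration. I am assuming 1-based coordinates like the
qstart/qend/sstart/send columns of tabular BLAST output, and that for
blastx a reverse-strand hit is reported with qstart greater than qend.

```python
def classify_hit(q_start, q_end, est_len, s_start, s_end, prot_len,
                 p_value, cutoff=1e-6, tol=0.02):
    """Return 'beginning', 'end', 'middle', or None (discard) for one hit.

    q_* are nucleotide positions on the EST, s_* are residue positions
    on the protein. All names here are hypothetical.
    """
    # Step 3: drop weak hits.
    if p_value > cutoff:
        return None

    q_lo, q_hi = min(q_start, q_end), max(q_start, q_end)
    est_slack = tol * est_len      # 2% of the EST length, in bases
    prot_slack = tol * prot_len    # 2% of the protein length, in residues

    # Does the alignment reach each end of the EST / of the protein?
    est_left = (q_lo - 1) <= est_slack
    est_right = (est_len - q_hi) <= est_slack
    prot_start = (s_start - 1) <= prot_slack
    prot_end = (prot_len - s_end) <= prot_slack

    # Orientation: on the reverse strand, translation runs from the
    # right-hand end of the EST towards the left-hand end.
    plus = q_start <= q_end
    reaches_upstream = est_left if plus else est_right
    reaches_downstream = est_right if plus else est_left

    # Step 6, case c): whole EST aligned, so it sits inside the protein.
    if reaches_upstream and reaches_downstream:
        return "middle"
    # Step 5, case a): EST overhangs the protein start, so the alignment
    # starts at the protein start and runs to the downstream EST end.
    if prot_start and reaches_downstream:
        return "beginning"
    # Step 5, case b): EST overhangs the protein end.
    if prot_end and reaches_upstream:
        return "end"
    # Step 4: anything else is discarded.
    return None
```

For example, with a 600 bp EST and a 400 aa protein, a hit covering EST
positions 301 to 600 and protein residues 1 to 100 would come out as case
a), and a hit covering the whole EST would come out as case c).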
Thanks a lot.
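
P.S. For step 1, in case no existing tool turns up, this is roughly the
short program I have in mind. It is a minimal sketch under my own
assumptions: the standard genetic code, all six reading frames, and an
"ORF" meaning a stretch from the first ATG up to the next stop codon (or
the end of the sequence if no stop is found). The function names are
mine, not from any library.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf_protein(cdna):
    """Return the longest M-initiated peptide over all six frames."""
    seq = cdna.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            # Each piece between stop codons is a candidate; trim it so
            # that it starts at its first methionine.
            for segment in translate(strand[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best
```

Running longest_orf_protein on each cDNA and writing the results out as
FASTA would give the protein file for step 2.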