> WRT your original question, I think it is a bit open-ended. A lot depends
> on what you want to do with the data. For instance, do you need a clean
> dataset of excellent ORFs, or do you want to get _all_ the reasonable ORFs
> from the ESTs?
> If you explain the purpose and outline the steps, you'll get plenty of
> advice, I'm sure.
The purpose of the project is to clone all ORFs that can be identified as
"full length". I realize that making that judgement is itself problematic.
The "flow" I currently have in mind goes something like this:
1. Bring GenBank, Unigene and EST datasets in-house
2. "Clean" the raw data (EST, GenBank) by screening for vector, Alu, Mito,
E.coli sequences using ??? BLAST?
3. Take "cleaned" data and reassemble Unigene clusters into contigs as
much as possible using ???? (Consed, Lucy, SPACE, TIGR Assembler, other??)
4. Output consensus files
5. Scan for largest ORF using ??? (ORFinder from NCBI?)
6. If the largest ORF has an ATG and a STOP, clone it by PCR; otherwise save
it for future alignments. I realize it would be nice to check for
frameshifts etc. as well.
7. Once cloned, they will need to be sequenced completely, and the
sequence checked against the database (?Phred/Phrap/Consed/BLAST?)
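To make the screening in step 2 concrete, here is a toy stand-in. A real pipeline would BLAST (or cross_match) each EST against vector, Alu, mitochondrial and E. coli databases; the shared-k-mer test below only illustrates the idea of flagging reads that hit a contaminant set. All function names and thresholds are illustrative, not from any published tool.

```python
# Toy contaminant screen: flag reads sharing k-mers with a
# contaminant set (vector, Alu, mito, E. coli, ...). A real
# pipeline would use BLAST or cross_match instead.

def kmers(seq, k=11):
    """Return the set of all k-mers in an uppercase copy of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def flag_contaminated(reads, contaminants, k=11, min_hits=3):
    """Return the subset of reads sharing >= min_hits k-mers with
    any contaminant sequence."""
    contam_kmers = set()
    for c in contaminants:
        contam_kmers |= kmers(c, k)
    return {name: seq for name, seq in reads.items()
            if len(kmers(seq, k) & contam_kmers) >= min_hits}
```

A read carrying a stretch of vector will share many 11-mers with the vector database and get flagged; a clean read should share essentially none.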
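Steps 5 and 6 (scan each consensus for its largest ORF and require an ATG and a STOP) can be sketched as below. This is my own minimal version, not ORFinder; it checks all six reading frames and returns the longest ATG-to-stop ORF, or an empty string if none is found.

```python
# Minimal six-frame scan for the longest ATG..STOP ORF (stop codon
# included in the returned sequence).

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def longest_orf(seq):
    """Return the longest ORF starting with ATG and ending at a
    stop codon, over all six frames; "" if there is none."""
    stops = {"TAA", "TAG", "TGA"}
    best = ""
    for strand in (seq.upper(), revcomp(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in stops:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```

A consensus whose largest ORF comes back empty (no ATG, or no in-frame stop) would be set aside for future alignments rather than cloned, per step 6.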
I hope this helps clarify what I'm looking for. If there is a glaring
oversight in the flow, I would appreciate comments regarding that as well.
Thanks in advance,