One possible approach is to use the BCP-style dumps
of the NCBI dbEST database, which are on the FTP server
ftp.ncbi.nih.gov in the directory:
repository/dbEST
The approach might be:
- download the full BCP dump of the 'library' table
- identify all libraries for organism "Homo sapiens"
- use the resulting id_lib values to screen the full BCP dump
of the 'est' table: any EST whose id_lib matches one of the
Homo sapiens id_libs should be kept/processed
- use the id_est keys of the Homo sapiens ESTs to identify
the sequences of interest among the sequence.full.* files
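
As a rough illustration of the screening steps above, here is a minimal
Python sketch. It is not based on the actual dump layout: the tab
delimiter, the column positions, and the function names are all
assumptions made for illustration, so check the README in the directory
for the real field order before relying on anything like this.

```python
import csv
from typing import Iterable, Iterator, List, Set


def read_bcp(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield one list of fields per line of a dump file.

    ASSUMPTION: one record per line, fields separated by `delimiter`.
    The real dbEST dumps may use a different layout.
    """
    with open(path, newline="", encoding="latin-1") as handle:
        yield from csv.reader(handle, delimiter=delimiter,
                              quoting=csv.QUOTE_NONE)


def human_library_ids(library_rows: Iterable[List[str]],
                      id_lib_col: int,
                      organism_col: int,
                      organism: str = "Homo sapiens") -> Set[str]:
    """Collect the id_lib of every library row for the given organism."""
    needed = max(id_lib_col, organism_col)
    return {
        row[id_lib_col]
        for row in library_rows
        if len(row) > needed and row[organism_col].strip() == organism
    }


def ests_in_libraries(est_rows: Iterable[List[str]],
                      id_est_col: int,
                      id_lib_col: int,
                      lib_ids: Set[str]) -> Iterator[str]:
    """Yield the id_est of every EST row whose id_lib is in lib_ids."""
    needed = max(id_est_col, id_lib_col)
    for row in est_rows:
        # Skip short or malformed rows instead of raising IndexError.
        if len(row) > needed and row[id_lib_col] in lib_ids:
            yield row[id_est_col]
```

The two filters stream over their input, so only the set of id_libs
(and whatever you do with the resulting id_est keys) has to be held in
memory. The column indexes are passed in rather than hard-coded because
I have not verified them here.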
Then apply a similar approach to the daily 'delete'
and 'insert' BCP dumps.
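
For the daily dumps, the bookkeeping could look something like the
sketch below. Again, this is only a guess at the shape of the problem:
it assumes the 'delete' dump can be reduced to a list of id_est keys,
that the 'insert' dump has the same columns as the full 'est' dump, and
that deletes should be applied before inserts. The function name and
parameters are hypothetical.

```python
from typing import Iterable, List, Set


def apply_daily_update(human_est_ids: Set[str],
                       deleted_est_ids: Iterable[str],
                       inserted_est_rows: Iterable[List[str]],
                       id_est_col: int,
                       id_lib_col: int,
                       human_lib_ids: Set[str]) -> Set[str]:
    """Return a new set of human id_est keys after one day's changes.

    ASSUMPTION: deletes are applied first, then inserts, so a record
    that is deleted and re-inserted on the same day is kept.
    """
    updated = set(human_est_ids)           # leave the caller's set untouched
    updated.difference_update(deleted_est_ids)

    needed = max(id_est_col, id_lib_col)
    for row in inserted_est_rows:
        # Keep only newly inserted ESTs that belong to a human library.
        if len(row) > needed and row[id_lib_col] in human_lib_ids:
            updated.add(row[id_est_col])
    return updated
```

Note that new human libraries can also appear over time, so the set of
Homo sapiens id_libs would need to be refreshed as well.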
There's a README in the directory that might be a good
place to start.
Note that there are nearly 8 million human ESTs in dbEST.
That is a fairly big job, so if you're going to do it, you
might want to consider just building a complete local
copy of dbEST, if you have the resources for it.
Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
>-----Original Message-----
>From: Seth Johnson [mailto:johnson.biotech At gmail.com]
>Sent: Monday, December 18, 2006 11:00 AM
>To: genbankb At magpie.bio.indiana.edu
>Subject: [Genbank-bb] Human ESTs
>
>Hi all,
>
>We are creating a local database of human sequences for
>high-throughput pipeline. So, I have a question regarding the
>availability of sequences by organism. Is there some way I
>can get, for example, just Homo Sapiens ESTs without parsing
>hundreds of files that comprise an EST GenBank release?
>
>--
>Best Regards,
>
>Seth Johnson
>Senior Bioinformatics Associate
>
>Fx: (775) 251-0358
>