Stuart M. Brown wrote:
> > Meanwhile, I am working on the new SRS 5.0 parsing for dbEST, and will
> > certainly be trying to index the blast hits in some way. This gets a
> > little tricky - for example, it is not trivial to combine the scores
> > and the text for a given hit, though it should be possible.
> >
> > So, an obvious question: what information would you like to search for
> > in the BLAST hit fields?
>
> This is really agonizing. Here is all of this beautiful data, but apparently no
> good way to use it. I don't think that the SRS indices should be expanded
> to include 15 protein and 15 nucleotide hits (and their names and the
> significance level of each hit). It already takes us anywhere from 6 to 36
> hours to recreate the SRS indices on our GCG system after each full GenBank
> update (and this is on a fast Alpha machine!). Perhaps the time is ripe for
> a new tool - sort of a reverse BLASTer that takes a given sequence and
> identifies all ESTs that mention that sequence in their BLAST report.
>
I wouldn't be so pessimistic about using the data in dbEST. The protein
and nucleotide hits could even be indexed as subentries, so they could be
searched independently of the entries in which they occur. That would
allow queries such as: find me all ESTs that are similar, above a certain
threshold, to a given DNA sequence.
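To make the subentry idea concrete, here is a rough sketch in Python. It is
not SRS code: the Hit record, the function names and the accession numbers
are all made up for illustration, and it assumes each hit carries a score
where higher means more similar.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit treated as a subentry (hypothetical record layout)."""
    est_id: str      # EST entry whose BLAST report contains the hit
    target_id: str   # sequence named in the hit (made-up accessions below)
    score: float     # assumed: higher score means more similar


def build_hit_index(hits):
    """Map each target sequence to the hits that mention it."""
    index = defaultdict(list)
    for hit in hits:
        index[hit.target_id].append(hit)
    return index


def ests_similar_to(index, target_id, min_score):
    """Return sorted IDs of ESTs whose hit on target_id scores >= min_score."""
    matching = index.get(target_id, [])
    return sorted({h.est_id for h in matching if h.score >= min_score})


# Illustrative data only; none of these identifiers are real entries.
hits = [
    Hit("EST001", "X12345", 250.0),
    Hit("EST002", "X12345", 80.0),
    Hit("EST003", "Y99999", 300.0),
    Hit("EST001", "Y99999", 120.0),
]
index = build_hit_index(hits)
print(ests_similar_to(index, "X12345", 100.0))
```

Because the index is keyed on the hit rather than on the EST entry, the
query never has to scan the entries themselves, which is the point of
indexing the hits as subentries.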
Indexing in SRS4 is a problem because memory usage can be enormous,
depending on the size of the databank. The reason indexing takes you
anywhere from 6 to 36 hours is probably swapping: once SRS runs out of
physical memory the computer must start swapping, and that slows
indexing almost to a halt.
In SRS5 that problem is solved by indexing large databanks in chunks and
merging them later. A good candidate for this - and one of the reasons I
did it - is dbEST.
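The general chunk-and-merge technique can be sketched as follows. This is
only an illustration of the idea, not how SRS5 is implemented: the function
names, the whitespace tokenizer and the sample entries are my own
assumptions, and a real indexer would write each partial index to disk
instead of keeping it in a list.

```python
import heapq
from itertools import groupby, islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def index_chunk(entries):
    """Build a sorted list of (word, entry_id) postings for one chunk.

    Only one chunk has to fit in memory at a time.
    """
    postings = {
        (word, entry_id)
        for entry_id, text in entries
        for word in text.lower().split()
    }
    return sorted(postings)


def build_index(entries, chunk_size):
    """Index entries chunk by chunk, then merge the sorted partial indices."""
    # Assumption: partial indices stay in memory here to keep the sketch short.
    partial = [index_chunk(chunk) for chunk in chunked(entries, chunk_size)]
    merged = heapq.merge(*partial)  # streaming merge of already-sorted inputs
    return {
        word: [entry_id for _, entry_id in group]
        for word, group in groupby(merged, key=lambda posting: posting[0])
    }


# Illustrative entries only.
entries = [
    ("EST001", "kinase human"),
    ("EST002", "human actin"),
    ("EST003", "actin mouse"),
]
print(build_index(entries, chunk_size=2))
```

The merge step only reads the partial indices sequentially, so peak memory
is bounded by the chunk size rather than by the size of the whole databank.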
regards
Thure
...the latest prediction for the final release is the end of November - I am
currently working on displaying subentries