In article <8h8gqh$hv0$1 at news.duke.edu>, "Simon Lin" <simon_lin at med.unc.edu> writes:
>>Anybody has experience of using GCG databases with HMMER?
I've just been working on that recently. This is for OpenVMS, but I
suspect what I say will be true for Unix as well. This is also a work in
progress, as I still have some bugs to iron out.
HMMSEARCH will read a protein database with no problems when I specify it
like "sw:swissprot.seq". However, that doesn't get you anywhere if you
want to use the equivalent of
To deal with that I've used our very hacked up local version of TOFASTA in
a pipe, to translate the DNA to protein and feed it on the fly into
$ pipe tofasta/infile=gb_pr*/frames=123456/out=sys$output | -
hmmsearch blah.hmm sys$input > results.out
1. I had to modify sqio.c in the hmmer package to treat "sys$input" like "-"
on Unix. On unix "hmmsearch blah.hmm - > results.out" should work
without any modificatins.
2. There's some sort of memory problem feeding hmmsearch a really
large translated database in this manner - I only just today wrote to S. Eddy
about that so don't know the solution yet. It may or may not affect
Unix systems. I suspect it will cause the hmmsearch process to use a
lot of memory though. I also suspect that the problem is related to
the huge "protein" fasta entries which result from just translating
an entire genbank DNA entry. Since it seems to depend more on the
composition of the data fed in than on the absolute size of the
In order to work around the memory problem this works:
$ tofasta/infile=gb_pr*/frames=123456/out=sys$output/scommand="@dohmmsearch %s"/split=300/sfrag="frag"
Which chews off 300 entries from the database, translates in all 6 frames,
and writes that to a temporary file. The procedure dohmmsearch then picks
up the file (the name goes in on the %s), feeds it into hmmsearch as a
normal file, and then deletes the temporary file and returns. Tofasta
makes the next one, and round and round it goes.
This definitely works, but you can't just blindly specify an "E" value
because that is proportional to the size of the database. So if you change
the "split" value, you also have to change "E" in the dohmmsearch procedure.
(I've asked Sean if there's some way to specify the score value as a
cutoff, but have not heard back yet.)
As I said, our tofasta has been modified, here are the added switches:
/FRames=123456 Translate into forward/reverse reading frames
creating new entries with names -for1, etc for each
/SPlit=N Create a new output file every N input entries
/SName="frag" Name to give each fragment (required if /SPLIT)
/SCommand="@dothis %s" Command which is run on each fragment written (requires /SPLIT)
I suspect I'm going to have to add a switch to cause translated fragments
to automatically break into overlapping pieces below some maximum size.
Not many programs were written to expect a 100,000 aa protein!
Please email me directly if you have more specific questions.
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech