From genmark@ford.gatech.edu Fri May 15 17:04:34 1992
Received: from ford.gatech.edu by sunflower.bio.indiana.edu
	(4.1/9.5jsm) id AA04791; Fri, 15 May 92 17:04:33 EST
Received: from ford.gatech.edu by ford.gatech.edu (AIX 3.1/UCB 5.61/4.03)
          id AA23218; Fri, 15 May 92 18:00:05 -0400
Date: Fri, 15 May 92 18:00:05 -0400
From: genmark@ford.gatech.edu ("Electronic Mail Server")
Message-Id: <9205152200.AA23218@ford.gatech.edu>
Subject: Instructions
Apparently-To: gilbertd@sunflower.bio.indiana.edu
Status: R

          GENMARK : SYSTEM FOR PREDICTING PROTEIN CODING REGIONS
                           Version 1.1  4/15/92
                    (Internet Electronic Mail Server)


GENERAL INFORMATION

     GenMark is a software package available from the Georgia Tech     
     School of Applied Biology  &  Office of Information Technology 
     for the quick analysis of newly sequenced DNA.

     GenMark 1.1 is based on a special type of Markov chain model of coding
and noncoding nucleotide sequences. It has proved to be a sensitive indicator
of protein coding regions in E. coli and closely related species. In the
analysis of a 96 bp segment, the rate of false positive predictions is about
10% and the rate of false negatives is about 22.5%. Training the program for
other species is fairly straightforward, and new species will be added later,
based on demand and available information.

     GenMark tolerates ambiguities in newly sequenced DNA: up to 10% of the
sample DNA may be represented by ambiguity symbols.
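
     As a rough pre-submission check, the fraction of ambiguity symbols in a
sequence can be estimated locally. The sketch below is not part of GenMark.
The helper names are hypothetical, and it assumes that every letter other
than A, C, G and T counts as an ambiguity symbol and that non-letters are
skipped. How GenMark itself counts toward the 10% figure is not stated here.

```python
# Hypothetical helpers for a local check; not GenMark's own code.
UNAMBIGUOUS = set("ACGT")
MAX_AMBIGUOUS_FRACTION = 0.10  # the 10% figure quoted in the instructions


def ambiguity_fraction(raw: str) -> float:
    """Return the fraction of letters in raw that are not A, C, G or T."""
    # Keep letters only; digits, spaces and punctuation are skipped.
    letters = [ch.upper() for ch in raw if ch.isalpha()]
    if not letters:
        return 0.0
    ambiguous = sum(1 for ch in letters if ch not in UNAMBIGUOUS)
    return ambiguous / len(letters)


def within_ambiguity_limit(raw: str) -> bool:
    """True if the sequence stays at or below the assumed 10% limit."""
    return ambiguity_fraction(raw) <= MAX_AMBIGUOUS_FRACTION
```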

     GenMark receives submissions by electronic mail and replies with a list
of the open reading frames that it recognizes as protein coding regions.
Additional output, such as a PostScript(tm) graph of the results, may be
requested with options. GenMark should reply by electronic mail within an
hour of a sequence's submission.


SUBMISSION OF SEQUENCES FOR ANALYSIS

     Nucleotide sequences destined for processing should be sent via E-mail to:

	genmark@ford.gatech.edu

The subject line of this message must contain one of three keywords:

     instructions
     registration
     genmark

     If the subject of the message is "instructions", GenMark will reply with
the most current submission instructions and news available on the system.

     If the subject of the message is "registration", your message will be
logged in a registration roster. It is NOT necessary to register in order to
use GenMark. If you decide to register, we ask that you include your name, your
E-mail address, and a brief list of the organisms which you would like to see
supported in future versions of GenMark (the family Enterobacteriaceae should
be fairly well represented by the E. coli information).

     We will keep those who register informed of further developments in the
software and its options.

     If the subject of the message is "genmark", the program will try to
analyze the contents of the message as sequence information. At a minimum,
the message must contain the word "data" on a line by itself, followed by the
sequence information (see below for a discussion of how to supply options and
for some example submissions).
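
     For readers who script their submissions, a minimal message could be
assembled as sketched below. This is an illustration only, not an official
client. The function name and the 60-character line wrapping are assumptions
made here; the wrapping relies on the stated behavior that line breaks in the
sequence are ignored.

```python
from email.message import EmailMessage

GENMARK_ADDRESS = "genmark@ford.gatech.edu"


def build_submission(sender: str, sequence: str) -> EmailMessage:
    """Build a minimal GenMark submission (hypothetical helper)."""
    # Wrap the sequence so that no line of the body is overly long.
    wrapped = "\n".join(
        sequence[i:i + 60] for i in range(0, len(sequence), 60)
    )
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = GENMARK_ADDRESS
    msg["Subject"] = "genmark"  # keyword that requests sequence analysis
    # The word "data" stands on a line by itself, then the sequence follows.
    msg.set_content("data\n" + wrapped + "\n")
    return msg
```

The resulting message can be handed to whatever mail transport is available
locally, for example smtplib.SMTP.send_message.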


SUPPLYING OPTIONS TO GENMARK

     No options are required for GenMark to function. The options described
below only change the manner in which the program works. Only one option is
permissible per line. All of the options must occur before the keyword "data"
and the sequence information. ALL OF THE KEYWORDS MUST BE ENTERED IN LOWERCASE
LETTERS; the case of the sequence itself doesn't matter.

     The options:

     #            A comment. The rest of the line, after this symbol, is
                  utterly ignored.

     address      Alternative E-mail address. After this option, include a
                  valid E-mail address to which the program should send the
                  output (if it differs from the address from which the
                  submission was sent).

     name         The name of the person who submitted the sequence. This is
                  particularly important for sites where several people will
                  be submitting sequences from the exact same E-mail address.
                  After this option, include the name.

     order        The Markov chain order to use. If you don't know what this
                  is, leave it alone. Higher is better, up to a point. The
                  default is 4; orders 1 through 5 are available.
                  After this option, include the new order.

     psgraph      Give PostScript(tm) output. This instructs the program to
                  include a PostScript graph of the results which can be
                  printed on any PostScript compatible printer. The page is
                  divided into six horizontal panels with the probability
                  function on the y-axis, and the nucleotide position along
                  the x-axis. The six panels represent the six reading frames:
                  panels 1-3 show frames 1-3 on the direct strand, and panels
                  4-6 show frames 1-3 on the complementary strand. Open
                  reading frame indicators appear along the middle of each
                  graph. Since there's a limit to the size of E-mail messages,
                  expect the PostScript output to be sent as several messages.

     step         Set the window step. This must be a multiple of 3
                  nucleotides. The default is 12. In practice, this setting
                  gives you some freedom in adjusting the resolution of the
                  PostScript(tm) graph. For instance, a step of 3 gives 4
                  times the resolution of the default of 12.

     threshold    Set the open reading frame threshold. This is a number
                  between 0 and 1 (or, as a percentage, between 0 and 100)
                  giving the minimum value of the probability function that
                  an open reading frame must have to be accepted as a
                  protein coding region. The default is 0.50.

     title        The title you want to give to your PostScript(tm) graph.

     window       The size of the analysis window (if you don't know what
                  this is, leave it alone). The default is 96 nucleotides;
                  96 to 144 nucleotides generally works best.
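
     The six panels described under the psgraph option correspond to the six
reading frames of a double-stranded sequence. The sketch below shows one
common way of enumerating them. It assumes that frames 1-3 on the
complementary strand are read from the start of the reverse complement;
GenMark's exact numbering convention is not spelled out in these
instructions, and the function name is hypothetical.

```python
# Complement table covering the standard DNA ambiguity symbols.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def six_frames(seq: str) -> dict:
    """Map panel number (1-6) to the sequence read in that frame."""
    direct = seq.upper()
    # Reverse complement: complement every base, then reverse the string.
    complementary = direct.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = direct[offset:]         # panels 1-3
        frames[offset + 4] = complementary[offset:]  # panels 4-6 (assumed)
    return frames
```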


SAMPLE SUBMISSIONS TO GENMARK

SAMPLE 1

> mail genmark@ford.gatech.edu
Subject: genmark
# This example shows a minimal submission, just using the defaults set by 
# the program.
#
# NOTE: this will reply automatically to the exact address that it was sent
# from with only a list of open reading frames.
#
# The actual DNA sequence may contain any of the standard DNA ambiguity
# symbols. Anything that isn't a letter (like numbers, punctuation, spaces,
# carriage returns) will just be ignored.
data
TCSSATGCATGHCATCGATWWCTCAGTCAGNA...


SAMPLE 2

> mail genmark@ford.gatech.edu
Subject: genmark
# This is an example of using all of the different options.
address biologist@college.edu
name John Doe
order 5
psgraph
step 6
threshold 0.50
title John Doe's New Protein Coding Region
window 144
data
TCAGTTCCAAGGTTTCCCAAAGGGTTTTCCCCAAAAGGGG...
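
     The option lines of Sample 2 can also be generated and checked by a
small script before mailing. The following is a sketch under the rules stated
in these instructions (order 1 through 5, step a multiple of 3, threshold
between 0 and 1 or between 0 and 100). The function name is hypothetical, and
rejecting a step larger than the window is a choice made here; GenMark itself
accepts such a request.

```python
def option_lines(order=4, step=12, threshold=0.50, window=96,
                 psgraph=False, address=None, name=None, title=None):
    """Return GenMark option lines, one option per line (hypothetical)."""
    if order not in (1, 2, 3, 4, 5):
        raise ValueError("order must be 1 through 5")
    if step <= 0 or step % 3 != 0:
        raise ValueError("step must be a positive multiple of 3")
    if step > window:
        # Design choice of this sketch, not a GenMark restriction.
        raise ValueError("step should not be larger than the window")
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must lie in 0..1 or 0..100")

    lines = []
    if address:
        lines.append(f"address {address}")
    if name:
        lines.append(f"name {name}")
    lines.append(f"order {order}")
    if psgraph:
        lines.append("psgraph")
    lines.append(f"step {step}")
    lines.append(f"threshold {threshold:.2f}")
    if title:
        lines.append(f"title {title}")
    lines.append(f"window {window}")
    return lines  # all keywords are lowercase, as required
```

These lines go before the "data" keyword in the message body.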


THINGS TO WATCH OUT FOR

     The sendmail program used for transferring messages across the network
is limited to messages of at most 64000 characters. Therefore, remember to
send your information in chunks smaller than the 64000 character limit.
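
     A submission can be checked against this limit before it is mailed. The
sketch below counts only the characters of the message body; whether mail
headers count toward the 64000 characters is not stated here, so leave some
margin. The names are hypothetical.

```python
MAX_MESSAGE_CHARS = 64000  # limit quoted in these instructions


def fits_in_one_message(body: str, limit: int = MAX_MESSAGE_CHARS) -> bool:
    """True if the body is strictly shorter than the limit (headers excluded)."""
    return len(body) < limit
```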

     The PostScript(tm) output might take up more space than is permissible
in a mail message, so GenMark will send the graphic in parts that are smaller
than 64K in length.

     If you shrink the step down to 3 and send a good sized sequence, the
PostScript(tm) output will be huge, so don't be surprised. Try to reserve
that setting for smaller sequences. For short sequences, you'll want to make
the step smaller. We suggest a step of 6 for any sequence under about 1.5 kb
long, and a step of 3 for sequences less than about 800 bases long.
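
     The step suggestions above can be written as a small rule of thumb. The
sketch treats the approximate figures of 800 bases and 1.5 kb as sharp
cutoffs, which is a simplification made here, and the function name is
hypothetical.

```python
def suggested_step(sequence_length: int) -> int:
    """Suggest a window step from the sequence length in bases."""
    if sequence_length < 800:
        return 3    # short sequences: finest resolution
    if sequence_length < 1500:
        return 6    # sequences under about 1.5 kb
    return 12       # GenMark's default step
```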

     Don't ask the program to make the step larger than the window. It won't
crash the program, but then again you'll probably just get garbage back.

     The sequences you send are deleted as soon as they have been processed
by the program. We cannot recover them for you. If you do not receive a
response in a couple of hours, something's wrong. Verify the format of your
submission and resend it.

     The graphic response may be effective for analyzing the intron/exon
structure of eukaryotic sequences, but there are no guarantees. In such a
case, the list of open reading frames would almost certainly be useless; only
the graphic would make any sense.

     In many cases, the graphic output can tell you much more about the
sequence in question than the open reading frame listing alone. Careful
evaluation of the graphic could yield clues to sequencing errors and
frameshifts.


REFERENCES

Should you refer to the results of GenMark analysis, please cite the
following references:

Borodovsky M. (1990) Recognition of coding regions in nucleotide sequences.
   In M.F. Frank-Kamenetskii (ed.), Computer Analysis of Genetic Texts, Nauka,
   Moscow.

Borodovsky M., McIninch J. Prediction of Gene Locations Using DNA Markov Chain
   Models (submitted to CABIOS).


QUESTIONS, PROBLEMS, SUGGESTIONS

Please send any comments or questions that you might have about the software
or the method of coding region recognition to:

     mb56@hydra.gatech.edu  (Mark Borodovsky)

or

     gt1619a@hydra.gatech.edu  (James McIninch)


