More complex random sequences can be produced from a linear hidden
Markov model (HMM) of a family of sequences. One advantage of this
approach is that the probabilities of insertions and deletions are
modeled as well as the character probabilities for each column. If
you have a group of sequences similar to the random ones you desire,
you can train an HMM for the family, and then generate sequences from it.
Both Sean Eddy's HMMER (http://genome.wustl.edu/eddy/hmm.html) and our
SAM system (http://www.cse.ucsc.edu/research/compbio/sam.html) have
programs for doing this. SAM has a WWW server for training HMMs and
for performing multiple alignments and distance scoring based on the
model, but at the moment, to generate typical sequences from a model,
you'll need to get a copy of the source code by sending email to
sam-info at cse.ucsc.edu.
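
To make the idea concrete, here is a minimal sketch of drawing sequences
from a linear HMM with match, insert, and delete behaviour. It is not SAM
or HMMER code: the names (MATCH_EMIT, P_INSERT, P_DELETE, sample_sequence)
and all the probabilities are invented for illustration, and a trained model
would have separate transition probabilities for every state rather than the
two global values assumed here.

```python
import random

# Toy linear HMM over DNA. Every number here is made up for illustration;
# a real model would be trained on a family of sequences.
ALPHABET = "ACGT"

# One row per match column: emission probabilities in ALPHABET order.
MATCH_EMIT = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
INSERT_EMIT = [0.25, 0.25, 0.25, 0.25]  # background used by insert states

P_INSERT = 0.1  # assumed chance of (another) insertion before each column
P_DELETE = 0.1  # assumed chance of skipping a column entirely


def sample_sequence(rng):
    """Walk the model left to right and emit one random sequence."""
    out = []
    for emit in MATCH_EMIT:
        # Insert state: a self-loop that emits zero or more extra characters.
        while rng.random() < P_INSERT:
            out.append(rng.choices(ALPHABET, weights=INSERT_EMIT)[0])
        # Delete state: pass through this column without emitting anything.
        if rng.random() < P_DELETE:
            continue
        # Match state: emit a character from this column's distribution.
        out.append(rng.choices(ALPHABET, weights=emit)[0])
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_sequence(rng))
```

Because insertions and deletions are part of the model, the generated
sequences vary in length as well as in content.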

In article <4bceuv$4u4 at knot.queensu.ca>, sibbald at qucis.queensu.ca (Peter Sibbald) writes:
|>
|> re: random DNA or protein sequences.
|>
|> The word "random" is a little vague. You can try any of the following
|> depending on what questions you are asking:
|>
|> 1. generate sequences with probabilistically the same character
|>    frequencies as some real sequence. Seldom will the frequency
|>    in the generated sequence be exactly the same as in the real
|>
|> 2. "shuffle" an existing sequence so that the order changes but
|>    the character frequencies remain the same. a lot of programs
|>    do this kind of thing, GCG for example, if memory serves.
|>
|> 3. generate sequences with same single character frequencies as
|>    a real sequence AND the same adjacency frequencies (doublet
|>    frequencies). For example, the pair "qz" is rare in english,
|>    perhaps i generate it in a string with probability 0. This
|>    is just a Markov chain with a memory and of course the memory
|>    can vary (i.e. you can use triplets, 4-plets etc.).
|>
|> 4. generate all characters with equal likelihood.
|>
|> etc. take your pick.
|>
|> peter sibbald, sibbald at qucis.queensu.ca
|>
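
The four options quoted above can each be sketched in a few lines. This is
only an illustration under simple assumptions: the function names are mine,
the template is an ordinary string, and the Markov version matches the
doublet (or longer) frequencies of the template only on average, not exactly.

```python
import random
from collections import Counter, defaultdict


def sample_by_frequency(template, length, rng):
    """Option 1: draw each character independently, weighted by its
    frequency in the template. Counts match only in expectation."""
    counts = Counter(template)
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))


def shuffle_sequence(template, rng):
    """Option 2: permute the template, so the counts are preserved exactly."""
    chars = list(template)
    rng.shuffle(chars)
    return "".join(chars)


def sample_markov(template, length, rng, order=1):
    """Option 3: Markov chain whose memory is `order` characters
    (order=1 uses doublets, order=2 triplets, and so on).
    Assumes order >= 1 and len(template) > order."""
    followers = defaultdict(Counter)
    for i in range(len(template) - order):
        followers[template[i:i + order]][template[i + order]] += 1

    def random_context():
        start = rng.randrange(len(template) - order + 1)
        return template[start:start + order]

    out = list(random_context())
    while len(out) < length:
        nxt = followers.get("".join(out[-order:]))
        if not nxt:
            # Dead end: this context was never followed by anything in the
            # template. Restart from a random context (this does add a
            # transition that the template itself never contained).
            out.extend(random_context())
            continue
        symbols = sorted(nxt)
        out.append(rng.choices(symbols, weights=[nxt[s] for s in symbols])[0])
    return "".join(out[:length])


def sample_uniform(alphabet, length, rng):
    """Option 4: every character equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(1)
    real = "ACGTACGGTTAACCGGATAT"
    print(sample_by_frequency(real, 20, rng))
    print(shuffle_sequence(real, rng))
    print(sample_markov(real, 20, rng, order=2))
    print(sample_uniform("ACGT", 20, rng))
```

A pair that never occurs in the template (the "qz" case) gets probability 0
in the Markov version, except at the restart points noted in the comment.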