In article 4u4 at knot.queensu.ca, sibbald at qucis.queensu.ca (Peter Sibbald) writes:
> re: random DNA or protein sequences.
> The word "random" is a little vague. You can try any of the following
> depending on what questions you are asking:
> 1. generate sequences with probabilistically the same character
>    frequencies as some real sequence. Seldom will the frequency
>    in the generated sequence be exactly the same as in the real
>    sequence.
> 2. "shuffle" an existing sequence so that the order changes but
>    the character frequencies remain the same. A lot of programs
>    do this kind of thing, GCG for example, if memory serves.
> 3. generate sequences with the same single character frequencies as
>    a real sequence AND the same adjacency frequencies (doublet
>    frequencies). For example, the pair "qz" is rare in English,
>    perhaps I generate it in a string with probability 0. This
>    is just a Markov chain with a memory and of course the memory
>    can vary (i.e. you can use triplets, 4-plets etc.).
> 4. generate all characters with equal likelihood.
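For concreteness, the four options quoted above could be sketched roughly as
follows. This is only an illustration in Python; the function names are made
up for this post, and the Markov version uses a memory of one character
(doublets) only.

```python
import random
from collections import defaultdict

def sample_by_frequency(real, n, rng):
    """Option 1: draw each character with the frequencies seen in `real`."""
    alphabet = sorted(set(real))
    weights = [real.count(ch) for ch in alphabet]
    return "".join(rng.choices(alphabet, weights=weights, k=n))

def shuffle_sequence(real, rng):
    """Option 2: permute `real`; the character counts stay exactly the same."""
    chars = list(real)
    rng.shuffle(chars)
    return "".join(chars)

def markov_sequence(real, n, rng):
    """Option 3: first-order Markov chain built from the doublets in `real`."""
    followers = defaultdict(list)
    for a, b in zip(real, real[1:]):
        followers[a].append(b)          # duplicates keep the doublet weights
    out = [rng.choice(real)]            # start according to single frequencies
    while len(out) < n:
        nxt = followers.get(out[-1])
        # Restart from the single-character frequencies on a dead end
        # (a character that only ever appears last in `real`).
        out.append(rng.choice(nxt) if nxt else rng.choice(real))
    return "".join(out)

def uniform_sequence(n, rng, alphabet="ACGT"):
    """Option 4: every character equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(n))
```

Passing in a seeded `random.Random` instance makes the output reproducible. A
pair that never occurs in the real sequence is then generated with
probability 0, except across the dead-end restart noted in the comment.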
I agree with all of the above. "Random" is more than a little vague; without
context it has no real meaning. I think what you need to ask, BEFORE you ask
how, is what you want to use the "random" sequence for. I can, however, say
from experience that the unix random() function does not generate values that
are particularly random: the least significant bit (or the most significant?
I can't quite remember) is an alternating sequence with period 2:
1010101010... I tried it, tossed it, and went to more complicated ways of
generating the sequences.
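As a rough illustration of that kind of defect: an alternating low-order bit
is exactly what any linear congruential generator produces when its modulus is
a power of two and its multiplier and increment are both odd, which is how the
classic C rand() was often implemented. The sketch below is in Python; the
constants are a commonly quoted textbook set and are an assumption here, not
necessarily what any particular system's random() uses.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Yield successive states of x -> (a*x + c) mod m (assumed constants)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def low_bits(seed, n):
    """Return the least significant bit of the first n outputs."""
    gen = lcg(seed)
    return [next(gen) & 1 for _ in range(n)]

# With odd a and odd c, the parity flips on every step, so the low bit
# simply alternates regardless of the seed.
print(low_bits(1, 10))
```

If you take the output modulo 4 to pick a base, that low bit goes straight
into your sequence, so it is safer to use the high-order bits or a better
generator.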
I was looking at, among other things, whether DNA sequences could be compressed
in a way that is very fast and still makes it easy to extract subsequences.
It turns out that encoding the sequence data using bit-shifting, 4 bases per
byte (2 bits per base), works pretty well and gives a 4:1 reduction relative
to one character per base. That is a saving in physical space, NOT compression
in the theoretical sense: the complete sequence is still directly present in
the representation, using the same number of symbols as before.
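A minimal sketch of that packing, in Python rather than whatever the original
code was written in. The base-to-code table and the function names are
assumptions made for illustration, and it only handles plain A/C/G/T (no
ambiguity codes such as N). The sequence length has to be stored separately,
since the last byte may be only partly used.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit assignment
BASES = "ACGT"

def pack(seq):
    """Pack a string of A/C/G/T into bytes, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i >> 2] |= CODE[base] << (2 * (i & 3))
    return bytes(out)

def base_at(packed, i):
    """Read base i directly, without unpacking anything else."""
    return BASES[(packed[i >> 2] >> (2 * (i & 3))) & 3]

def unpack(packed, start, stop):
    """Extract the subsequence [start, stop) from the packed form."""
    return "".join(base_at(packed, i) for i in range(start, stop))

seq = "GATTACAGATTACA"
packed = pack(seq)
print(len(seq), "bases in", len(packed), "bytes")
print(unpack(packed, 3, 7))
```

Because base i always lives at a fixed bit offset in byte i // 4, pulling out
a subsequence needs no decompression pass over the preceding data.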
One thing that stood out from my experimentation is that if you compare the
"randomness" of a real DNA/RNA sequence with that of a sequence produced by a
relatively simple random number generator, the real sequence is far more
"random" than the generated one. It's pretty difficult to get a generated
sequence that is as complex as a real one. The obvious exceptions are things
like telomeres, which repeat a unit only about 6 bases long.
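One crude way to put a number on "randomness" in this sense (not necessarily
the measure used in the experiments above) is the Shannon entropy of the
overlapping k-mers in a sequence. The function name and the choice of k = 3
are assumptions for the sake of the example.

```python
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy, in bits, of the overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# A telomere-like repeat contains only 6 distinct 3-mers, so it scores
# far below the 6-bit maximum possible for k = 3 over a 4-letter alphabet.
telomere_like = "TTAGGG" * 50
print(round(kmer_entropy(telomere_like), 3))
```

Running the same function over a real sequence and over the output of the
generator under test gives a quick side-by-side comparison, although a single
value of k will miss structure at longer range.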
cprice at netserv.unmc.edu