displaying alignments

Tom Schneider toms at fcs260c2.ncifcrf.gov
Tue Dec 21 18:19:26 EST 1993


In article <CIEGtt.2zu at mentor.cc.purdue.edu> westerm at aclcb.purdue.edu writes:
| Hum, all the alignment display programs that I would have recommended to 
| Michael Baron (BARON at AVRI.AFRC.AC.UK), he has already tried:
| 
|     -- PrettyBox
|     -- BoxShade
|     -- PrettyPlot
|     -- The prettyprint option within SeqApp
| 
| None of the programs is to his liking; they all have faults. This
| is not, IMHO, uncommon among people who do protein alignments. I have
| yet to find a common method of doing protein consensus that satisfies 
| everyone. I can't speak for the other programs but the statement:

This is not surprising.

| >I HAVE ALREADY TRIED the GCG program Prettybox, but found that there 
| >are still some bugs in it, in that it makes some pretty strange 
| >decisions as to what is the consensus, and doesn't respond properly to 
| >the PLURALITY setting (which *should* allow one to say that one wants 
| >at least X residues to agree before there is a consensus).
| 
| shows a lack of understanding of how PrettyBox works -- perhaps not
| surprising since I've never written documentation for it. PrettyBox 
| does a consensus by 'voting' among the amino acids. The 'votes' are
| the scores as defined by the file 'prettypep.cmp'; said file can be
| changed if one doesn't like the scores. Whichever amino acid
| gathers the most votes is considered to be the consensus amino acid. 
| Because of the default scores, this can sometimes lead to what some
| people would consider strange results. An example:
| 
| Given 5 aligned amino acids:  Y Y Y W W
| 
| What is the consensus? Some people would say 'Y' because 3 of the 5
| amino acids are 'Y'. 

There is no biological rationale for doing this.  You have taken perfectly good
sequence data and wrecked it!  Do you really want to destroy your hard-won
data?

| However, if you look at the scoring table and tally up votes, you will
| find that 'F' is the most common denominator. Why?
| 
| Y vs. Y has a score of 1.5
| Y vs. W has a score of 1.1
| Y vs. F has a score of 1.4
| W vs. F has a score of 1.3
| W vs. W has a score of 1.5
| 
| Looking at how the aligned AAs vote:
|   The 3 'Y's give 3 times 1.5 or 4.5 votes to a 'Y' consensus
|   The 3 'Y's give 3 times 1.1 or 3.3 votes to a 'W' consensus
|   The 3 'Y's give 3 times 1.4 or 4.2 votes to a 'F' consensus
|   
|   The 2 'W's give 2 times 1.5 or 3.0 votes to a 'W' consensus
|   The 2 'W's give 2 times 1.1 or 2.2 votes to a 'Y' consensus
|   The 2 'W's give 2 times 1.3 or 2.6 votes to a 'F' consensus
| 
| Totaling everything up:
| 
|   A 'Y' consensus receives 4.5 + 2.2 or 6.7 votes
|   A 'W' consensus receives 3.3 + 3.0 or 6.3 votes
|   A 'F' consensus receives 4.2 + 2.6 or 6.8 votes
| 
| So the 'F' consensus 'wins' and PrettyBox will shade all 5 aligned 
| amino acids as 'similar' but not 'identical'. 
| 
| More complex examples can be created but the process is the same.
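
The vote tally quoted above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic as described in the post, not PrettyBox's actual code: the score table holds just the five pairs quoted (the real 'prettypep.cmp' covers all amino acids), and the names SCORES, score and tally are made up for illustration.

```python
# Pairwise scores exactly as quoted in the post; the real prettypep.cmp
# table is larger. All names here are hypothetical.
SCORES = {
    ("Y", "Y"): 1.5,
    ("Y", "W"): 1.1,
    ("Y", "F"): 1.4,
    ("W", "F"): 1.3,
    ("W", "W"): 1.5,
}


def score(a, b):
    """Look up a symmetric pairwise score."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def tally(column, candidates):
    """Sum the votes every aligned residue gives to each candidate consensus."""
    return {c: sum(score(res, c) for res in column) for c in candidates}


column = list("YYYWW")
votes = tally(column, "YWF")
winner = max(votes, key=votes.get)

for candidate, total in votes.items():
    print(candidate, round(total, 1))
print("consensus:", winner)
```

With these five scores 'F' collects 6.8 votes against 6.7 for 'Y' and 6.3 for 'W', so the winner is a residue that does not occur in the column at all.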

All of this is making up a model that destroys your data.

| Aside from changing the score data file, there are a couple 
| of command line switches that can modify the scores. '/threshold'
| will keep low scoring amino acids from voting. '/plurality'
| will only consider consensuses that gather a minimum number of votes (note
| that this is not the same as saying 'X residues must agree', just that
| 'X number of votes must be gathered'). '/simplify' can sometimes be
| useful by making similar amino acids act the same.
| 
| I am, slowly, working on another version of PrettyBox which will have
| even more command line switches that enable even finer control of the
| consensus algorithm. But even then, I suspect, not everyone will be
| satisfied.
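
The difference between "X residues must agree" and "X votes must be gathered" can be seen with the vote totals from the Y Y Y W W example. The sketch below assumes that '/plurality' acts as a simple cutoff on the winning vote total; that reading, the function name and the cutoff values are illustrative assumptions, not documented PrettyBox behaviour.

```python
def consensus_with_plurality(votes, plurality):
    """Return the top-voted residue, or None if it gathers too few votes.

    Assumption: plurality is a minimum on the winning vote total,
    not a count of agreeing residues.
    """
    winner = max(votes, key=votes.get)
    return winner if votes[winner] >= plurality else None


# Vote totals taken from the worked example in the post.
votes = {"Y": 6.7, "W": 6.3, "F": 6.8}

low_cutoff = consensus_with_plurality(votes, 5.0)
high_cutoff = consensus_with_plurality(votes, 7.0)
print(low_cutoff)
print(high_cutoff)
```

Under this reading a cutoff of 5.0 still yields 'F', while a cutoff of 7.0 yields no consensus even though three of the five residues are identical.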

Nobody will be satisfied because all consensus methods destroy data.  How about
using sequence logos instead?  Then at least you won't be destroying data at
each position.  (You would still lose correlations, but usually there are not
enough data to detect those with protein sequences anyway.)

@article{Schneider.Stephens.Logo,
author = "T. D. Schneider
 and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}
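
For comparison, here is a rough sketch of what a logo reports for the same Y Y Y W W column. It computes the information at the position as log2(20) minus the Shannon uncertainty of the observed residues, then scales each letter by its frequency. The small-sample correction used in the paper is left out for brevity, so for only five sequences the numbers are overestimates; the function name and the 20-letter alphabet size are assumptions made for this example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information (bits) at one aligned position and per-letter heights.

    Simplification: no small-sample correction is applied.
    """
    n = len(column)
    freqs = {aa: count / n for aa, count in Counter(column).items()}
    uncertainty = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - uncertainty
    heights = {aa: f * info for aa, f in freqs.items()}
    return info, heights


info, heights = column_information("YYYWW")
print(round(info, 3), "bits")
for aa, h in sorted(heights.items(), key=lambda item: -item[1]):
    print(aa, round(h, 3))
```

Both Y and W remain visible, with Y drawn taller in proportion to its frequency, and no residue absent from the data is introduced.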

See the FAQ sheet I just posted on bionet.info-theory for more details.
This archive contains information, the paper and example logos
in PostScript:

  ftp ncifcrf.gov
  anonymous
  (your user id)
  cd pub/delila
  get README
  get bionet.info-theory.Z     (there is one without .Z also)
  binary
  get logo.bbl.Z
  get logo.tex.Z
  get lambcro.logo.Z
  get globin.logo.Z
  get lexa.logo.Z
  get t7.logo.Z
  quit

Have a good end-of-year!  (Sorry, I won't be able to respond to questions until
after the 10th, but you could get help on bionet.info-theory).

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov




More information about the Bio-soft mailing list