GIL .. BIOSCI/Bionet News .. Biosequences .. Software .. FTP |
It was converted to a C program in early 1990's and after an update in 1993, remained as is for several years, as I wrote around it and moved thru C++ then on to Java as a primary bioinformatics language. During this time, I'd wanted often to teach readseq to handle sequence documentation; the original ignored all but a few fields of information other than sequence data. In late 1990's my sequence analysis program SeqPup was in its Java incarnation and needed this, especially handling of feature annotations, locating genes, introns and such in a sequence.
After slowly updating readseq from a haphazard C program to an object oriented structure in Java, I pulled it back out of its parent source to become again a stand-alone program for format conversion. This release version 2, first available in 1999, continues support for the "classic" C version, in that it supports the same command-line options, but has extensions for sequence documentation, feature table and other additions, plus new sequence format conversions, and a lot of bug fixing. This java version is also more efficient, working faster than the compiled C classic version. It still isn't efficient enough to handle large sequences (genome sized or full GenBank/EMBL data release files).
In its current Java incarnation, interfacing Readseq with other languages is done mainly through command-line calls to the main program. If your programs are in Perl, you may want to use the bioperl.org collection with its SeqIO package.
PUBLIC DOMAIN NOTICE:
This software is freely available to the public for use. The
author, Don Gilbert, of Readseq and the Java package
'iubio.readseq' does not place any restriction on its use or
reproduction. Developers are encourged to incorporate parts in
their programs. I would appreciate being cited in any work or
product based on this material. This software is provided without
warranty of any kind.
-- Don Gilbert
software@bio.indiana.edu, May 2001
Bioinformatics group,
Biology Department & Cntr. Genomics & Bioinformatics,
Indiana University, Bloomington, Indiana
The call format 'java -cp readseq.jar app' starts this simple graphic user interface, based on Java Swing. For MacOS, see the ReadseqApp driver program which does the equivalent to this commandline call. The menu options in this 'app' interface cover the equivalent of the command line options (see below).
The steps to use this are
File menu:
Read & reformat biosequences, Java command-line version Usage: java -cp readseq.jar run [options] input-file(s) For more details: java -cp readseq.jar help more Options -a[ll] select All sequences 'all' will cause processing of all sequences (default now for version 2, for compatibility with version 1). Use items=1,2,3 to select a subset. -c[aselower] change to lower case -C[ASEUPPER] change to UPPER CASE 'caselower' and 'CASEUPPER' will convert sequence case -degap[=-] remove gap symbols 'degap=symbol' will remove this symbol from output sequence (- normally) -f[ormat=]# Format number for output, or -f[ormat=]Name Format name for output see Formats list below for names and numbers 'format=genbank', 'format=gb', 'format=xml', et cetera to select an output format. You can also use format number (below), but these numbers may change with revisions. Alternate names of formats are listed below (e.g., 'Pearson|FastA|fa' allows 'pearson', 'fasta', or 'fa' as name. This is case-insensitive. -inform[at]=# input format number, or -inform[at]=Name input format name. Assume input data is this format 'inform=genbank' lets you specify data input format. Normally Readseq guesses the input format (usually correctly). Use this option if you wish to bypass this input format guessing. -i[tem=2,3,4] select Item number(s) from several 'items=2,3,4' will select these sequence records from a multisequence input file. -l[ist] List sequences only 'list' will list titles of sequence records -o[utput=]out.seq redirect Output 'output=file', send output to named file. -p[ipe] Pipe (command line, < stdin, > stdout) 'pipe' will cause input data to come from STDIN and output go to STDOUT unix standard files (unless -out is given and input file given), and no prompting or progress reports will occurr. -r[everse] reverse-complement of input sequence 'reverse' will write the sequence from end to start, and DNA bases are complemented. Amino residues are not complemented. -t[ranslate=]io translate input symbol [i] to output symbol [o] use several -tio to translate several symbols translates given sequence bases, e.g. -tAN to change 'A' to 'N' -v[erbose] Verbose progress 'verbose' will print some progress reports -ch[ecksum] calculate & print checksum of sequences Documentation and Feature Table extraction: -feat[ures]=exon,CDS... extract sequence of selected features -nofeat[ures]=repeat_region,intron... remove sequence of selected features 'feature=CDS,intron' lets you specify those features to extract, or remove, in the output. Currently this causes each feature to produce a new sequence record. Use the ' -field=AC,ID... include selected document fields in output -nofield=COMMENT,... remove selected document fields from output See below for subrange option -subrange=-1000..10 * extract subrange of sequence for feature locations -subrange=1..end -subrange=end-10..end+99 -extract=10000..99999 extract all features and sequence from given base range See below for pair, unpair options -pair=1 * combine features (fff,gff) and sequence files to one output -unpair=1 * split features,sequence from one input to two files Pretty format options: -wid[th]=# sequence line width -tab=# left indent -col[space]=# column space within sequence line on output -gap[count] count gap chars in sequence numbers -nameleft, -nameright[=#] name on left/right side [=max width] -nametop name at top/bottom -numleft, -numright seq index on left/right side -numtop, -numbot index on top/bottom -match[=.] use match base for 2..n species -inter[line=#] blank line(s) between sequence blocks This program requires a Java runtime (java or jre) program, version 1.1.x, 1.2 or later The leading '-' on option is optional if '=' is present. All non-options (no leading '-' or embedded '=') are used as input file names. These options and call format are compatible with the classic readseq (v.1992) * New experimental feature handling options, may not yet work as desired. To test readseq with a built-in data set, use: java -cp readseq.jar test
#!/bin/sh envtemp=/tmp/rseq$$.env env > ${envtemp} /usr/java/bin/java -cp readseq.jar cgi env=${envtemp} /bin/rm ${envtemp}The options in this 'cgi' interface include all of the command line options (see below). The 'env & ${envtemp}' is used to pass CGI environment variables to readseq. Included in the readseq.jar rez/ folder is the CGI form presented as 'cgiform.html'. If you wish to customize this, extract it from readseq.jar, edit, and put into the classpath ahead of readseq.jar, as
-- where this directory contains reaseq.jar and readseq.cgi -- jar -xf readseq.jar rez/cgiform.html -- edit rez/cgiform.html -- /usr/java/bin/java -cp .:readseq.jar cgi env=${envtemp}
The options for using readseq cgi are same as for readseq command line, with a few additions (see iubio.readseq.cgi java source).
One can currently extract sequence of a given set of features from a biosequence with feature tables.
Option: feature=gene,mRNA,repeat (feat for short) -- produce embl file gene features only java -cp readseq run format=embl out=flygenes.em feat=gene,mRNA fly-X.gbUpdates include extraction of proper strand orientation. E.g. extracting a feature with location 'complement(join(1..10,20..30))' produces the reverse-complement of bases 1..10 and 20..30
Option: subrange=#start..#stop -subrange=-1000..10 extract subrange of sequence for feature locations -subrange=1..end -subrange=end-10..end+99 -1000...10 is 1000 bases upstream/before to 10 in feature, 1..end is full feature range (default for no -subrange option) end-10..end+99 is 10 before end to 99 after end of feature only valid with -feat/-nofeatSequence Range Math
Readseq handles subranges as the intersection with a given feature location. Subrange math respects 'complement()' orientation, as well as the 'end' value to specify end of a given location. Subranges can be a join() of several ranges. The subrange is intersected with a given feature range, and extended beyond its ends as specified. Here are some subrange computation examples:
subrange(-100..10) -- 100 bases upstream of feature to 10 bases inside. subrange(1..end1) -- the full range of the feature. subrange(end1..end100) -- from the end to 100 bases downstream of the feature. subrange(-100..end100) -- 100 bases upstream to 100 bases downstream. subrange(join(-100..10,end1..end100)) -- 100 bases upstream to 10 into, joined with the end to 100 bases downstream. Test SeqRange.subrange() 10..60 & subrange(-100..10) = -91..19 10..60 & subrange(1..end1) = 10..60 10..60 & subrange(end1..end100) = 60..159 10..60 & subrange(-100..end100) = -91..159 10..60 & subrange(join(-100..10,end1..end100)) = join(-91..19,60..159) complement(10..60) & subrange(-100..10) = complement(51..161) complement(10..60) & subrange(1..end1) = complement(10..60) complement(10..60) & subrange(end1..end100) = complement(-89..10) complement(10..60) & subrange(-100..end100) = complement(-89..161) complement(10..60) & subrange(join(-100..10,end1..end100)) = complement(-89..10,51..161) join(1..5,20..40,55..60) & subrange(-100..10) = -100..5 join(1..5,20..40,55..60) & subrange(1..end1) = join(1..5,20..40,55..60) join(1..5,20..40,55..60) & subrange(end1..end100) = 60..159 join(1..5,20..40,55..60) & subrange(-100..end100) = join(-100..5,20..40,55..159) join(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100)) = join(-100..5,60..159) complement(1..5,20..40,55..60) & subrange(-100..10) = complement(55..161) complement(1..5,20..40,55..60) & subrange(1..end1) = complement(1..5,20..40,55..60) complement(1..5,20..40,55..60) & subrange(end1..end100) = complement(-98..1) complement(1..5,20..40,55..60) & subrange(-100..end100) = complement(-98..5,20..40,55..161) complement(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100)) = complement(-98..1,55..161) join(1000..1500,2000..2500,3000..3100) & subrange(-100..10) = 899..1009 join(1000..1500,2000..2500,3000..3100) & subrange(1..end1) = join(1000..1500,2000..2500,3000..3100) join(1000..1500,2000..2500,3000..3100) & subrange(end1..end100) = 3100..3199 join(1000..1500,2000..2500,3000..3100) & subrange(-100..end100) = join(899..1500,2000..2500,3000..3199) join(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100)) = join(899..1009,3100..3199) complement(1000..1500,2000..2500,3000..3100) & subrange(-100..10) = complement(3091..3201) complement(1000..1500,2000..2500,3000..3100) & subrange(1..end1) = complement(1000..1500,2000..2500,3000..3100) complement(1000..1500,2000..2500,3000..3100) & subrange(end1..end100) = complement(901..1000) complement(1000..1500,2000..2500,3000..3100) & subrange(-100..end100) = complement(901..1500,2000..2500,3000..3201) complement(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100)) = complement(901..1000,3091..3201)
ID | Name | Read | Write | Int'leaf | Features | Sequence | Suffix | Content-type |
---|---|---|---|---|---|---|---|---|
1 | IG|Stanford | yes | yes | -- | -- | yes | .ig | biosequence/ig |
2 | GenBank|gb | yes | yes | -- | yes | yes | .gb | biosequence/genbank |
3 | NBRF | yes | yes | -- | -- | yes | .nbrf | biosequence/nbrf |
4 | EMBL|em | yes | yes | -- | yes | yes | .embl | biosequence/embl |
5 | GCG | yes | yes | -- | -- | yes | .gcg | biosequence/gcg |
6 | DNAStrider | yes | yes | -- | -- | yes | .strider | biosequence/strider |
7 | Fitch | -- | -- | -- | -- | yes | .fitch | biosequence/fitch |
8 | Pearson|Fasta|fa | yes | yes | -- | -- | yes | .fasta | biosequence/fasta |
9 | Zuker | -- | -- | -- | -- | yes | .zuker | biosequence/zuker |
10 | Olsen | -- | -- | yes | -- | yes | .olsen | biosequence/olsen |
11 | Phylip3.2 | yes | yes | yes | -- | yes | .phylip2 | biosequence/phylip2 |
12 | Phylip|Phylip4 | yes | yes | yes | -- | yes | .phylip | biosequence/phylip |
13 | Plain|Raw | yes | yes | -- | -- | yes | .seq | biosequence/plain |
14 | PIR|CODATA | yes | yes | -- | -- | yes | .pir | biosequence/codata |
15 | MSF | yes | yes | yes | -- | yes | .msf | biosequence/msf |
16 | ASN.1 | -- | -- | -- | -- | yes | .asn | biosequence/asn1 |
17 | PAUP|NEXUS | yes | yes | yes | -- | yes | .nexus | biosequence/nexus |
18 | Pretty | -- | yes | yes | -- | yes | .pretty | biosequence/pretty |
19 | XML | yes | yes | -- | yes | yes | .xml | biosequence/xml |
20 | BLAST | yes | -- | yes | -- | yes | .blast | biosequence/blast |
21 | SCF | yes | -- | -- | -- | yes | .scf | biosequence/scf |
22 | Clustal | yes | yes | yes | -- | yes | .aln | biosequence/clustal |
23 | FlatFeat|FFF | yes | yes | -- | yes | -- | .fff | biosequence/fff |
24 | GFF | yes | yes | -- | yes | -- | .gff | biosequence/gff |
25 | ACEDB | yes | yes | -- | -- | yes | .ace | biosequence/acedb |
Option: pair-feature-seq (pair for short) Option: unpair-feature-seq (unpair for short) -- produce embl file from feature,sequence pair java -cp readseq run pair=1 format=embl out=flypair.em features-Xtop.tsv dna-Xtop.raw -- produce embl file of subrange from gene features using feature,sequence pair java -cp readseq run pair=1 format=embl out=flypairex.em feat=gene,mRNA subrange=-1000..10 features-Xtop.tsv dna-Xtop.raw -- produce fff,fasta file pair from embl sequence java -cp readseq run unpair=1 format=fff out=flyunpair.fff flypair.emThe pair option causes input sequence files to be checked and read as a pair of feature,sequence files. The unpair option will split an input feature+sequence file into paired feature, sequence files.
# flatfeat-version 2 Key Location Qualifiers source 1..1684 /organism="Drosophila melanogaster" ;/chromosome="3L" ; /map="69C" ; /ACCESSION="U57488" ... gene 18..1684 /gene="est-6" CDS join(18..1404,1456..>1684) /gene="est-6" ;... exon 18..1404 /gene="est-6" intron 1405..1455 /gene="est-6" exon 1456..>1684 /gene="est-6"Similar to GFF, FFF works better for me as it contains all information about a feature in one row. GFF splits gene, mRNA, other multi-location features among several rows, and currently lacks a standard syntax for grouping these feature parts.
##gff-version 2 # seqname source feature start end score strand frame attributes DMU57488 - source 1 1684 . + . organism "Drosophila melanogaster" ; chromosome "3L" ... DMU57488 - gene 18 1684 . + . gene "est-6" DMU57488 - CDS 18 1404 . + . group CDS_1 ; gene "est-6" ; ... DMU57488 - CDS 1456 1684 . + . group CDS_1 DMU57488 - exon 18 1404 . + . gene "est-6" DMU57488 - intron 1405 1455 . + . gene "est-6" DMU57488 - exon 1456 1684 . + . gene "est-6"
XML may be akin to a spreadsheet format for bioinformatics data; it is flexible enough to add new fields and content, in contrast to most other biosequence formats, excepting ASN.1. However, the data structure and field specifics for each XML implementation differ widely; just formatting data with angle bracket tags doesn't solve the problem of making it readable by software.
I don't know yet of an XML format for biosequence data that in general use, though some being used for genome annotation may well make it into Readseq's parser soon. For instance, TIGR's XML for genome annotation makes good sense, but it uses a different data structure than Readseq-XML, so requires a different software unit to parse.
Readseq 2 appears faster in limited testing than Bioperl's SeqIO Perl modules, by roughly a factor of 2 (using Java 1.3 and Perl 5.005 on a Sun Solaris system). Release 2 is substatially improved over release 1, and should provide more reliable results as well as the many improvements in documentation and feature handling. It retains command-line compatibility with release 1, so should be a drop-in replacement.
Readseq is currently not recommended for very large (100+MB) sequence files, whether as a single record or multiple records. Hopefully this limit will be lifted in coming months to enable use with genome-sized sequences and feature tables.
The main limitation is memory usage - it is not optimized for large data sets, but reads all of a sequence record in memory, generating numerous objects for documentation and a byte array for sequence.
Readseq started out as a component of a sequence analysis program, and continues as a component tool, which I use in various other programs. For use with other Java programs, add the readseq.jar to your library class path, then call Readseq public entries.
Here is one basic variant to convert A to B formats:
// compile as: javac -classpath readseq.jar:$CLASSPATH testrsq.java // run as: java -cp .:readseq.jar testrsq inputfile(s)... import java.io.*; import iubio.readseq.*; class testrsq { public static void main(String[] args) { try { int outid= BioseqFormats.formatFromName("fasta"); BioseqWriterIface seqwriter= BioseqFormats.newWriter(outid); seqwriter.setOutput( System.out); seqwriter.writeHeader(); Readseq rd= new Readseq(); for (int i=0; i<args.length; i++) { rd.setInputObject( args[i] ); if (rd.isKnownFormat() && rd.readInit()) rd.readTo( seqwriter); } seqwriter.writeTrailer(); } catch (Exception e) { e.printStackTrace(); } } } |
To read a sequence file for use in your program:
import java.io.*; import iubio.bioseq.*; import iubio.readseq.*; class testrsq2 { public static void main(String[] args) { try { // your data goes here Object inputObject= new FileReader("myseq.embl"); Readseq rd= new Readseq(); String seqname= rd.setInputObject( inputObject ); //Readseq.setInputObject() accepts many basic Java objects including // Reader, File, URL, InputStream, String, char[], byte[], // and an Enumeration of these objects System.err.println("Reading from "+seqname); if ( rd.isKnownFormat() && rd.readInit() ) { while (rd.readNext()) { BioseqRecord seqrec= new BioseqRecord(rd.nextSeq()); // do something with seqrec.... FeatureItem[] fits= seqrec.findFeatures( new String[]{"CDS", "intron"}); if (fits==null) System.out.println(" No such features found."); else { System.out.println(" Extracted features and their sequence"); for (int k= 0; k<fits.length; k++) try { System.out.println( fits[k]); Bioseq bseq= seqrec.extractFeatureBases( fits[k]); System.out.println(bseq); System.out.println(); } catch (SeqRangeException sre) { System.out.println(sre.getMessage()); } } } } } catch (Exception e) { e.printStackTrace(); } } } |
To write a sequence file given your program has sequence and sequence document objects:
import java.io.*; import java.util.Date; import iubio.bioseq.*; import iubio.readseq.*; class testrsq3 { public static void main(String[] args) { try { // your data goes here Bioseq seq = new Bioseq( "ccggagtgaggagcaacatgaactacgtgggactgggacttatcattgtgctgagctgcc"); BasicBioseqDoc seqdoc= new BasicBioseqDoc("DMU57488"); seqdoc.addDocField( seqdoc.kAccession, "U57488"); seqdoc.addComment("Just a test sequence"); seqdoc.addDate( new Date("February 11, 1997")); seqdoc.addSequenceStats( seq); seqdoc.addFeature("gene", new SeqRange("18..1684")); seqdoc.addFeatureNote("symbol", "est-6"); seqdoc.addFeature("intron", new SeqRange("1405..1455")); BioseqRecord seqrec= new BioseqRecord( seq, seqdoc); // now write to a file BioseqWriterIface writer= BioseqFormats.newWriter( BioseqFormats.formatFromName("XML")); writer.setOutput( new FileWriter("myseq.xml")) ; writer.writeHeader(); if (writer.setSeq( seqrec)) writer.writeSeqRecord(); else System.err.println("Failed to write " +seqrec); // or throw exception writer.writeTrailer(); writer.close(); } catch (Exception e) { e.printStackTrace(); } } } |
See in the source package for further example programs:
iubio.readseq.BioseqReaderIface iubio.readseq.BioseqWriterIfaceand then subclassing the iubio.readseq.BioseqFormat class, and adding the new iubio.readseq.BioseqFormat subclass to the list in rez/BioseqFormats.properties.
Find readseq source in ftp:, http://iubio.bio.indiana.edu/soft/molbio/readseq/java/ as readseq-source.zip
perl src/split4javac.pl -i rseq.mcp.xml -a src/split4javac.addlist -o splits/ >& log.split -- This reads project sources from .xml and creates output .java in split4/ It can also be run from MacPerl/BBEdit
find splits -name "*.java" -exec \ javac -classpath ibm-xml4j-min.jar:orgxml.jar -sourcepath splits -d splits {} \; -- if there are javac errors related to missing source classes, add the missing class to the src/split4javac.addlist and rerun split4javac.pl -- test as java -cp .:splits:ibm-xml4j-min.jar:orgxml.jar run
Version 2.1.3 (07 June 2001) updates: Added -reverse option for reverse-complement of sequence Feature extraction of complement() locations now does reverse-complement Added feature subrange extractions: -subrange=-1000..10 extract subrange of sequence for feature locations -subrange=1..end -subrange=end-10..end+99 -1000...10 is 1000 bases upstream/before to 10 in feature, 1..end is full feature range (default for no -subrange option) end-10..end+99 is 10 before end to 99 after end of feature only valid with -feat/-nofeat -extract=10000..99999 extract all features, sequence from given base range Added pair/unpair option for combining feature files and sequence files pair=1 myseq.gff myseq.fasta format=genbank == means combine features + sequence in one output file unpair=1 myseq.genbank format=fff == means output paired features, sequence files from one input Added compare=1 option to test differences in sequence files Added ClustalW alignment, AceDB sequence formats Added FFF/Flat-Feature-Format (one-line DDBJ/EMBL/GenBank Feature Table) Added GFF/General-Feature-Format Fixed EMBL to handle amino - SwissProt sequence format, GenBank to handle amino - GenPept A Perl script to convert readseq source to javac compatible form is included. Dropped obsolete/unsupported ZukerSeqFormat, OlsenSeqFormat, and reorganized format order (numbering) Various bug fixes; Java 1.2/3 compatibility To do: -- Need to handle files in +300 MB range for genomes -- add HTML output (feature hyperlinks? colorized features? ) -- test -pair and -unpair seq ( .fff + .seq, .gff + .seq) merge options