#!/usr/bin/perl # overgo picking and design workbench # written by Arjun Prasd # aprasad@nhgri.nih.gov # run soop -h or perldoc soop for full documentation ######################################################################### ## ## ## PUBLIC DOMAIN NOTICE ## ## ## ## This software/database is ``United States Government ## ## Work'' under the terms of the United States Copyright Act. ## ## It was written as part of the authors' official duties for ## ## the United States Government and thus cannot be ## ## copyrighted. This software/database is freely available to ## ## the public for use without a copyright notice. ## ## Restrictions cannot be placed on its present or future ## ## use. ## ## ## ## Although all reasonable efforts have been taken to ensure ## ## the accuracy and reliability of the software and data, the ## ## National Human Genome Research Institute (NHGRI) and the ## ## U.S. Government does not and cannot warrant the ## ## performance or results that may be obtained by using this ## ## software or data. NHGRI and the U.S. Government disclaims ## ## all warranties as to performance, merchantability or ## ## fitness for any particular purpose. ## ## ## ## In any work or product derived from this material, proper ## ## attribution of the authors as the source of the software ## ## or data should be made, using: ## ## Thomas et.al., 2002 "Parallel Construction of Orthologous ## ## Sequence-Ready Clone Contig Maps in Multiple Species" Genome ## ## Res. 12:1277-85 ## ## as the citation. ## ## ## ######################################################################### ######################################################################### # $Id: soop,v 2.16 2004/02/03 20:46:41 aprasad Exp aprasad $ ######################################################################### # $Log: soop,v $ # Revision 2.16 2004/02/03 20:46:41 aprasad # BUG FIX: Big bug that lost every last pip in an alignment found and fixed # by Jim Thomas and Bob Sullivan. To quote Bob: # # Changes made were only in subroutine 'getPipFromSeq' where there was # no 'final' push to array after the 'for' loop ended. Also, because # the loop counter '$i' is now needed for checking after loop ends, I # declare that variable before the loop statement. # # BUG FIX: calculation of pip_end / pip_start coordinates in getPipFromSeq # was incorrect (+/- 1 base in many cases) # # Revision 2.15 2004/01/28 19:26:25 aprasad # BUG FIX: POD documentation fixed to reflect some of the recent changes # # Revision 2.14 2003/08/18 21:09:23 aprasad # BUG FIX: Removed some debugging code # # Revision 2.12 2003/08/18 14:22:44 aprasad # FEATURE: Added ability to use query coordinates for spacing probes # # Revision 2.11 2002/12/20 04:21:31 aprasad # FEATURE: added -v option to print out version number # # Revision 2.10 2002/08/23 21:37:50 aprasad # BUG FIX: Overgos sometimes contained N's when making in FASTA mode, fixed this. # FEATURE: Documentation changed to reflect full bibliographic information # of publication... (Yea!) # # Revision 2.9 2002/08/21 18:27:19 aprasad # BUG FIX: Added check for EOF to eliminate warnings for short input files # BUG FIX: Small documentation bug fixes # BUG FIX: Will no longer write out a fasta file if input file is an overgo # file # FEATURE: -s (start at) option now works for FASTA mode, offset starts at # that coordinate for every fasta entry (wierd, but I don't know # what else to do, this feature was a request). # # Revision 2.8 2002/07/19 14:44:31 aprasad # BUG FIX: Several accidental global variables switched to my'd # variables. # BUG FIX: Removed some debugging code accidentally left over. # FEATURE: Added tracking of overgo offset and made changes to # overgo making algorithm to track this instead of the # sequence. # FEATURE: Added seq_diff function to create diff format for fasta # FEATURE: Completely revised format of defline output in fasta files # to contain more information and be more compact and easily # read. # FEATURE: Updated POD documentation adding section on output file # formats to reflect these changes. # # Revision 2.7 2002/06/05 16:12:42 aprasad # BUG FIX: Updated POD documentation that had fallen behind current version # FEATURE: Now defaults to writing primer file (overgos.stj) in FASTA mode # FEATURE: By default the first "word" on the defline is used as the primer # prefix name when writing .stj file in FASTA mode # FEATURE: Small cosmetic changes to usage screen # # Revision 2.6 2002/03/01 19:23:57 aprasad # FEATURE: in fasta mode now writes fasta formatted overgos file by # default -p is now optional in that mode also # BUG FIX: Various minor cosmetic changes and some code cleanup # BUG FIX: getGCscore used to count ends of overgo (36mer) for end GC # optimization instead of ends of overlap portion. # # Revision 2.5 2001/11/30 19:18:50 aprasad # BUG FIX: Updated to remove warnings given with the -w switch # FEATURE: Some small cosmetic changes # # Revision 2.4 2001/11/01 22:29:56 aprasad # BUG FIX: Several minor bug fixes; this should be a stable version # FEATURE: New POD documentation and -h option to run perldoc soop # # Revision 2.3 2001/08/13 22:44:21 aprasad # FEATURE: Added overgo making from fasta files # BUG FIX: Numerous including bad results from getGCscore! # Completely rewrote pickOvergos to a simpler, cleaner solution # Need to do some testing because the picking may be different # # Revision 2.2 2001/08/07 21:24:00 aprasad # BUGGY Intermediate check-in # FEATURE: Adding ability to make overgos from a FASTA formatted file # In the process of a major code re-orginazation # # Revision 2.1 2001/07/06 14:50:43 aprasad # Feature: Added -p option to write out primers # # Several small cosmetic changes # # Revision 2.0 2001/05/31 17:14:42 aprasad # "Stable" version to pick from either blast2 or PipMaker verbose file output ######################################################################### # Revision 1.7 is a rewrite of large portions of the code. # Revisions 1.6 and 2.0 should be relatively stable # # Note to self: # '|' is hard coded into getPipFromSeq to be the character indicating # a match between bases ######################################################################### ################### Begin Main ###################### ######################################################################### require 5.6.0; use warnings; use strict; # for debugging my $logfilename = './logfile'; my $debug = 0; my $run_str = join(' ', $0, @ARGV); # set up all of the parameters # also prints Usage statement and dies if usage is not correct my %opts = getParams(); # allow for setting of debug option on command line $debug = $opts{debug} || $debug; writeLog("Begin: $run_str\nparseVerbose") if ($debug); my $pips; if ($opts{input_filetype} =~ /^BLAST/i) { $pips = parseBlast($opts{verbose_filename}); } elsif ($opts{input_filetype} =~ /^PipMaker/i) { $pips = parseVerbose($opts{verbose_filename}); } elsif ($opts{input_filetype} =~ /^FASTA/) { justMakeOvergosAlready(\%opts); exit 0; } writeLog("Verbose file parsed, now filter for possible overgos") if ($debug); # now make all possible overgos and eliminate # pips that cannot make overgos $pips = overgoFilter($pips, $opts{pick_from_query}, $opts{overgo_length}, $opts{overgo_overlap}, $opts{overgo_target_gc}, $opts{overgo_gc_wiggle}, $opts{identity_cutoff}); # if wanted, print all possible overgos to a file if ($opts{all_overgo_filename} ne '') { writeLog("Writing all possible overgos to $opts{all_overgo_filename}") if ($debug); writeOvergos($pips, $opts{all_overgo_filename}, $opts{pick_from_query}); } writeLog("Pips isolated, now pick") if ($debug); # it's scoring time! my $selected_pips = pickOvergos($pips, $opts{gap_length}, $opts{identity_cutoff}, $opts{ref_offset}, $opts{overgo_overlap}, $opts{overgo_target_gc}, $opts{coordinate_name}); writeLog("Write picked probes to $opts{overgo_filename}") if ($debug); writeOvergos($selected_pips, $opts{overgo_filename}, $opts{pick_from_query}); # now output the results printResults(\*STDOUT, $selected_pips, $pips, \%opts, $run_str); if ($opts{results_filename} ne '') { if (open(OUT, ">$opts{results_filename}")) { printResults(\*OUT, $selected_pips, $pips, \%opts, $run_str); } else { warn("Error opening $opts{results_filename}: $!\n"); } } if ($opts{primer_filename} ne '') { # make probes writePrimers($selected_pips, $opts{primer_filename}, $opts{primer_prefix}, $opts{overgo_overlap}); } writeLog(' ########################## End #############################') if ($debug); ######################################################################### ################# End Main Subroutines follow ################### ######################################################################### ######################################################################### # justMakeOvergosAlready -- just takes a FASTA file and makes overgos # from it # INPUTS: hash containing the minimum elements: # verbose_filename -- the input FASTA formatted file # primer_filename -- the output primer formatted file # primer_prefix -- prepend to the primer name # overgo_length -- the length of the overgo # overgo_overlap -- the overlap between primers # overgo_target_gc -- the target GC proportion # overgo_gc_wiggle -- the allowable variation from the target # GC proportion # # OUTPUTS: writes a primer file with the best overgos # returns a summary of what it did ######################################################################### sub justMakeOvergosAlready { my $opts = shift; my $seqs = parseFasta($$opts{verbose_filename}); my @overgos; my ($overgo, $offset); foreach my $seq (@$seqs) { ($overgo, $offset) = makeOvergoFromSeq($$seq{seq}, $$opts{overgo_length}, $$opts{overgo_overlap}, $$opts{overgo_target_gc}, $$opts{overgo_gc_wiggle}, $$opts{ref_offset}); if ($overgo) { $$seq{overgo} = $overgo; # fake out output producers... ugly kludge alert $$seq{og_offset} = $offset; $$seq{pip_length} = length($$seq{seq}); $$seq{ref_start} = 0; $$seq{ref_end} = $$seq{pip_length}; $$seq{contig_name} = $$seq{defline}; $$seq{og_id} = 0; } } if ($$opts{primer_filename}) { writePrimers($seqs, $$opts{primer_filename}, $$opts{primer_prefix}, $$opts{overgo_overlap}); } if ($$opts{overgo_filename} && $$opts{input_filetype} !~ /overgos/) { writeOvergos($seqs, $$opts{overgo_filename}, 0); } # now report the results (should probably split into another sub) { my $num_og = 0; printf("%-49s %s\n", 'Beginning of defline', 'GC percent'); foreach my $seq (@$seqs) { my($name) = ($$seq{defline} =~ /^(\S+)/); if (exists $$seq{overgo}) { $num_og++; my $gc = ($$seq{overgo} =~ tr/GCgc//) / $$opts{overgo_length}; printf("%-50s %-02.1f\n", $name, $gc * 100); } else { printf("%-50s Couldn't make overgo\n", $name); } } print "\nMade $num_og overgos out of " . @$seqs . " FASTA entries\n"; } } ######################################################################### # parseFasta -- parses fasta files and returns an array of hashes of # defline => 'defline', seq => 'ACTG' ######################################################################### sub parseFasta { my $filename = shift; my $defline; # holder for current defline my $seq; # holder for current sequence my @seqs; # array containing sequences open(IN, "$filename") || die "Couldn't open $filename: $!"; while ($_ = ) { next if (/^$/); chomp; # check for new contig if (/^>\s?(.*)$/) { push @seqs, { defline => $1 }; } else { $seqs[$#seqs]{seq} .= $_; } } close IN; return \@seqs; } ######################################################################### # writePrimers - write out primers to a file ######################################################################### sub writePrimers { my $overgos = shift; my $filename = shift; my $prefix = shift; my $overlap = shift; my $number = 1; open(OUT, ">$filename") || die "Couldn't write $filename: $!"; foreach my $overgo (@$overgos) { next unless($$overgo{overgo}); my ($primer1, $primer2) = primerMaker($$overgo{overgo}, $overlap); my $name; if ($prefix) { $name = sprintf('%s%02d', $prefix, $number++); } else { $$overgo{defline} =~ /^([A-Za-z0-9-_.]+)/; $name = $1; } print OUT "$name.1\t$primer1\n$name.2\t$primer2\n"; # if ($prefix) { # $name = sprintf('%s%02d', $prefix, $number++); # } else { # $name = $$overgo{name} = # print OUT "$name.1\t$primer1\n$name.2\t$primer2\n"; # } } close OUT; } ######################################################################### # printResults - print out a summary of pips chosen and such things # # INPUTS: $selected_pips -- array of chosen overgos # $all_pips -- array of all possible overgos # $opts -- array of options. ######################################################################### sub printResults { my $fh = shift; my $selected_pips = shift; my $all_pips = shift; my $opts = shift; my $run_str = shift; my $num_pips = 0; my $last_pip; my ($tot_og_id, $tot_pip_id, $tot_dist, $tot_len, $tot_gc); print $fh "% $run_str\n\n"; print $fh " Coordinates Dist. Len. \%id(og) %id(pip) %GC\n"; foreach my $pip (@{$selected_pips}) { next if (! exists $$pip{overgo}); my $GC = ($$pip{overgo} =~ tr/GC//) / $$opts{overgo_length}; my $dist; my $coords; if ($$opts{coordinate_name} eq 'query_start') { $coords = "($$pip{query_start}-$$pip{query_end})"; $dist = abs($$pip{query_start} - $$last_pip{query_start}) if ($num_pips); } else { $coords = "($$pip{ref_start}-$$pip{ref_end})"; $dist = abs($$pip{ref_start} - $$last_pip{ref_start}) if ($num_pips); } if (!$num_pips++) { $dist = 0; } printf($fh "%-17s %6d %5d %5.1f %5.1f %2.0f\n", $coords, $dist, $$pip{pip_length}, $$pip{og_id} * 100, $$pip{identity} * 100, $GC * 100); $last_pip = $pip; $tot_og_id += $$pip{og_id}; $tot_pip_id += $$pip{identity}; $tot_dist += $dist; $tot_len += $$pip{pip_length}; $tot_gc += $GC; } print $fh "----------------------------------------------------------------------\n"; printf($fh "%-17s %6d %5d %5.1f %5.1f %2.1f\n", " Mean:", $tot_dist / (($num_pips - 1)||1), $tot_len / $num_pips, $tot_og_id / $num_pips * 100, $tot_pip_id / $num_pips * 100, $tot_gc / $num_pips * 100); printf($fh "Picked %d out of %d possible overgos\n", $#$selected_pips + 1, $#$all_pips + 1); } ######################################################################### # getParams - gets all options from the command line or keeps defaults # dies if required options are not included # # uses Getopt::Std and reads @ARGV # defaults can be modified here # returns hash of options and values # # THIS LIST IS OUT OF DATE... JUST LOOK BELOW # pick_from_query - 0 to make overgo from reference sequence # 1 to make from query # identity_cutoff - minimum proportion identical bases for pip # ref_offset - offset from begining of reference sequence at which to # begin looking for pips # gap_length - optimal spacing between overgos # overgo_length - the length of the overgo probe to design # overgo_overlap - the overlap between primers for the probe # overgo_target_gc- the target GC percentage to go for # overgo_gc_wiggle- the maximum difference from the overgo_target_gc # for an overgo # debug - write debugging messages to stdout and logfile # overgo_filename - file to write picked overgos to # all_overgo_filename - file to write all possible overgos to (defaults # to none) ######################################################################### sub getParams { my %opts; my $filetype; use Getopt::Std; getopts('ri:s:g:l:o:t:w:da:f:p:n:hS:vQ', \%opts); # some minimal sanity checking of options my $errstr; if (!$opts{h} and !$opts{v}) { $errstr .= "overgo length must be even (was -l$opts{l})\n" if ($opts{l} && ($opts{l} % 2 or $opts{l} < 2)); $errstr .= "overlap must be even (was -o $opts{o})\n" if ($opts{o} && ($opts{o} % 2 or $opts{o} < 1)); $filetype = getFileType($ARGV[0]); $errstr .= "$filetype\n" if ($filetype =~ /^ERROR/); if ($filetype =~ /FASTA/) { foreach my $opt (keys %opts) { if ($opt !~ /[ifpnlogws]/) { $errstr .= "-$opt is not a valid option in FASTA mode\n"; } } if (!exists $opts{p}) { $opts{p} = 'overgos.stj'; } } } if ($errstr) { print <<"END_USAGE"; Usage: soop [options] The input file must be either PipMaker verbose, standard BLAST2 (requires NHGRI::blastall.pm) or FASTA (make overgos, no picking) -h print a much longer help message (uses perldoc) -r Make probe from reference (Sbjct) sequence (default is to make from the query sequence) -Q use the query coordinates for spacing instead of reference -i Proportion minimum identity for probes to be made [.80] -f Write overgos file (defaults to overgos) -p Write overgo primers file (overgos.stj in FASTA mode else not) -a Saves all possible overgos (not written by default) -S Write summary information to file (not written by default) -n Name to prepend to overgo name (defaults to first word of defline) -s Start making probes at this offset in bases [0] -g Optimal spacing between probes in kb [30] -l Overgo length [36] -o Overlap between primers [8] -t Target gc proportion for probes [.5] -w Amount GC proportion can vary (wiggle) from the target [.06] END_USAGE print "\n$errstr" if ($errstr); die("\n"); } elsif ($opts{h}) { exec('perldoc', $0); die "the -h option requires the perldoc script to be in your path"; } elsif ($opts{v}) { my @r=(q$Revision: 2.16 $=~/\d+/g); printf("soop version: %d."."%02d"x$#r."\n",@r); exit; } return ( pick_from_query => $opts{r} ? 0 : 1, coordinate_name => $opts{Q} ? 'query_start' : 'ref_start', identity_cutoff => $opts{i} || .80, ref_offset => $opts{s} || 0, gap_length => $opts{g} || 30, overgo_length => $opts{l} || 36, overgo_overlap => $opts{o} || 8, overgo_target_gc => $opts{t} || .5, overgo_gc_wiggle => $opts{w} || .06, debug => $opts{d} || 0, verbose_filename => $ARGV[0], input_filetype => $filetype, overgo_filename => $opts{f} || 'overgos', primer_filename => $opts{p} || '', # if '' then don't write all_overgo_filename => $opts{a} || '', # if '' then don't write results_filename => $opts{S} || '', # ditto primer_prefix => $opts{n} || '', ); } ######################################################################### # getFileType - determines the type of a file (ie either BLAST2 or # PipMaker # INPUTS: $filename - file to open and try to determine type of # OUTPUTS: returns a string describing the file type # 'BLAST2' - a standard blast output # 'PipMaker' - a PipMaker verbose file # 'FASTA' - a plain FASTA formatted file # 'FASTA-overgos' - a FASTA file written by this program # 'ERROR: ...'- if file doesn't exist or can't be opened # the specific error will follow the : ######################################################################### sub getFileType { my $filename = shift || ''; my $filetype = 0; my $fasta = 0; # return "ERROR: File doesn't exist" unless (-e $filename); -f $filename or return "ERROR: File \"$filename\" does not exist"; open FILE, $filename or return "ERROR: $! opening $filename"; # print "$filename: $!\n"; # read a few lines to check my $line; for(my $i = 0; $i < 5 && !eof(FILE); $i++) { $line = ; $filetype = 'BLAST2' if ($line =~ /^BLASTN 2/); $filetype = 'PipMaker' if ($line =~ /^Seq 2 = "/); $filetype = 'FASTA-overgos' if ($line =~ /^>.* ### pip\:\[ref\:/); last if ($filetype); $fasta = 'IMPOSSIBLE' if ($line !~ /^>\w+|^[ACTGXNactgxn\w]*$/); } close FILE; if ($filetype) { return $filetype } elsif ($fasta ne 'IMPOSSIBLE') { return 'FASTA'; } else { # if we've gotten here I don't know what kind of file it is return 'ERROR: File type unknown'; } } ######################################################################### # parseVerbose - read and parse the verbose file for pips # calls getPipFromSeq # # INPUTS: $filename -- the name of the verbose file # OUTPUTS: returns \@pips -- the address of an array of hashes of pips # see getPipFromSeq for format of this array ######################################################################### sub parseVerbose { my $filename = shift; # set up some variables that need to be scoped to this sub my($gap_indicator) = ' '; # character that indicates a gap in sequence my( $contig_name, # the defline of the query contig (according to pipmaker/blastz) $new_contig_name,# the contig_name for the next hit $ref_start, # starting coordinates of ref sequence $ref_end, # ending coordinates of ref sequence $ref_seq, # the reference sequence for this hit $query_start, # see above $query_end, # $query_seq, # $ref_line_read, # was the last line we read from the reference sequence? [0,1] $tick_line_read, # was the tick line (from PipMaker 0 . : . :) $cmp_str # the line containing the comparisons (ie. || |---|:|) ); my @pips; open(VERBOSE, "<$filename") || die "Couldn't open $filename: $!"; while($_ = ) { # skip blank lines next if (/^$/); # check for new contig if (/Seq 2 \= "\>(.+)"$/) { # new contig begins here $new_contig_name = $1; } elsif (/Begins at \((\d+),(\d+)\) and Ends at \((\d+),(\d+)\)/) { # line that gives coordinates my($new_ref_start, $new_query_start, $new_ref_end, $new_query_end) = ($1, $2, $3, $4); if ($ref_seq && $query_seq) { # if this is not the first hit (ie. we have sequence) # get all the pips from this hit if ($debug) { writeLog("Got $contig_name"); writeLog("($ref_start, $query_start) -> ($ref_end, $query_end)"); } my $query_orient = ($query_start < $query_end) ? '+' : '-'; getPipFromSeq(\@pips, $contig_name, $ref_seq, $ref_start, '+', # orient always + $query_seq, $query_start, $query_orient, $gap_indicator); } # initialize all the hit variables $contig_name = $new_contig_name; ($ref_start, $query_start, $ref_end, $query_end) = ($new_ref_start, $new_query_start, $new_ref_end, $new_query_end); $ref_line_read = 0; $tick_line_read = 0; $ref_seq = ''; $query_seq = ''; $cmp_str = ''; # Bug Fix... # needed to take care of cases where the tick line is to short to contain # '.' or ':' as well as allow for cases where there is a big gap and no # sequence for the reference line (otherwise the ref_line is wrong) } elsif (!$tick_line_read && !$ref_line_read && /^\s*\d+ {1,5}[ :.]*$/) { $tick_line_read = 1; } elsif (!$ref_line_read && /^\s*\d+ ([CcAaTtGgNnMmXx ]+)$/) { $ref_line_read = 1; $ref_seq .= $1; } elsif ($ref_line_read && /^ {8}([:| -]+)$/) { $cmp_str .= $1; } elsif ( $ref_line_read && /^\s*\d+ ([CcAaTtGgNnMmXx ]+)$/) { $ref_line_read = 0; $tick_line_read = 0; $query_seq .= $1; } } # END while($_ = ) { close(VERBOSE); # now take care of the last hit if ($debug) { writeLog("Got $contig_name"); writeLog("($ref_start, $query_start) -> ($ref_end, $query_end)"); } my $query_orient = ($query_start < $query_end) ? '+' : '-'; getPipFromSeq(\@pips, $contig_name, $ref_seq, $ref_start, '+', $query_seq, $query_start, $query_orient, $gap_indicator); if ($debug) { foreach my $pip (@pips) { writeLog($$pip{contig_name} . "\n" . $$pip{identity} . "\n" . $$pip{pip_length} . " Ref: ($$pip{ref_start}-$$pip{ref_end})\n" . "Query: ($$pip{query_start}-$$pip{query_end})\n" . "QuerySeq: $$pip{query_seq}\n\n"); } } # now dump the data to a file so it can be used for testing the next thing # in line (The actual picking algorithm) # open OUT, ">piplist.vi" or die "Couldn't create piplist.vi: $!"; # use Data::Dumper; # print OUT Dumper(\@pips); return \@pips; } ######################################################################### # parseBlast - read and parse basic blast output for ungapped alignments # calls getPipFromSeq # uses Bio::SearchIO to read the blast report (from bioperl project) # # INPUTS: $_[0] - $filename - the name of the blast output file # OUTPUTS: returns \@pips -- the address of an array of hashes of pips # see getPipFromSeq() for the format ######################################################################### sub parseBlast { my $filename = shift; my (@pips, $contig_name, $subject_start, $subject_orient ); my $gap_indicator = '-'; unshift(@INC, '.'); require Bio::SearchIO; shift(@INC); my $searchio = new Bio::SearchIO(-format => 'blast', -file => $filename); while (my $result = ($searchio->next_result)) { while (my $hit = $result->next_hit) { my $contig_name = $hit->name . ' ' . $hit->description; while (my $hsp = $hit->next_hsp) { # I have to do this because my blastz parser is brain dead # and returns a start coordinate greater than an end coordinate # when reverse complemented. (Maybe I should change it instead) my $ref_start = ($hsp->strand('hit') == 1) ? $hsp->start('hit') : $hsp->end('hit'); # this is probably unnecessary since blast files generally # revcomp the subject sequence, not the query. I'm including # it for completeness, just in case. my $query_start = ($hsp->strand('query') == 1) ? $hsp->start('query') : $hsp->end('query'); getPipFromSeq(\@pips, $hit->name . ' ' . $hit->description, $hsp->hit_string, $ref_start, ($hsp->strand('hit') == 1) ? '+' : '-', $hsp->query_string, $query_start, ($hsp->strand('query') == 1) ? '+' : '-', $gap_indicator); } } } return \@pips; } ######################################################################### # getPipFromSeq -- takes in the sequence from a PipMaker verbose file hit # (and eventually a blast hit) # returns a @pips array of format: # $pips[i]{'contig_name'} => the name of the starting contig # 'ref_start' => starting coordinate in the reference sequence # 'ref_end' => ending coordinate in reference sequence # 'query_start' => starting coordinate in query sequence # 'query_end' => ending coordinate in query sequence # 'ref_seq' => reference pip sequence # 'query_seq' => query pip sequence # 'identity' => the proportion identity between sequences: a ratio ie. .95 (for 95%) # 'pip_length' => length of non-gapped alignment # # The algorithm I use is somewhat ugly and basically a hack, if speed is a # concern it should be re-written. -- Arjun 12/22/00 # # It was rewritten! -- Arjun 3/14/01 ;-) # # Then it was rewritten again, this time to work with both blast2 and # blastZ output # # I still don't like it; it's slow. But its easy to understand (at # least for me.) ######################################################################### sub getPipFromSeq { my $pips = shift; # array to contain pips my $contig_name = shift; # ie. def line from query sequence my $ref_seq = uc shift; # reference/subject sequence my $ref_start = shift; my $ref_orient = shift; # '+' or '-' (- is for reverse compliment) my $query_seq = uc shift; # query sequence my $query_start = shift; my $query_orient = shift; # '+' or '-' my $gap_indicator = shift; # the character that indicates a gap in # the sequences # initialize some loop variables my $ref_pip = ''; # holds the current reference/subject pip my $query_pip = ''; # holds the current query pip my $pip_start = 0; # holds the offset of the current pip start my $ref_gaps = 0; # gaps in the ref/subj seq (to get correct # coordinates) my $query_gaps = 0; # gaps in the query seq my $num_identical = 0; # the number of identical bases per pip my $i = 0; # loop counter - needed after end of loop # the actual coordinates of the pip start my ($ref_pip_start, $ref_pip_end, $query_pip_start, $query_pip_end); for($i = 0; $i < length($ref_seq); $i++) { # character holders (hold one base) my $r = substr($ref_seq, $i, 1); my $q = substr($query_seq, $i, 1); # DEBUG temporary debugging stuff!!! if (!$r or !$q) { warn "ERROR on \$r=$r, \$q=$q, \$i=$i because of parsing error\n" . "Contig: $contig_name\n" . "Ref: $ref_start\n" . "Query: $query_start\n"; } # END DEBUG if ($q =~ tr/ACTG// and $r =~ tr/ACTG//) { $ref_pip .= $r; $query_pip .= $q; $num_identical++ if ($r eq $q); } else { if ($ref_pip) { if ($query_orient eq '-') { # query is rev. comp. $ref_pip_start = $ref_start + $pip_start - $ref_gaps; $ref_pip_end = $ref_start + $i - $ref_gaps - 1; $query_pip_start= $query_start - $i + $query_gaps + 1; $query_pip_end = $query_start - $pip_start + $query_gaps; } elsif ($ref_orient eq '-') { # ref is rev. comp. $ref_pip_start = $ref_start - $i + $ref_gaps + 1; $ref_pip_end = $ref_start - $pip_start + $ref_gaps; $query_pip_start= $query_start + $pip_start - $query_gaps; $query_pip_end = $query_start + $i - $query_gaps - 1; } else { # neither is rev. comp. $ref_pip_start = $ref_start + $pip_start - $ref_gaps; $ref_pip_end = $ref_start + $i - $ref_gaps - 1; $query_pip_start= $query_start + $pip_start - $query_gaps; $query_pip_end = $query_start + $i - $query_gaps - 1; } push(@$pips, { contig_name => $contig_name, ref_start => $ref_pip_start, ref_end => $ref_pip_end, query_start => $query_pip_start, query_end => $query_pip_end, ref_seq => $ref_pip, query_seq => $query_pip, pip_length => length($ref_pip), identity => $num_identical / length($ref_pip) } ); # now re-initialize some variables $num_identical = 0; $ref_pip = ''; $query_pip = ''; } if ($r eq $gap_indicator) { $ref_gaps++; } elsif ($q eq $gap_indicator) { $query_gaps++; } $pip_start = $i; } } # end of 'for' loop thru ref seq ######################################################################### ## start of code added for bug fix by Bob Sullivan ## (Code is just copied from break-processing inside for loop.) ## Need a final 'push' onto array at end of ref sequence. ######################################################################### if ($ref_pip) { # This used to +/- by 2, changed it to 1, since a seq starting # at bp 3 that is 4 bp long ends at 6 (3 + 4 - 1) if ($query_orient eq '-') { # query is rev. comp. $ref_pip_start = $ref_start + $pip_start - $ref_gaps; $ref_pip_end = $ref_start + $i - $ref_gaps - 1; $query_pip_start= $query_start - $i + $query_gaps + 1; $query_pip_end = $query_start - $pip_start + $query_gaps; } elsif ($ref_orient eq '-') { # ref is rev. comp. $ref_pip_start = $ref_start - $i + $ref_gaps + 1; $ref_pip_end = $ref_start - $pip_start + $ref_gaps; $query_pip_start= $query_start + $pip_start - $query_gaps; $query_pip_end = $query_start + $i - $query_gaps - 1; } else { # neither is rev. comp. $ref_pip_start = $ref_start + $pip_start - $ref_gaps; $ref_pip_end = $ref_start + $i - $ref_gaps - 1; $query_pip_start= $query_start + $pip_start - $query_gaps; $query_pip_end = $query_start + $i - $query_gaps - 1; } push(@$pips, { contig_name => $contig_name, ref_start => $ref_pip_start, ref_end => $ref_pip_end, query_start => $query_pip_start, query_end => $query_pip_end, ref_seq => $ref_pip, query_seq => $query_pip, pip_length => length($ref_pip), identity => $num_identical / length($ref_pip) } ); } ######################################################################### ## end of code fix by Bob Sullivan ######################################################################### } ######################################################################### # overgoFilter - get all possible overgos # # INPUTS: $pips_in (array of hashes containing at least: # query_seq (and/or ref_seq) -- the sequence from which # to design the primer) # $pick_from_query -- 0 to pick from reference sequence # 1 to pick from query sequence # $overgo_length -- length of overgo to make # $overlap -- amount each overgo primer should overlap # $target_gc -- the target GC ratio for the overgos # $wiggle_room -- the amount the GC ratio can vary # OUTPUTS: returns a reference to an array of hashes for each pip from # which an overgo could be made containing everything passed # in through $pips_in and 'overgo' key/value containing the overgo # sequence. # - Now including element 'og_id' which is the proportion # identical for the overgo itself (2/21/01) # - Now including og_offset which is the offset from the # start of the pip (7/16/02) ######################################################################### sub overgoFilter { my($pips_in) = shift; my($pick_from_query) = shift; my($overgo_length) = shift; my($overlap) = shift; my($target_gc) = shift; my($wiggle_room) = shift; my($identity_cutoff) = shift; my(@overgos); # array of hashes of pip and overgo information # with all pips from which we can make overgos writeLog("Begin overgoFilter with " . $#$pips_in . " pips") if ($debug); foreach my $pip (@$pips_in) { if (length($$pip{ref_seq}) >= $overgo_length) { ($$pip{overgo}, $$pip{og_id}, $$pip{og_offset}) = makeOvergoFromComp( $pip, $pick_from_query, $overgo_length, $overlap, $target_gc, $wiggle_room); # take this one if it has an identity that's good enough. if ((defined $$pip{og_id}) && ($$pip{og_id} > $identity_cutoff)) { writeLog("Got overgo: \@$$pip{ref_start} $$pip{overgo}" . sprintf(' %.1f% ', $$pip{og_id} * 100) ) if ($debug); push @overgos, $pip; } } } writeLog("End overgoFilter with $#overgos overgos\n") if ($debug); if(! exists($overgos[0])) { die "No qualifying overgos (with length > $overgo_length)" } return \@overgos; } ######################################################################### # pickOvergos - picks which of all possible overgos to actually use # # INPUTS: $overgos -- array of hashes of overgos each array element # must contain at minimum the keys # 'identity' -- the percent identity of the pip # 'ref_start' -- the reference start coordinate # or # 'query_start' -- the query start coordinates # (depends on the '-Q' option) # # $gap_length -- optimal spacing between probes in (kb) # $identity_cutoff -- the minimum percent identity to make # probes from # $start_at -- the reference sequence to start making overgos at # # OUTPUTS: returns a reference to an array of picked # overgos with each element copied from the $overgos array # passed in. ######################################################################### sub pickOvergos { my $og_in = shift; # pointer to array of pip/overgo hashes my $gap_length = shift() * 1000; # because gaps are in kb my $identity_cutoff = shift; # the minimum percent identity my $start_at = shift; # the offset to start making overgos from my $overlap = shift; my $target_gc = shift; my $loc = shift; #my $loc = 'ref_start'; my $og_length = length($$og_in[0]{overgo}); my $id_wiggle = .03; # the amount the proportion identity can vary to be # considered equivelent. my @og_out; # array to contain all overgos to be made my $i = 0; # index of the currently examined overgo my $i_best; # first make sure to get them so we can pick more easily @$og_in = sort { $a->{$loc} <=> $b->{$loc} } @$og_in; # first make sure we're past the offset to start looking while ($$og_in[$i]{$loc} <= $start_at) { $i++; } $i_best = $i++; # now find the best one in $gap_length after that while ($i <= $#$og_in && $$og_in[$i]{$loc} <= $start_at + $gap_length) { if ($$og_in[$i]{og_id} > $$og_in[$i_best]{og_id}) { if ($$og_in[$i]{og_id} > $$og_in[$i_best]{og_id} + $id_wiggle) { $i_best = $i; # if the identity isn't vastly better than maybe the GC content is } elsif (getGCscore($$og_in[$i]{overgo}, $overlap, $og_length, $target_gc) < getGCscore($$og_in[$i_best]{overgo}, $overlap, $og_length, $target_gc)) { $i_best = $i; } } $i++; } push @og_out, $$og_in[$i_best]; writeLog("First one is at overgo $i_best") if ($debug); # now that we've got the first one picked lets pick the rest while ($i <= $#$og_in) { # make sure we skip at least $gap_length / 2 while ($i <= $#$og_in && $$og_in[$i]{$loc} < ($og_out[-1]{$loc}) + $gap_length / 2) { $i++; } $i_best = $i++; while ($i <= $#$og_in && $$og_in[$i]{$loc} <= $og_out[-1]{$loc} + $gap_length * 1.5) { if ($$og_in[$i]{og_id} > $$og_in[$i_best]{og_id}) { if ($$og_in[$i]{og_id} > $$og_in[$i_best]{og_id} + $id_wiggle) { $i_best = $i; # if it's not _way_ better than see if it has a better location } elsif (abs($$og_in[$i]{$loc} - $og_out[-1]{$loc} - $gap_length) < abs($$og_in[$i_best]{$loc} - $og_out[-1]{$loc} - $gap_length) ) { $i_best = $i; } } $i++; } writeLog ("Picked ".$$og_in[$i_best]{overgo}.'@'.$$og_in[$i_best]{$loc}) if ($debug); push @og_out, $$og_in[$i_best] if ($i_best <= $#$og_in); $i = $i_best + 1; } die "No qualifying overgos of identity > $identity_cutoff" if (!@og_out); return \@og_out; } ######################################################################### # writeOvergos - writes overgos to a file in FASTA format # # INPUTS: $_[0] - array of overgos containing at minimum # 'contig_name' => the name of the starting contig # 'ref_start' => starting coordinate in the reference sequence # 'ref_end' => ending coordinate in reference sequence # 'query_start' => starting coordinate in query sequence # 'query_end' => ending coordinate in query sequence # 'og_id' => the proportion identity between sequences: a ratio ie. .95 (for 95%) # 'pip_length' => length of non-gapped alignment # 'overgo' => sequence of actual overgo to make (ie the 36 bases) # OUTPUTS: writes to $filename ($_[1]) in a FASTA format the fields # listed above. ######################################################################### sub writeOvergos { my $overgos = shift; my $filename = shift; my $pick_from_query = shift; # labels for the defline my($ref, $query); if ($pick_from_query) { $ref = 'ref'; $query = 'Query'; } else { $ref = 'Ref'; $query = 'query'; } writeLog("Writing overgos to file") if ($debug); open OUT, ">$filename" or die "Couldn't write to $filename: $!"; no warnings; foreach my $pip (sort {$a->{ref_start} <=> $b->{ref_start}} @{$overgos}) { ### Old defline format # printf OUT ">%s %s-%s <--> %s-%s Len: %3d Id: %.1f", $$pip{contig_name}, # $$pip{ref_start}, $$pip{ref_end}, $$pip{query_start}, $$pip{query_end}, # $$pip{pip_length}, $$pip{og_id} * 100; # print OUT "%\n"; ### new defline format my $diff; if (exists $$pip{query_seq}) { if ($pick_from_query) { $diff = seq_diff($$pip{overgo}, substr($$pip{ref_seq}, $$pip{og_offset}, length($$pip{overgo}))); } else { $diff = seq_diff($$pip{overgo}, substr($$pip{query_seq}, $$pip{og_offset}, length($$pip{overgo}))); } } printf OUT (">%s ### pip:[$ref:(%d-%d) $query:(%d-%d) id:%s] og:[id:%s offset:%d diff:%s]\n", $$pip{contig_name}, $$pip{ref_start}, $$pip{ref_end}, $$pip{query_start}, $$pip{query_end}, sprintf("%1.1f", $$pip{identity} * 100) . '%', sprintf("%1.0f", $$pip{og_id} * 100) . '%', $$pip{og_offset}, $diff ); printf OUT "$$pip{overgo}\n"; } close OUT; } ######################################################################### # makeOvergoFromComp # attempts to find the overgo with the best identity secondarily # optimizing for GC content/structure within the constraints # imposed by the target_gc, the overgo length, the overlap # and the "wiggle_room" which is the proportion the probe can vary # from the ideal "target_gc" # *** Assumes even numbers for overgo_length and overlap *** # # Algorithm: # The sequence is broken up into every possible overgo # If the overgo is within the allowable GC percentage range # Then score based on percent identity picking the highest identity # For those with the same identity take the one that maximizes # the following score. # Score it by summing: # - the absolute value of the target GC - proportion GC of # the left side unique # - the absolute value of the target GC - proportion GC of # the overlap region # - the absolute value of the target GC - proportion GC of # the right side unique # and the lowest score wins # for two scores that are equal, the one that has G's or C's # on the ends of the overlap region wins # Just takes the first one with the best score and returns it # If none qualify within the allowable GC proportion range then # returns 0 # # getGCscore() contains ideas based on an algorithm from overgo_maker40 # by John D. McPherson at Washington University, St. Louis # # - Arjun Prasad # # INPUTS: sequence to make overgo from # total length of overgo # overlap between primers # target GC content (ie. .5 for 50% GC) # variation room around target GC content # String containing identity information: # Same length as the sequence containing # '|' for each match and something else for each # missmatch # OUTPUTS: returns - sequence of total overgo length in 5'->3' order, # - a proportion identical score (max 1.0), # - overgo_offset the offset from the start of the pip # returns undef if no overgo is within the limits ######################################################################### sub makeOvergoFromComp { my $pip = shift; my $make_from_query = shift; my $overgo_len = shift; my $overlap = shift; my $target_gc = shift; my $wiggle_room = shift; my $unique_len = $overgo_len / 2 - $overlap / 2; my $best_overgo; my $best_id = 0; #bug??? my $best_GC_score = 0; for(my $i = 0; $i < length($$pip{query_seq}) - $overgo_len; $i++) { # get the sequence for the current overgo my $overgo; if ($make_from_query) { $overgo = substr($$pip{query_seq}, $i, $overgo_len); } else { $overgo = substr($$pip{ref_seq}, $i, $overgo_len); } # first check to see if overall GC percentage qualifies if (abs($target_gc - ($overgo =~ tr/GC//) / $overgo_len) < $wiggle_room) { # Can't do this anymore because Blastall.pm doesn't return the # comparison line. # my $id = (substr($$pip{cmp_str}, $i, $overgo_len) =~ tr/|//) / # $overgo_len; # here comes the uglyer replacement way to determine the percent # identity... I'll fix this later I hope - Arjun my $id = 0; for my $n (0 .. $overgo_len - 1) { $id++ if (substr($$pip{query_seq}, $i + $n, 1) eq substr($$pip{ref_seq}, $i + $n, 1) ); } $id = $id / $overgo_len; if ($id > $best_id) { # if better identity then best so far $best_id = $id; $best_GC_score = getGCscore($overgo, $overlap, $overgo_len, $target_gc); $best_overgo = $i; } elsif ($id == $best_id) { # if equal identities but better GC score my $gc = getGCscore($overgo, $overlap, $overgo_len, $target_gc); if ($gc < $best_GC_score) { $best_id = $id; $best_GC_score = $gc; $best_overgo = $i; } } } } # Make sure we've got a good overgo otherwise return undef if ($best_id != 0) { if ($make_from_query) { return (substr($$pip{query_seq}, $best_overgo, $overgo_len), $best_id, $best_overgo); } else { return (substr($$pip{ref_seq}, $best_overgo, $overgo_len), $best_id, $best_overgo); } } else { return (undef, undef); } } ######################################################################### # getGCscore -- gets an OvergoMaker40 style GC content score for an # overgo. # # idea for algorithm from overgomaker40 by John D. McPherson at Washington # University, St. Louis # # INPUTS: $overgo - sequence of overgo # $overlap - overlap between complimentary probes # $overgo_len - total length of the overgos # $target_gc - target GC proportion (usually .5) # OUTPUTS: returns GC score, lower is better ######################################################################### sub getGCscore { my $overgo = uc shift; my $overlap = shift; my $overgo_len = shift; my $target_gc = shift; my $unique_len = $overgo_len / 2 - $overlap / 2; my $seq = substr($overgo, 0, $unique_len); my $score = abs($target_gc - ($seq =~ tr/GC//) / $unique_len); $seq = substr($overgo, $unique_len, $overlap); $score += abs($target_gc - ($seq =~ tr/GC//) / $overlap); $seq = substr($overgo, -$unique_len); $score += abs($target_gc - ($seq =~ tr/GC//) / $unique_len); # GC content of the edges of the overlap region is better if it's a G # or a C, but this should be less important than the GC content of # the three sections as added above. # K&R says that a float should have at least 6 significant digits. # (Damn, that was a long time ago!) $score -= ( (substr($overgo, $unique_len, 1) =~ tr/GC//) + (substr($overgo, -$unique_len-1, 1) =~ tr/GC//)) * 1E-6; return $score; } ######################################################################### # getIdFromStr - returns the percent identity for a sequence using # the comparison line from a blast/PipMaker output # counting gaps as non-aligned bases # INPUTS: $cmp_str -- comparison string to use, # format is anything except a '|' is considered # a mismatch so a string of '||| :' would return # an identity of .6 (as in .6 of the bases are the # same) # OUTPUTS: returns a float of the proportion of bases that align from # the input string ######################################################################### #sub getIdFromStr { # my $cmp_str = shift; # $cmp_str && return ($cmp_str =~ tr/|//) / length($cmp_str); #} ######################################################################### # primerMaker - turns overgo sequence into two complimentary primers # for labeled extension # *** assumes even numbers for length($sequence) and overlap *** # INPUTS: $sequence, $overlap # OUTPUTS: Returns a two element array of strings containing # the two primers of length $sequence / 2 + $overlap / 2; ######################################################################### sub primerMaker { my($sequence) = uc shift; my($overlap) = shift; my($primer_length) = length($sequence) / 2 + $overlap / 2; my($template_primer) = substr($sequence, 0, $primer_length); # grab the end of the sequence and reverse it my($rev_comp_primer) = join('',reverse(split(//, substr($sequence, -$primer_length)))); # now convert to complimentary bases $rev_comp_primer =~ tr/ATCG/TAGC/; return ($template_primer, $rev_comp_primer); } ######################################################################### # writeLog - write entry to log file (mostly for debugging) # # INPUTS: $_[0] contains the string to write # $logfilename (just grabs a global) # OUTPUTS: writes to file named by $logfilename (global) ######################################################################### sub writeLog { open LOG, ">>$logfilename" or warn "Error opening $logfilename: $!"; print LOG join('', @_) . "\n"; close LOG; } ######################################################################### # makeOvergoFromSeq # attempts to find the best possible overgo within the constraints # imposed by the target_gc, the overgo length, the overlap # and the "wiggle_room" which is the proportion the probe can vary # from the ideal "target_gc" # *** Assumes even numbers for overgo_length and overlap *** # # Algorithm: # The sequence is broken up into every possible overgo # If the overgo is within the allowable GC percentage range # Score it by summing: # - the absolute value of the target GC - proportion GC of # the left side unique # - the absolute value of the target GC - proportion GC of # the overlap region # - the absolute value of the target GC - proportion GC of # the right side unique # and the lowest score wins # for two scores that are equal, the one that has G's or C's # on the ends of the overlap region wins # Just takes the first one with the best score and returns it # If none qualify within the allowable GC proportion range then # returns undef # # - Arjun Prasad ######################################################################### sub makeOvergoFromSeq { my($sequence) = uc $_[0]; my($overgo_length) = $_[1]; my($overlap) = $_[2]; my($target_gc) = $_[3]; my($wiggle_room) = $_[4]; my($start_at) = $_[5]; # the unique sequence at each side of the DNA (assumes sll even numbers) my($unique_length) = $overgo_length / 2 - $overlap / 2; my($length) = length($sequence); # total sequence length my($best_gc_score) = 10000; # _really_ high number so everything will be # lower/better than it # (I know... it's ugly... sorry) my($best_og_offset) = -1; # the index of the best overgo my $best_og = ''; for my $i ($start_at .. $length - $overgo_length) { my $og = substr($sequence, $i, $overgo_length); # make sure we've only got ACTG next if (($og =~ tr/ACTG//) < $overgo_length); # make sure the GC is within acceptable ranges next if (abs($target_gc - ($og =~ tr/GC//) / $overgo_length) > $wiggle_room); my $gc_score = getGCscore($og, $overlap, $overgo_length, $target_gc); if ($gc_score < $best_gc_score) { $best_gc_score = $gc_score; $best_og_offset = $i; $best_og = $og; } } if ($best_og_offset >= 0) { # return (substr($sequence, $best_og_offset, $overgo_length), return ($best_og, $best_og_offset); } else { return undef; } # # for each overgo it is possible to make... # for ($i = 0; $i <= ($length - $overgo_length); $i++) { # # first check to see if GC percentage qualifies # my($overgo) = substr($sequence, $i, $overgo_length); # if (abs($target_gc - ($overgo =~ tr/GC//) / $overgo_length) < $wiggle_room) { # if ($debug) { print "Overgo qualifies: $overgo\n" } # # now check the left unique portion # my($seq) = substr($overgo, 0, $unique_length); # my($overgo_score) = abs($target_gc - ($seq =~ tr/GC//) / $unique_length); # # now check the overlap portion # $seq = substr($overgo, $unique_length, $overlap); # $overgo_score += abs($target_gc - ($seq =~ tr/GC//) / $overlap); # my($end_gc) = ((substr($seq,-1) =~ tr/GC//) + (substr($seq, 0, 1) =~ tr/GC//)); # # now check the right unique portion # $seq = substr($overgo, -$unique_length); # $overgo_score += abs($target_gc - ($seq =~ tr/GC//) / $unique_length); # # is this $overgo_score better than the last one # if (($overgo_score < $best_overgo_score) # || ( ($overgo_score == $best_overgo_score) # && ($best_overgo_end_gc < $end_gc) )) { # if ($debug) { print "New High Score = $overgo_score\tEnd GC = $end_gc\n"; } # $best_overgo = $overgo; # $best_overgo_score = $overgo_score; # $best_overgo_end_gc = $end_gc; # } else { # if ($debug) { print "."; } # } # } else { # if ($debug) { print "Overgo doesn't qualify\n"; } # } # } # # return the best overgo, if there isn't one that qualifies # # return 0 # return $best_overgo; } ######################################################################### # seq_diff - create a formatted directional comparison between two # sequences from which you can reconstitute both original sequences # INPUTS: $seq1, $seq2 (should only contain nucleotides or [_- ] # OUTPUTS: returns the seq_diff formatted string ######################################################################### sub seq_diff { my $seq1 = uc(shift); my $seq2 = uc(shift); my $diff; $seq1 =~ tr/ -/_/; $seq2 =~ tr/ -/_/; for (my $i = 0; $i < length($seq1); $i++) { my $char1 = substr($seq1, $i, 1); if ($char1 ne substr($seq2, $i, 1)) { $diff .= "$char1>" . substr($seq2, $i, 1); } else { $diff .= $char1; } } return $diff; } ######################################################################### # END PROGRAM -- BEGIN POD DOCUMENTATION ######################################################################### =head1 NAME B - system for optimized overgo picking =head1 SYNOPSIS S B<[options]> IPipMaker/BLAST filenameE>> S B<[options]> B<-p> Iprimer filenameE> IFASTA filenameE>> =head1 DESCRIPTION Soop designs "overgo" DNA hybridization probes either from a sequence comparison or a straight FASTA file. The basic idea behind designing probes from sequence comparisons is that orthologous sequences between two species that are highly concerved are likely to also be conserved in a third species. Soop runs in these two modes depending on the type of file passed to it as the last parameter on the command line. =over 4 =item FASTA If a FASTA formatted file is passed to B then the sequences are converted to overgos optimizing for GC content. The only relevant options are: B<-f> B<-p> B<-i> B<-n> B<-l> B<-o> B<-t> B<-w> =item NCBI BLAST or PipMaker Verbose If a sequence comparison output is sent to B it will try to design and pick overgos based on spacing between probes and sequence homology for use in comparitive species probe design. All options are relevant. =back =head1 OPTIONS =over 4 =item B Either PipMaker verbose or standard BLAST 2.0 (requires NHGRI::blastall.pm) to pick overgos from species comparison data or FASTA to just make overgos. This must be the last argument when calling B =item B<-a> I "All overgos" File containing the overgo from every pip that meets the GC content requirements optimizing first for sequence similarity and second for GC content. This file is not written by default. =item B<-f> I File containing the selected overgos only maximizing percent identity and spacing then GC content in BLAST or PipMaker modes. In FASTA mode simply contains all the overgos that could be made maximizing for GC content. Defaults to "overgos" =item B<-g> I The ideal spacing between probes in kilobases. Soop trys to make the probes from ungapped alignments this far apart. Defaults to I<30>. =item B<-h> Print out this message (requires C in your path) and exit. (Overrides all other options) =item B<-i> I Proportion minimum identity for probes to be made. In other words if B is .80 no overgo will be made with less than 80% similarity between the two sequences. Defaults to I<.80>. =item B<-l> I Length in bases of overgo to design. If B is 36 then the overgos will be made of two primers who's final lengths post extension will be 36 base pairs. Defaults to I<36>. =item B<-n> I Prefix for names of primers in .stj (primers) file. If this option is present the primers will be named ##.1 and ##.2 for each primer with the ## incrementing by one from 00. If a primer file is written and this option is not specified the primer names will be the first "words" on the defline that match the character class [A-Za-z0-9_-.]. =item B<-o> I Overlap in bases between the two primers. Defaults to I<8> =item B<-p> I Name of file to write primers in (In "Send to Jackie" format). When making overgos from a FASTA formatted file this is the primary output filename. Not written by default. =item B<-Q> Use the Query sequence for probe spacing. Otherwise uses the reference sequence to determine probe spacing. If you're blasting your reference sequence against a library of other sequences then use this option. =item B<-r> Make probes/overgos from the "reference" or "subject" sequence. This does not affect which coordinates are used for setting probe spacing. By default the probes are made from the Query sequence. =item B<-s> I Start at this offset (in bases). This option isn't used very much, it describes an offset to start making probes at. In other words if you wanted to make some probes but skip the first 100kb of reference sequence you would set this to 100000. Deafults to I<0>. =item B<-S> I Write summery information about the probes chosen to this file. Not written by default. =item B<-t> I Target proportion GC for the overgo as a whole. Defaults to I<.5> =item B<-v> Print the version number and exit. (Overrides all other options except B<-h>) =item B<-w> I Wiggle room for the GC proportion. In other words the amount the GC content can vary from the ideal GC content; if B<-t> was set to .5 and B<-w> was set to .06 than the GC content of the overgo would have to be in the range of 44% to 56% GC. Defaults to I<.06> =back =head1 OUTPUT FILE FORMATS =over 4 =item overgo file The overgo files (those written by the B<-f> and B<-a> options) are written in FASTA Format. With the defline in the following format: =back > ### pip:[ref:(-) query:(-) \ id:] og:[id:% offset: diff:] =over 4 =item The defline of the query sequence if in a comparison mode (BLAST or PipMaker). The defline of the fasta entry if in FASTA mode. =item pip:[ref:(-) query:(-) id:<%id>%] The characteristics of the pip that this overgo was made from. Start and ending coordinates in the reference/subject sequence and start and end coordinates in the query sequence. If in FASTA mode the reference coordinates (ref:(#-#)) will be 0 to the length of the fasta entry. Either "ref" or "query" will be capitalized to indicate which sequence the overgo was made from. =item og:[id:<%id> offset: The characteristics of the overgo itself. <%id> is the percent identity of the overgo alone. is the number of bases from the first base of the ungapped alignment or contig that the overgo starts. =item diff: This is a sequence comparison between the sequence of the overgo and the other sequence. From this you can retrieve either the overgo sequence or the sequence it was compared to. The format is the overgo sequence with a > after any base that changes where _ indicates a gap on either side. For example the following aligned sequences: ACTG TCAGA || |-:|-|| ACGGTCC GA would be represented "ACT>GG_>TCA>_GA" The format can be converted to the sequence of the overgo with the regular expression s/\>[ACTG]|_//g or to the sequence it was compared to with the regular expression s/[ACTG]\>|_//g. =item primer file Format of the primer file is simply the name of the primer then the sequence of that primer. Primer names are the overgo name with a .1 or .2 appended depending whcih strand the oligo is from. For example 4href03.1 GCTGGGTGTGCCTGGGGATCAT 4href03.2 GTTTTAACAATAAAATGATCCC =back =head1 USAGE This is a brief summary of how we use B. =over 4 =item 1 The first thing to do is to put together a "finished" ungapped reference sequence that will be used as a basis for the comparison and location estimation for other species. This should be your best estimate of the true finished sequence for the reference organism (human in our case). =item 2 Get comparison sequence from another species. We have used both finished mouse sequence from BACs that align to our reference sequence as well as now we are using mouse whole genome shotgun traces downloaded from Ensembl's web site (http://www.ensembl.org) that align to golden path coordinates similar to ours. =item 3 Mask the repeats from both sequences. We use RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker) to mask both the reference and the other species sequences. =item 4 Run an NCBI blast (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/) or PipMaker (http://bio.cse.psu.edu/pipmaker/) with a database of the reference sequence and a query of the other species. If you run PipMaker You'll need the Verbose text output (The one the web site says you don't want or need). =item 5 Finally we get to running B. You'll need to run B using either the BLAST2 standard text output or PipMaker verbose file. If you just want the default parameters B can be run as just C I>. =item 6 You I then re-blast the overgos file that B produces against appropriate databases to insure that it hasn't designed low complexity or repetitive probes. You should make sure that all high value hits make sense as coming from one position in a given genome. If you have to drop probes at this point you can use the B<-a> option to find nearby replacements. =back =head1 FILES Uses Bio::SearchIO to parse blast reports. Available at http://www.bioperl.org =head1 VERSION soop $Revision: 2.16 $ $Date: 2004/02/03 20:46:41 $ UTC =head1 BUGS Files with long stretches of NNNNNN's that span more than one line cause the PipMaker text file parser (parseVerbose) to barf and give incorrect data (subject becomes query). Will overwrite input file in FASTA mode without asking permission. This is especially insidious when the input filename is "overgos". (note to self: should be fixed in getParams()) Should have offset (B<-s>) or "start at" option for FASTA mode. =head1 AUTHOR Written by Arjun Prasad (aprasad@nhgri.nih.gov) Some ideas taken from overgomaker by John D. McPherson (overgo GC content optimization) =head1 SEE ALSO =over 4 =item BioPerl (Bio::SearchIO used to parse blast output) http://www.bioperl.org =item RepeatMasker http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker =item PipMaker http://bio.cse.psu.edu/pipmaker/ =item NCBI Blast 2 http://www.ncbi.nlm.nih.gov/BLAST/ Stand alone blast (recommended): ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ =item Ensembl http://www.ensembl.org =back =head1 COPYRIGHT PUBLIC DOMAIN NOTICE This software/database is ``United States Government Work'' under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software/database is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the National Human Genome Research Institute (NHGRI) and the U.S. Government does not and cannot warrant the performance or results that may be obtained by using this software or data. NHGRI and the U.S. Government disclaims all warranties as to performance, merchantability or fitness for any particular purpose. In any work or product derived from this material, proper attribution of the authors as the source of the software or data should be made, using: I as the citation. =cut