GeneJockey Brief instructions
_______________________________________________________

GeneJockey

A program for editing, manipulating and analysing nucleic acid and protein sequences

DEMO VERSION. Please note that the demo version will not save or print sequences (it can, however, print this 'Read Me' window). The global clipboard is disabled, so data cannot be transferred into or out of the program. The demo version can open only GeneJockey files and Genbank files (NOT text files, DNA Inspector files or IBI Pustell files). The communications routines are also disabled. None of these restrictions apply to the full version.

Introduction

GeneJockey is a program for the Apple Macintosh which makes full use of the machine's WIMP interface. Nucleic acid and protein sequences can be entered from the keyboard or imported as files in other formats. Currently the program can read or write nucleic acid files in the format of Textco's DNA Inspector and IBI's Pustell programs; text files corresponding to other formats can be read, and Genbank files in Macintosh format can be opened. The program displays both sequence data and annotations in the same window, and permits multiple windows to be open (up to 30) so that sequences can be assembled from several sources by standard Macintosh editing techniques. Sequences entered by hand may be verified by re-typing or by having the machine read the sequence back, using sampled sounds rather than speech synthesis for clarity.

Standard procedures for manipulating sequences are provided: both types of sequence may be reversed, and nucleic acid sequences may be complemented, inverted or translated to protein. In each case a new window is opened to contain the modified sequence, the original remaining unchanged.

Basic database searching facilities are provided, and you may search your own GeneJockey files or Genbank files or both at once.
You may search the comments section of the files with keywords entered at the keyboard, or search the sequence sections using the sequence currently displayed in the front window. In either case, the program makes a crude estimate of the time required to complete the search before starting.

Standard search facilities are provided for searching a single sequence with a short sub-sequence, and the availability of wild card characters allows consensus sequences to be searched for.

Sequence analysis commands include:

Align: Simple (display the sequences in the two top windows aligned so that the longest possible single match is obtained) or Exhaustive (stretch the sequences where necessary to display as much homology as possible between them). In either case the displayed sequences remain fully editable, enabling sequence fragments read from sequencing gels to be adjusted to obtain the best contig in the simplest possible manner.

Matrix: Dot-plot matrix between two sequences.

Reading Frames: Display all open reading frames longer than a threshold length, optionally starting only at a start codon and reading both strands of the nucleic acid. Selected areas of open reading frame can then be translated to protein.

Hydropathicity Plot: The hydropathicity profile of protein sequences may be displayed using either Kyte and Doolittle or Hopp and Woods methods.

Sequence Information: Displays base counts, molecular weights etc. for both a selected section of sequence and for the entire sequence.

Restriction Analysis: The program contains recognition sequences for about 400 restriction enzymes (you can change these and add more if necessary). DNA sequences may be analysed using a preselected list of restriction enzymes drawn from these, and the results displayed in text form either as individual digests, as a multiple digest, or as a graphic map. The program can automatically select those enzymes which can cut (or not cut) within a selected section of the sequence.

Requirements

The program requires a minimum of a Macintosh Plus. In order to make use of the Genbank sequence database a hard disk is required, and since this database currently occupies about 30 Mbyte a large disk is necessary if you want to have the whole of Genbank available. If the program is to be used for extensive database searching a faster machine is recommended to reduce waiting time.

Using the program

General Points

Windows. GeneJockey uses several different types of window. The basic type is the sequence window. You may have many sequence windows open at once, but any operations which you carry out will only affect the top (active) window. All operations other than editing and tidying cause a new window to be opened in which the results of the operation are displayed. The new window may be another sequence window, or a window in which text or graphics may be displayed. Each window is a separate entity, and its contents may be printed or saved to disk.

Options. Many of the commands which you issue by means of the pull-down menus have optional parameters and start with a parameter dialog in which you can set these. The default values for these parameters will perform adequately in most cases. If you find it necessary to change the defaults every time you use the command, you can use the 'Set Default' checkbox to make the changes permanent. If you never change the default values you can turn the parameter dialogs off (and on again) with the 'Always Show Parameter Dialogs' command in the Edit menu.

Selection Links. Several of the analysis windows opened by GeneJockey feature 'selection links' back to the original sequence. This permits you to identify the part of the sequence corresponding to a feature in the analysis window. For example, when you create a hydropathicity plot of a protein showing an interesting hydrophobic region, you can select it by dragging the cursor across it.
If you then switch back to the original sequence you will find the corresponding area of the protein is selected. Similarly, open reading frames can be traced from an ORF window back to the DNA sequence, and areas selected within a Dot Matrix will be identified in both the original sequences. Selection links persist only for the life of the original windows: if you save and re-open these windows the selection links will no longer exist.

Sequence Editor

Sequence windows contain two text boxes, each with its own scroll bar. The upper box is used for comments, the lower one for the sequence itself. Only one of these boxes is active at any time (indicated by the presence of the insertion point). All standard Macintosh editing features are supported, including Cut, Copy, Paste, Clear and Undo. When typing in a sequence, the program automatically inserts a space every ten residues, and a return every fifty. Comments are simply plain text in any format you wish to use.

There are two types of sequence window, for nucleic acid or protein sequences, but they operate in an identical manner, the only visible difference being the 'First nt :' or 'First AA' prompt at centre left. The 'New' command on the File menu is hierarchical, offering you the choice of a DNA, Protein or plain text window (like the window in which this text is displayed).

All nucleic acid sequences should be entered as DNA (when entering RNA sequences you should substitute T for U). You may use the IUPAC codes for indeterminate residues (use the 'Show Wildcards' command from the Edit menu, or type any illegal character to display them). Protein sequences should be entered in single-letter code.

When active, both types of sequence window continuously display the position of the cursor at centre left. This makes it very easy to locate a particular residue by number simply by mousing around.

Three buttons, located between the text boxes, permit common operations to be performed with a single mouse click.
The left-hand button permits the sequence to be identified as either linear or circular. The button toggles between the two settings, and the label on it indicates the current setting.

The centre button permits you to set the origin of the sequence (i.e. the way in which it is numbered). For linear sequences the origin is the number assigned to the first residue, which by convention may not be zero. GeneJockey permits the origin to be set anywhere between -32768 and +32767, excluding zero. For circular sequences the program displays residue number one at top left, and if you change the origin, the residue which you select by typing in its number becomes the origin and the whole sequence is rotated, re-numbered and re-displayed. Take care when using this command, as there is at present no Undo for it.

The right-hand button, labelled 'Tidy up', re-formats the sequence, inserting spaces and carriage returns in the right places. If you simply type in a sequence in order, the program will format it correctly as you go along. However, when inserting or deleting residues in the middle of an existing sequence it becomes irritating to have to wait for the program to sort out the format after every key stroke or editing command, so the program will permit a little temporary disorder here. When you have finished the current operation use the 'Tidy up' button to restore regularity.

Sequences entered from the keyboard may be checked by one of two methods. The command 'Verify by Typing' (Edit menu) allows you to re-type the sequence, checking that the sequence is the same the second time round. To use it, first set the insertion point at the beginning of the sequence (or before that part of the sequence which you wish to check), then issue the command. The first residue of the sequence will be highlighted. Now re-type the sequence. As you type each character, the next character of the sequence becomes highlighted.
If you type a character which is different from that which is currently highlighted, the machine gives an audible warning, and the highlighting does not move on. To escape from this mode and edit the wrong character which you have found, hold down the Command key and type a full stop (.). Re-issue the 'Verify by Typing' command (or use the Command-T equivalent) to re-start the verification process.

Another way to verify sequences is to have the program read the sequence back to you. The command 'Verify by speaking' causes the program to read the sequence starting at the current insertion point, and stopping after every block of ten residues. To continue to the next block, type any character. You may move quickly around the sequence by means of the arrow keys; right and left arrows move forward or back by one block, up and down arrows move up or down a line respectively. As usual Command-full stop aborts the process. You may also select 'Speak on entry' from the Edit menu. When this option is selected the program will speak the letter whenever you press a key, allowing an audible check when entering sequences.

Sequences stored on disk are accessed by means of the normal Macintosh 'Open' command. The Open dialog shows GeneJockey files, TEXT files, files written by 'DNA Inspector' or by IBI Pustell programs and GenBank Index files. Nucleic acid sequence files in any of these formats (except Genbank) open in a straightforward way. In the case of simple TEXT files the program will open a text window, and it's up to you to transfer the sequence and comments into the appropriate sequence window by means of Copy and Paste. In the case of Genbank files, these are large files containing too many sequences to display in a single list, so to open them you should first open the section index of the section which contains the sequence you want (don't use the general index; it won't work). This leads to a new display showing sequence loci (i.e.
their names) in a scrolling alphabetical list on the left, with a pop-up menu next to the word 'Section' above. If the sequence you require is not in the list, look through the sections available on the pop-up menu to obtain the section which you need. Use 'Open' (or double click in the usual way) to open the sequence that you need. GeneJockey treats Genbank files as read-only, so you cannot save any changes you make, unless you use 'Save as...' to save the file as a GeneJockey file under a new name.

Files may be saved in any of the above formats (except the compressed format in which Genbank files come). When saving, the save dialog offers several options, selected by buttons at the bottom of the dialog. You may save files in GeneJockey format (any GeneJockey window), as DNA Inspector files (nucleotide sequence windows only), as Genbank/IBI files (nucleotide or protein sequence windows only), or as plain TEXT files (nucleotide, protein sequence or text windows only).

Searching and Finding

The Find menu allows a sequence to be searched for a pattern of bases typed in at the keyboard, and a collection of GeneJockey and Genbank files to be searched for matches with a sequence or for keywords in the comments section.

The Find > In Sequence command leads to a dialog in which you can enter a short section of sequence to search for. The sequence to be searched for is entered in the text box at the top. When searching DNA sequences you may wish to use the IUPAC codes which act as 'wild cards', giving a match with more than one base. The program will search the sequence of the front window starting at the insertion point until it finds a match, then return control to you with the matching section of sequence highlighted. Once you have used the Find command the program remembers the sequence which you typed in, and you can search again for subsequent matches by means of the Find same command.
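
The wildcard search described above can be illustrated with a short Python sketch. It is not GeneJockey's own code: the function names are hypothetical, and it assumes that two codes match whenever the sets of bases they stand for overlap. The code table itself is the standard IUPAC nucleotide set.

```python
# Bases represented by each IUPAC nucleotide code (standard definitions).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codes_match(a, b):
    """Assumed rule: two codes match if their base sets share a base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def find_in_sequence(sequence, pattern, start=0):
    """Return the 0-based index of the first match at or after `start`.

    `start` plays the role of the insertion point; -1 means no match
    was found before the end of the sequence.
    """
    sequence, pattern = sequence.upper(), pattern.upper()
    for i in range(start, len(sequence) - len(pattern) + 1):
        if all(codes_match(s, p) for s, p in zip(sequence[i:], pattern)):
            return i
    return -1


sequence = "ATTGGGCCAGGACCT"
first = find_in_sequence(sequence, "GGNCC")              # first match
again = find_in_sequence(sequence, "GGNCC", first + 1)   # like 'Find same'
print(first, again)
```

Calling the function again with `start` set just past the previous hit mimics repeating the search for subsequent matches.
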
If the search reaches the end of the sequence without finding a match the program beeps. If you do not expect the search sequence to match exactly, you can specify the maximum number of mismatches, and you can also use the 'Find Mismatches' button to determine the minimum number of mismatches necessary to find a match.

The Search Database commands cause a number of files to be searched consecutively. The files may be GeneJockey sequence files of either type and/or Genbank files in compressed Macintosh format. The extent of the search is determined by the way in which the files are laid out in their folders. You start a search by choosing a starter file, which may be any of the types of file which can be searched. The program will search that file, any other file contained within the same folder, and any file contained in any folder within the same folder. Thus, provided you lay out your database in a hierarchical fashion you can choose to limit your search to a large or small area of interest.

The Search Database command leads to a sub-menu with two options, Search Comments and Search Sequences. These are used in an identical manner, except that Search Comments asks you to type in a keyword or phrase to search for. Search Comments will only find exact matches, so keep it short. Search Sequences searches for matches with the sequence displayed in the top window.

After you choose your starter file, the program takes a few seconds to compile a list of files to be searched, then makes a rough estimate of the time necessary for the search. At this point you will be asked whether you wish to proceed, or to abort the search. If you go ahead with the search, the program will display each match it finds in a text report window.
When searching sequences the program reports the name of the file found, the maximum contiguous length of match found, and the calculated probability of that length of matching sequence happening by chance, given two random sequences of the same length as the target and found sequences. If the match is found on the inverse strand the file name is marked by the addition of ".inv". When searching comments, the program reports the name of the file found and displays a line of text giving the context in which the target keyword was found. The keyword itself is abbreviated to its first letter followed by a hyphen.

The Find PCR Primers command searches the nucleotide sequence in the front window for potential PCR primers using a modification of the method of Lowe et al. (N.A.R. 18, 1757-1761, 1990). Note that antisense primers are shown already inverted: set up your synthesiser to make them exactly as they appear.

Manipulating Sequences

The Modify menu contains commands which produce derivative sequences in standard ways. All open a new sequence window to hold the resulting sequence, leaving the original sequence unaltered, and all except the Generate Random Sequence command operate on the top (active) window only.

The Reverse command produces a sequence which is reversed in order, and is the only command on this menu which will operate on both nucleotide and protein sequences (e.g. ATTGGGCC reversed is CCGGGTTA).

The Complement command produces a sequence which is complementary to the original sequence (e.g. ATTGGGCC complemented is TAACCCGG).

The Invert command both reverses and complements the sequence, generating the sequence of the strand which is biologically complementary to the original (e.g. ATTGGGCC inverted is GGCCCAAT).
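
The three operations can be sketched in a few lines of Python. This is an illustration only, not the program's implementation: the function names are hypothetical, and the sketch handles the four plain bases only (IUPAC wildcard codes are left out for brevity).

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse(seq):
    """Same residues in the opposite order (valid for DNA or protein)."""
    return seq[::-1]


def complement(seq):
    """Replace each base by its partner, keeping the order."""
    return seq.translate(COMPLEMENT_TABLE)


def invert(seq):
    """Reverse and complement: the opposite strand read 5' to 3'."""
    return complement(reverse(seq))


print(reverse("ATTGGGCC"))     # CCGGGTTA
print(complement("ATTGGGCC"))  # TAACCCGG
print(invert("ATTGGGCC"))      # GGCCCAAT
```

Each function returns a new string and leaves its argument untouched, which mirrors the way each Modify command opens a new window and keeps the original sequence unchanged.
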

The Translate command translates the DNA sequence in the top window to protein (or vice versa, although because of the degenerate nature of the genetic code DNA translated from protein will be full of degenerate codes, and therefore not very useful). If you first perform an open reading frame analysis (see below) the selected reading frame will be translated up to the stop codon if present. If not, the sequence will simply be translated in reading frame 1 starting at the first base and ignoring stop codons. Stop codons are represented as a bullet mark (•).

Generate Random Sequence produces a random nucleotide sequence containing roughly equal numbers of the four bases.

The Format Sequence command leads to a sub-menu which contains a selection of possible formats in which you may wish to place sequences for transfer to other programs or for printing. All of these place their results in a new text window. There are too many of these to describe in this file, but of particular interest is the Interleaved format. This is a format for publication purposes which shows the DNA and corresponding protein sequences on alternate lines. To use it you must first perform an Open Reading Frames analysis (see below), then select the appropriate ORF by clicking on it before issuing the Format > Interleaved command.

Analysing sequences

The Analyse menu contains a number of commands which operate on the sequence in the top window (or in the case of the Align and Matrix commands in the top two windows). Each causes a new window to open of a type appropriate to the type of information being displayed.

The Align command operates on the sequences in the top two windows, which must both be of the same type. The command leads to a sub-menu which permits Simple or Gapped alignment. Simple alignment causes the sequences to be displayed such that the single longest sequence of matching residues is brought into alignment.
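
As a rough illustration of the Simple alignment idea, the Python sketch below finds the longest run of identical residues shared by two sequences and pads one of them with leading spaces so that the run lines up. The function names are hypothetical and the method (a standard longest-common-substring table) is an assumption; the program's actual algorithm is not described here.

```python
def longest_match(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of residues.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def simple_align(a, b):
    """Pad one sequence with leading spaces so the longest run lines up."""
    _, start_a, start_b = longest_match(a, b)
    shift = start_a - start_b
    if shift >= 0:
        return a, " " * shift + b
    return " " * -shift + a, b


top, bottom = simple_align("GGATTACAGG", "TTACA")
print(top)
print(bottom)
```

Shifting by leading spaces matches the manual adjustment technique of inserting or deleting spaces at the left of a sequence; no gaps are introduced inside either sequence.
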
Gapped alignment causes hyphens to be inserted into either sequence to represent deletions so that the maximum number of residues is brought into alignment.

Both align commands cause a new window to open to display the results. The aligned sequences are displayed in a horizontal scrolling format, which initially displays the part of the sequences where the best alignment is found. Alignment is signalled by a series of symbols between the two sequences. Where the residues are identical a bullet mark (•) is placed. Where the alignment is displaced by one position a slash (/ or \) is used. Where a potential match between degenerate codes occurs this is represented by a vertical line (|). Both sequences remain fully editable, and the alignment symbols are updated after every change, making it very easy to adjust alignment manually.

Each sequence has its own scroll bar, the scroll bars being initially linked so that the two sequences scroll together. The scroll bars can be unlinked by means of the checkbox at top left, permitting alternative alignments to be explored manually. You should, however, be cautious when using this as it is very easy to lose the alignment altogether. When manually adjusting alignments it is usually safer to move the sequences one place at a time by inserting or deleting spaces from the left of the shorter sequence. You may copy sections of sequence from other sequence windows and paste them into alignment windows.

A button labelled 'Make Contig' above the aligned sequences causes a new sequence window to appear containing a consensus of the two sequences. Where a mismatch occurs between the two sequences the misaligned residues are represented by the appropriate degenerate code (nucleic acids) or X (protein sequences). Where there is no corresponding residue on the opposite sequence (represented by a space character, as in the non-overlapping ends of sequences) the un-opposed residue is carried through to the contig.
Before displaying the results the program displays a dialog box which allows you to update the original sequences to match the contig.

When the alignment has been made the program calculates the probability of the longest simple alignment occurring between two random sequences of the same length as the aligned sequences, displaying the result to the right of the 'Make Contig' button. You should note that in most cases this will represent an over-estimate of the probability (i.e. the result will be more significant than is suggested) because the sequences will include more than one area of alignment. In addition, when editing the aligned sequences the probability measurement is not updated to reflect improved alignment.

As with the sequence windows, the program continuously displays the cursor position, in this case when the cursor is placed over either of the aligned sequences. You should note that in calculating the cursor position the program counts all characters other than the spaces at the left of the sequence, so if hyphens are introduced to represent deletions, these also will be included in the numbering.

When the contents of an alignment window are printed, the entire sequences are printed out, using as many lines as necessary. Both sequences are numbered, and here the hyphens are not counted in the numbering system, so the numbering is correct in terms of the original sequences. Perfect alignment between residues is again represented by a bullet mark, but near-alignment slashes as used in the screen display are not used.

Parameters for the align command allow you to set the minimum length of continuously matching sequence which the program will recognise, and the maximum number of mismatches in this length. Note that the align commands will run more slowly if you set max. mismatches > 0, or allow matches between wildcard codes.

The Matrix command constructs a dot-plot matrix showing homology between the two sequences.
A multiple filter function is used which displays alignments in different shades of grey, depending on the quality of match. Except for very small sequences, you will not be able to see this until you zoom in on a small area of the plot. Drag the mouse across a rectangle to enable the Zoom button. Double click on any interesting alignment to show the sequences aligned in an alignment window.

The Reading Frames command operates only on nucleotide sequences. The sequence in the top window is analysed for the presence of open reading frames longer than a threshold length (the threshold is an optional parameter, default = 25 amino acids). Open reading frames are displayed in a new window as arrows, the orientation of the arrow indicating which strand was being read. The program analyses all six possible reading frames and you may (optionally) restrict the analysis to those RFs which start with a start codon (e.g. ATG). Once the reading frame analysis has been performed, you may select any of the displayed reading frames for translation by clicking on it. Issuing the Translate command will then translate this reading frame (the usual Macintosh shortcut for select / menu command applies here: simply double click on the arrow to translate it). You can also use the Format > Interleaved command (see above).

The Hydro plot... command applies only to protein sequences, and the sub-menu gives a choice of two methods of producing a hydropathicity plot: Kyte & Doolittle or Hopp & Woods. In either case the hydropathicity is calculated as a moving average over a number of amino acids which may be set by means of the Option key (default = 6). The resulting plot is displayed in a new window. The graphic plot window may be re-sized. This may be used to line up two or more graphic plot windows above one another for comparison.
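
The moving-average calculation can be sketched in Python as follows, using the published Kyte & Doolittle scale and the default window of 6. The function name is hypothetical, and the way each averaged value is assigned to a plot position (here, one value per window start) is an assumption rather than a description of GeneJockey's plot.

```python
# Kyte & Doolittle (1982) hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=6):
    """Average the scale values over every run of `window` residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("IIIIIIRRRRRR")
print(len(profile))  # one value per window position: 12 - 6 + 1 = 7
print(profile)
```

A larger window gives a smoother profile at the cost of blurring short hydrophobic or hydrophilic stretches.
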
It should be noticed that the two methods adopt different conventions with respect to the orientation of the display: with K&D upward deflections represent hydrophobic sections of the protein while with H&W upward deflections represent hydrophilic sections.

The Restriction command leads to a sub-menu with three commands, Edit Enzymes..., Edit Short List... and Digest....

The Edit Enzymes... command is always available, and leads to a dialog which you may use to edit the full list of enzymes and their restriction sites. The scrolling list contains the names of over 400 enzymes which the program knows. You may select any one of these for editing by clicking on it to select, then opening with the Open button (or just double click for the usual shortcut). You may edit the name and recognition sequence of the selected enzyme in the boxes at the right. The recognition sequence is entered for the 5'>3' strand only, with the cut position indicated by a vertical bar (|), and a numerical offset specifying the position of the cut on the other strand relative to this position. So, for example, the enzyme Apa1, whose recognition site is:

    G G G C C|C
    C|C C G G G

is entered as:

    Recognition site: GGGCC|C
    Offset: -4

If you do not specify a cut position the program will assume it is to the left of the recognition sequence as given. You may use all the IUPAC codes for indeterminate bases. If you can't remember them the 'Show Wildcards' button gives a dialog with buttons for each code, and will insert the appropriate code at the current insertion point if you exit by means of that button.

When you have finished editing an enzyme, click on the 'Save' button to make the changes permanent before exiting or selecting a new enzyme to edit. You can add a new enzyme simply by changing the name (and recognition sequence) of an existing one: the old enzyme will not be overwritten; instead the new one will be added at the end of the list.
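
To make the recognition site and offset notation concrete, here is a small Python sketch that reads an entry such as GGGCC|C with offset -4 and reports where each strand would be cut. The function names and the returned coordinate convention (number of residues to the left of the cut) are assumptions made for illustration; the sketch also searches the given strand only, which is sufficient for palindromic sites such as this one.

```python
import re

# Regular-expression character class for each IUPAC nucleotide code.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def parse_site(site):
    """Split 'GGGCC|C' into ('GGGCCC', 5).

    With no '|' the cut is taken to be at the left of the site (0).
    """
    cut = site.find("|")
    if cut < 0:
        return site, 0
    return site.replace("|", ""), cut


def find_cuts(sequence, site, offset=0):
    """Return (top_cut, bottom_cut) for every occurrence of the site.

    Each value is the number of residues lying to the left of the cut.
    """
    recognition, cut = parse_site(site.upper())
    body = "".join(IUPAC_CLASSES[code] for code in recognition)
    # A lookahead is used so that overlapping sites are all reported.
    regex = re.compile("(?=" + body + ")")
    return [
        (m.start() + cut, m.start() + cut + offset)
        for m in regex.finditer(sequence.upper())
    ]


print(find_cuts("AAGGGCCCTT", "GGGCC|C", -4))
```

For the example sequence the top strand is cut as AAGGGCC|CTT and the other strand four positions to the left, which leaves the four-base overhang implied by the offset of -4.
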

Edit Short List... leads to a dialog in which you can configure the short enzyme list to suit yourself. The short list is a convenience feature to save you having to scroll all through the main list every time you wish to select a common enzyme.

The Digest... command is available only when the top window contains a nucleotide sequence. Use the parameter dialog to set up the list of enzymes that you want to use in your digestion. The left hand scrolling list contains the names of the available enzymes (either the short or full lists can be displayed). From these you should select a list of those which you wish to use. Select each enzyme then click on Add to list >, or simply double click on the name. This will then be transferred to the right hand list. You may use the Delete < or Delete All << buttons to remove enzymes from the working list.

If you simply want to know which enzymes cut (or do not cut) the test sequence, or if you wish to locate enzymes which cut (or do not cut) in a specific part of the sequence, use Find enzymes... which leads to a further dialog allowing you to select one of these options. Enzymes which satisfy the criteria you have set will be automatically added to the working list.

You may opt to have the cut positions and fragment sizes measured from the cut site or from the start of the recognition sequence. You may choose to have the program analyse the results of digestion with each enzyme separately, or treat the whole list as one multiple digestion, and you may specify output either as a text list or graphic map. When you have finished constructing your working list of enzymes, click on Proceed with analysis and the program will open a new window in which the results of the analysis will be displayed. If you wish to apply the same working list of enzymes to a new sequence you do not have to reconstruct it, as the list is preserved between calls to Digest....
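
The fragment sizes reported by a digest follow directly from the cut positions and from whether the sequence is linear or circular. The Python sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, and cut positions are given as the number of residues to the left of each cut. For a multiple digest, the cut positions of all the enzymes would simply be pooled into one list.

```python
def fragment_sizes(length, cuts, circular=False):
    """Return fragment lengths for a sequence of `length` residues.

    `cuts` holds cut positions (residues to the left of each cut).
    """
    cuts = sorted(set(cuts))
    if not cuts:
        # No sites: the molecule stays in one piece.
        return [length]
    if circular:
        # The piece spanning the origin joins the last and first cuts.
        wrap = length - cuts[-1] + cuts[0]
        return [b - a for a, b in zip(cuts, cuts[1:])] + [wrap]
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


print(fragment_sizes(1000, [200, 700]))                 # linear
print(fragment_sizes(1000, [200, 700], circular=True))  # circular
```

Two cuts give three fragments from a linear sequence but only two from a circular one, because the piece that spans the origin remains joined.
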

In restriction maps you may move the map around the screen (look for the hand cursor) and re-arrange the individual sites to suit your taste by clicking and dragging. When a site is selected, you can delete it with the backspace key, or double-click on it to get a display showing the actual double-stranded sequence around the site with the cut positions marked to show what kind of ends the enzyme leaves.

Communications

GeneJockey contains built-in communications routines which you may use to go on-line to your favourite mainframe database, or transfer sequences to and from other computers via a modem or null modem lead. All of GeneJockey's facilities remain available during a communications session, and you can send sequences simply by pasting them into the communications window. Communications routines are disabled in the demonstration version of the program.

GeneJockey comes with a 150-page manual giving much more detail than this file can contain, including a large tutorial section with many interesting examples of real-life sequence manipulation and analysis.