ALIGNMENT PARAMETERS

README for Clustal W version 1.5 April 1995


A) FILE INPUT (sequences to be aligned)

The sequences must all be in one file (or two files for a "profile
alignment") in ONE of the following formats:

    FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/MSF.

The program tries to "guess" which format is being used and whether the
sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format
is recognised by the first characters in the file. This is crude, but it
works most of the time and it is difficult to do reliably any other way.

Format            First non blank word or character in the file.
...............................................................
FASTA             >
NBRF              >P1;  or  >D1;
EMBL/SWISS        ID
GDE protein       %
GDE nucleotide    #
CLUSTAL           CLUSTAL  (blocked multiple alignments)
GCG/MSF           PILEUP   (blocked multiple alignments)

Note that the only way of spotting that a file is in MSF format is if the
word PILEUP appears at the very beginning of the file. If you produce this
format from software other than the GCG pileup program, then you will have
to insert the word PILEUP at the start of the file. Similarly, if you use
CLUSTAL format, the word CLUSTAL must appear first.

All of these formats can be used to read in AN EXISTING FULL ALIGNMENT.
With CLUSTAL format, this is just the same as the output format of this
program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences
must be the same length, INCLUDING GAPS ("-" in CLUSTAL format; "." in MSF).
With the other formats, sequences can be gapped with "-" characters. If you
read in any gaps, these are kept during any later alignments. You can use
this facility to read in an alignment in order to calculate a phylogenetic
tree OR to output the same alignment in a different format (from the output
format options menu of the multiple alignment menu), e.g. read in a GCG/MSF
format alignment and output a PHYLIP format alignment.
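
The first-word rule in the format table above can be sketched in a few
lines of Python. This is an illustration only, not the program's own code:
the function name guess_format, the returned labels and the exact
(case-sensitive) matching are assumptions made for the example.

```python
from typing import Optional

# Single-character or punctuation prefixes from the format table.
# Order matters: ">P1;" and ">D1;" must be tested before the bare ">" of FASTA.
_PREFIX_RULES = [
    (">P1;", "NBRF/PIR"),
    (">D1;", "NBRF/PIR"),
    (">", "FASTA"),
    ("%", "GDE protein"),
    ("#", "GDE nucleotide"),
]

# Formats recognised by their first whole word (assumed case-sensitive here).
_WORD_RULES = {
    "ID": "EMBL/Swiss Prot",
    "CLUSTAL": "CLUSTAL",
    "PILEUP": "GCG/MSF",
}


def guess_format(text: str) -> Optional[str]:
    """Guess the file format from the first non-blank word or character."""
    stripped = text.lstrip()
    if not stripped:
        return None  # empty or all-blank input: nothing to inspect
    for prefix, name in _PREFIX_RULES:
        if stripped.startswith(prefix):
            return name
    first_word = stripped.split(None, 1)[0]
    return _WORD_RULES.get(first_word)  # None when no rule matches
```

Because only the first word is examined, an MSF file written by other
software is not recognised in this sketch until the word PILEUP is inserted
at the start, which mirrors the note above.
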
Reading in an existing alignment is also useful when you have one reference
alignment and want to add one or more new sequences to it using the
"profile alignment" facilities.

DNA vs. PROTEIN: the program will count the number of A, C, G, T, U and N
characters. If 85% or more of the characters in a sequence are as above,
then DNA/RNA is assumed, protein otherwise.


B) FILE OUTPUT

1) The alignments.

In the multiple alignment and profile alignment menus, there is a menu item
to control the output format(s). The alignment output format can be set to
any (or all) of:

CLUSTAL    (a self explanatory blocked alignment)
NBRF/PIR   (same as input format but with "-" characters for gaps)
MSF        (the main GCG package multiple alignment format)
PHYLIP     (Joe Felsenstein's phylogeny inference package. Gaps are set
           to "-" characters. For some programs (e.g. PROTPARS/DNAPARS)
           these should be changed to "?" characters for unknown residues.)
GDE        (Used by Steven Smith's GDE package)

You can also choose between having the sequences in the same order as in
the input file or writing them out in an order that more closely matches
the order used to carry out the multiple alignment.

2) THE ALIGNMENT ALGORITHMS

The basic algorithm is the same as for Clustal V and is described in some
detail in clustalv.doc. The new modifications are described in detail in
clustalw.ms. Here we just list some notes to help answer some of the most
obvious questions.

Terminal Gaps

In the original Clustal V program, terminal gaps were penalised the same as
all other gaps. This caused some ugly side effects, e.g.

    acgtacgtacgtacgt                           acgtacgtacgtacgt
    a----cgtacgtacgt  gets the same score as   ----acgtacgtacgt

NOW, terminal gaps are free. This is better on average and stops silly
effects like single residues jumping to the edge of the alignment. However,
it is not perfect. It does mean that if there should be a gap near the end
of the alignment, the program may be reluctant to insert it, i.e.
    cccccgggccccc                                              cccccgggccccc
    ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---

In the right hand case above, the terminal gap is free and may score higher
than the left hand alignment. This can be prevented by lowering the gap
opening and extension penalties. It is difficult to get this right all the
time. Please watch the ends of your alignments.

3) Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

By default, the initial pairwise alignments are now carried out using a full
dynamic programming algorithm. This is more accurate than the older
hash/k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a
fast workstation you may not notice, but on a slow box the difference is
extreme. You can easily set the alignment method from the menus to the
older, faster method.

4) Delaying alignment of distant sequences

The user can set a cut off to delay the alignment of the most divergent
sequences in a data set until all other sequences have been aligned. By
default, this is set to 40%, which means that if a sequence is less than 40%
identical to any other sequence, its alignment will be delayed.

5) Iterative realignment/Reset gaps between alignments

By default, if you align a set of sequences a second time (e.g. with changed
gap penalties), the gaps from the first alignment are discarded. You can set
this from the menus so that older gaps will be kept between alignments.
Keeping the gaps (i.e. not resetting them) and doing the full multiple
alignment a second time can sometimes give better alignments. Sometimes the
alignment will converge on a better solution; sometimes the new alignment
will be the same as the first. There can be a strange side effect: you can
get columns of nothing but gaps introduced.

Any gaps that are read in from the input file are always kept, regardless of
the setting of this switch. If you read in a full multiple alignment, the
"reset gaps" switch has no effect.
The old gaps will remain and, if you carry out a multiple alignment, any new
gaps will be added in. If you wish to carry out a full new alignment of a
set of sequences that are already aligned in a file, you must input the
sequences without gaps.

6) Profile alignment

By profile alignment, we simply mean the alignment of old
alignments/sequences. In this context, a profile is just an existing
alignment (or even a set of unaligned sequences; see below). This allows you
to read in an old alignment (in any of the allowed input formats) and align
one or more new sequences to it. From the profile alignment menu, you are
allowed to read in 2 profiles. Either profile can be a full alignment OR a
single sequence.

In the simplest mode, you simply align the two profiles to each other. This
is useful if you want to gradually build up a full multiple alignment.

A second option is to align the sequences from the second profile, one at a
time, to the first profile. This is done taking the underlying tree between
the sequences into account. This is useful if you have a set of new
sequences (not aligned) and you wish to add them all to an older alignment.
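
As noted under "Reset gaps between alignments" above, a full new alignment
of sequences that are already aligned requires gap-free input. A minimal
Python sketch of removing the gaps beforehand follows. The names strip_gaps
and ungap_alignment are hypothetical, and the gap characters are assumed to
be the "-" and "." mentioned in section A; this is not part of Clustal W
itself.

```python
from typing import Dict

# Gap characters assumed for this sketch: "-" (most formats) and "." (MSF).
GAP_CHARS = "-."


def strip_gaps(aligned: str, gap_chars: str = GAP_CHARS) -> str:
    """Return the sequence with every gap character removed."""
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes them.
    return aligned.translate(str.maketrans("", "", gap_chars))


def ungap_alignment(alignment: Dict[str, str]) -> Dict[str, str]:
    """Remove gaps from every sequence in a name -> sequence mapping."""
    return {name: strip_gaps(seq) for name, seq in alignment.items()}
```

The ungapped sequences would then be written back out in one of the input
formats listed in section A before being read in for a fresh alignment.
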