Information on Input and Output formats as given at: weizmann
Overview of the Input/Output formats

When you run most of these programs, a menu will appear offering you 
choices of the various options available for that
program. The data that the program reads should be in an input file 
called (in most cases) "infile". If there is no such file the
programs will ask you for the name of the input file. Below we describe 
the input file format, and then the menu. 

Input File Format 

I have tried to adhere to a rather stereotyped input and output format. 
For the parsimony, compatibility and maximum
likelihood programs, excluding the distance matrix methods, the simplest 
version of the input file looks something like this: 

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the 
number of characters, in free format, separated by blanks
(not by commas). The information for each species follows, starting with 
a ten-character species name (which can include
punctuation marks and blanks), and continuing with the characters for 
that species. In the discrete-character, DNA and
protein sequence programs the characters are each a single letter or 
digit, sometimes separated by blanks. In the
continuous-characters programs they are real numbers with decimal points, 
separated by blanks: 

Latimeria 2.03 3.457 100.2 0.0 -3.7 
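
As a rough illustration of the layout just described, here is a minimal Python
sketch that splits one continuous-characters line into its ten-character name
and its real numbers. The function name parse_continuous_line and the
name_width parameter are made up for this example; they are not part of the
package.

```python
def parse_continuous_line(line, name_width=10):
    """Split one species line into (name, list of real-valued characters)."""
    # The species name occupies the first ten columns, blanks included.
    name = line[:name_width].strip()
    # Everything after the name is a blank-separated list of real numbers.
    values = [float(tok) for tok in line[name_width:].split()]
    return name, values


name, values = parse_continuous_line("Latimeria 2.03 3.457 100.2 0.0 -3.7")
print(name, values)
```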

The conventions about continuing the data beyond one line per species are 
different between the molecular sequence
programs and the others. The molecular sequence programs can take the 
data in "aligned" or "interleaved" format, with some
lines giving the first part of each of the sequences, then lines giving 
the next part of each, and so on. Thus the sequences might
look like this: 

   6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

Note that in these sequences we have a blank every ten sites to make them 
easier to read: any such blanks are allowed. The
blank line which separates the two groups of lines (the ones containing 
sites 1-20 and ones containing sites 21-39) may or
may not be present, but if it is, it should be a line of zero length and 
not contain any extra blank characters (this is because of a
limitation of the current versions of the programs). It is important that 
the number of sites in each group be the same for all
species (i.e., it will not be possible to run the programs successfully 
if the first species line contains 20 bases, but the first line
for the second species contains 21 bases). 
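
The interleaved layout can be sketched in Python as follows. This is only an
illustration under stated assumptions, not the reading code of the programs:
the function name read_interleaved is hypothetical, and unlike the programs
the sketch simply skips separator lines even if they contain stray blanks.

```python
def read_interleaved(text, name_width=10):
    """Return (names, sequences) from interleaved sequence data."""
    # Separator lines are dropped; the sketch is more forgiving about
    # them than the programs are.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split())
    body = lines[1:]
    if not body or len(body) % n_species != 0:
        raise ValueError("data lines are not a whole number of groups")
    names = []
    seqs = [""] * n_species
    for start in range(0, len(body), n_species):
        group = body[start:start + n_species]
        if start == 0:
            # Only the first group carries the ten-character species names.
            names = [ln[:name_width].strip() for ln in group]
            group = [ln[name_width:] for ln in group]
        # Blanks inside the sequences are allowed and carry no meaning.
        pieces = ["".join(ln.split()) for ln in group]
        if len(set(len(p) for p in pieces)) != 1:
            raise ValueError("species differ in number of sites within a group")
        seqs = [s + p for s, p in zip(seqs, pieces)]
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequence length does not match the first line")
    return names, seqs


sample = """   2   13
Archaeopt CGATGCTTAC
HesperorniCGTTACTCGT

CGC
TGT
"""
print(read_interleaved(sample))
```

The check inside the loop mirrors the rule above that every species must
contribute the same number of sites to each group.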

Alternatively, an option can be selected to take the data in "sequential" 
format, with all of the data for the first species, then all
of the characters for the next species, and so on. This is also the way 
that the discrete characters programs and the gene
frequencies and quantitative characters programs want to read the data. 
They do not allow the "interleaved" format. 

In the sequential format, the character data can run on to a new line at 
any time, except within a species name or, in the continuous character and
distance matrix programs, in the middle of a real number. Thus it
is legal to have: 

Archaeopt 001100
1101

or even:

Archaeopt
0011001101

though note that the FULL ten characters of the species name MUST then be 
present: in the above case there must be a
blank after the "t". In all cases it is possible to put internal blanks 
between any of the character values, so that 

Archaeopt 0011001101 0111011100 

is allowed. 
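
The sequential rules for single-character states can be sketched in Python
like this. The function name read_sequential is hypothetical, and the sketch
works line by line, so it is somewhat more lenient than the programs: it will
accept a name line that lacks its trailing blank.

```python
def read_sequential(text, name_width=10):
    """Return {species name: character string} from sequential data."""
    lines = text.splitlines()
    n_species, n_chars = (int(tok) for tok in lines[0].split())
    pos = 1
    records = {}
    for _ in range(n_species):
        if pos >= len(lines):
            raise ValueError("file ended before all species were read")
        line = lines[pos]
        pos += 1
        # The first ten columns are the name; the rest are character states.
        name = line[:name_width].strip()
        states = "".join(line[name_width:].split())
        # Keep reading continuation lines until this species is complete.
        while len(states) < n_chars:
            if pos >= len(lines):
                raise ValueError("file ended in the middle of " + name)
            states += "".join(lines[pos].split())
            pos += 1
        if len(states) != n_chars:
            raise ValueError("too many characters for " + name)
        records[name] = states
    return records


sample = "   2   10\nArchaeopt 001100\n1101\nHesperorni\n0011001101\n"
print(read_sequential(sample))
```

Because the reader counts characters rather than lines, a missing or extra
character early in the file shifts everything after it, which is the loss of
synchronization that a later error message usually reflects.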

If you make an error in the input file, the programs will often detect 
that they have been fed an illegal character or illegal
numerical value and issue an error message such as "BAD CHARACTER 
STATE:", often printing out the bad value, and
sometimes the number of the species and character in which it occurred. 
The program will then stop shortly after. One of the things which can 
lead to a bad value is the omission of something earlier in
the file, or the insertion of something superfluous, which cause the 
reading of the file to get out of synchronization. The
program then starts reading things it didn't expect, and concludes that 
they are in error. So if you see this error message, you
may also want to look for the earlier problem that may have led to this. 

The other major variation on the input data format is the options 
information. Many options are selected using the menu, but a
few are selected by including extra information in the input file. Some 
options are described below. 

The Options Menu 

The menu is straightforward. It typically looks like this (this one is 
for DNAPARS): 

DNA parsimony algorithm, version 3.5c 

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change) 

If you want to accept the default settings (they are shown in the above 
case) you can simply type "Y" followed by a
carriage-return (Enter) character. If you want to change any of the 
options, you should type the letter shown to the left of its
entry in the menu. For example, to set a threshold type "T". Lower-case 
letters will also work. For many of the options the
program will ask for supplementary information, such as the value of the 
threshold. 

Note the "Terminal type" entry, which you will find on all menus. It 
allows you to specify which type of terminal your screen is.
The options are an IBM PC screen, an ANSI standard terminal (such as a 
DEC VT100), a DEC VT52-compatible terminal,
such as a Zenith Z29, or no terminal type. Choosing "0" toggles among 
these four options in cyclical order, changing each time
the "0" option is chosen. If one of them is right for your terminal the 
screen will be cleared before the menu is displayed. If
none works the "none" option should probably be chosen. Keep in mind that 
VT52-compatible terminals can freeze up if
they receive the screen-clearing commands for the ANSI standard terminal! 
If this is a problem it may be helpful to recompile
the program, setting the constants near its beginning so that the program 
starts up with the VT52 option set. 

The other numbered options control which information the program will 
display on your screen or on the output files. The
option to "Print indications of progress of run" will show information 
such as the names of the species as they are successively
added to the tree, and the progress of global rearrangements. You will 
usually want to see these as reassurance that the
program is running and to help you estimate how long it will take. But 
if you are running the program "in background" as can
be done on multitasking and multiuser systems such as Unix, and do not 
have the program running in its own window, you
may want to turn this option off so that it does not disturb your use of 
the computer while the program is running. 

The Output File 

Most of the programs write their output onto a file called (usually) 
"outfile", and a representation of the trees found onto a file
called "treefile". 

The exact contents of the output file vary from program to program and 
also depend on which menu options you have
selected. For many programs, if you select all possible output 
information, the output will consist of (1) the name of the
program and its version number, (2) the input information printed out, and 
(3) a series of phylogenies, some with associated
information indicating how much change there was in each character or on 
each part of the tree. A typical rooted tree looks
like this: 

                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine

The interpretation of the tree is fairly straightforward: it "grows" from 
left to right. The numbers at the forks are arbitrary and
are used (if present) merely to identify the forks. In some of the 
programs asterisks ("*") are used instead of numbers. For
many of the programs the tree produced is unrooted. It is printed out in 
nearly the same form, but with a warning message: 

remember: this is an unrooted tree! 

The warning message ("remember: ...") indicates that this is an unrooted 
tree (mathematicians still call this a tree, though some
systematists unfortunately use the term "network". This conflicts with 
standard mathematical usage, which reserves the name
"network" for a completely different kind of graph). The root of this 
tree could be anywhere, say on the line leading
immediately to Mouse. As an exercise, see if you can tell whether the 
following tree is or is not a different one from the above:

             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine

   remember: this is an unrooted tree!

(it is NOT different). It is IMPORTANT also to realize that the lengths 
of the segments of the printed tree may not be
significant: some may actually represent branches of zero length, in the 
sense that there is no evidence that the branches are
nonzero in length. Some of the diagrams of trees attempt to print 
branches approximately proportional to estimated branch
lengths, while in others the lengths are purely conventional and are 
presented just to make the topology visible. You will have
to look closely at the documentation that accompanies each program to see 
what it presents and what is known about the
lengths of the branches on the tree. The above tree attempts to represent 
branch lengths approximately in the diagram. But
even in those cases, some of the smaller branches may be artificially 
lengthened to make the tree topology clearer. Here is
what a tree from DNAPARS looks like, when no attempt is made to make the 
lengths of branches in the diagram
proportional to estimated branch lengths: 

                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine

  remember: this is an unrooted tree!

Some of the parsimony programs in the package can print out a table of 
the number of steps that different characters (or sites)
require on the tree. This table may not be obvious at first. A typical 
example looks like this: 

 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1

The numbers across the top and down the side indicate which site is being 
referred to. Thus site 23 is column "3" of row "20"
and has 1 step in this case. 
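
For readers who want to pull these numbers out of the output file, here is a
small Python sketch of the row-plus-column lookup. It assumes that every cell
after the "!" is four characters wide, which is inferred from the example
above and not guaranteed for every program; the function name
parse_steps_table is made up for the illustration.

```python
def parse_steps_table(text, cell_width=4):
    """Return {site number: steps} from a 'steps in each site' table."""
    steps = {}
    for line in text.splitlines():
        if "!" not in line:
            continue  # title, column header and ruler lines carry no data
        label, cells = line.split("!", 1)
        row_base = int(label)  # 0, 10, 20, ...
        for offset in range(0, len(cells), cell_width):
            cell = cells[offset:offset + cell_width].strip()
            if cell:  # the cell for "site 0" is left empty
                steps[row_base + offset // cell_width] = int(cell)
    return steps


table = "\n".join([
    "         0   1   2   3   4   5   6   7   8   9",
    "     *-----------------------------------------",
    "    0!       2   2   2   2   1   1   2   2   1",
    "   10!   1   2   3   1   1   1   1   1   1   2",
])
steps = parse_steps_table(table)
print(steps[12], sum(steps.values()))
```

The site number is the row label plus the column number, so summing the
values gives the total number of steps over the rows that were read.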

The Tree File 

In output from most programs, a representation of the tree is also 
written into the tree file (usually named "treefile"). The tree is
specified by the nested pairs of parentheses, enclosing names and 
separated by commas. If there are any blanks in the names,
these must be replaced by the underscore character "_". Trailing blanks 
in the name may be omitted. The pattern of the
parentheses indicates the pattern of the tree by having each pair of 
parentheses enclose all the members of a monophyletic
group. The tree file for the above tree would have its first line look 
like this: 

((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon)); 

In the above tree the first fork separates the lineage leading to Mouse 
and Bovine from the lineage leading to the rest. Within
the latter group there is a fork separating Gibbon from the rest, and so 
on. The entire tree is enclosed in an outermost pair of
parentheses. The tree ends with a semicolon. In some programs such as 
DNAML, FITCH, and CONTML, the tree will be
completely unrooted and specified by a bottommost fork with a three-way 
split, with three "monophyletic" groups separated
by two commas: 

(A,(B,(C,D)),(E,F)); 

The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single 
three-way split corresponds to one of the interior
nodes of the unrooted tree (it can be any interior node). The remaining 
forks are encountered as you move out from that first
node, and each then appears as a two-way split. You should check the 
documentation files for the particular programs you
are using to see which of these forms you can expect the user tree to 
be in. Note that many of the programs that estimate an
unrooted tree produce trees in the treefile in rooted form! This is done 
for reasons of arbitrary internal bookkeeping. The
placement of the root is arbitrary. 
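
Because the root placement is arbitrary, two different-looking lines in a
tree file can describe the same unrooted tree. The Python sketch below shows
one way to check this: it parses trees without branch lengths into nested
lists and compares the partitions of the species made by the interior
branches. All function names are hypothetical, and the second tree string in
the example is a re-rooted version written for this illustration, not output
of any program.

```python
def parse_tree(text):
    """Parse a parenthesized tree (no branch lengths) into nested lists."""
    s = "".join(text.split())
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            return children
        start = pos
        while s[pos] not in ",();":
            pos += 1
        # Underscores in the file stand for blanks in the species name.
        return s[start:pos].replace("_", " ")

    tree = node()
    if s[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return tree


def leaf_set(node):
    """All species names at or below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Partitions of the species induced by the interior branches."""
    everyone = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = everyone - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def same_unrooted(a, b):
    return splits(parse_tree(a)) == splits(parse_tree(b))


first = "((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));"
rerooted = "(Bovine,(Mouse,(Gibbon,(Orang,(Gorilla,(Chimp,Human))))));"
print(same_unrooted(first, rerooted))
```

The length of the top-level list returned by parse_tree also tells the two
forms apart: two entries for a rooted form, three for the three-way split.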

For programs estimating branch lengths, these are given in the trees in 
the tree file as real numbers following a colon, and
placed immediately after the group descended from that branch. Here is a 
typical tree with branch lengths: 

((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931); 

Note that the tree may continue to a new line at any time except in the 
middle of a name or the middle of a branch length,
although in trees written to the tree file this will only be done after a 
comma. 
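
A minimal Python sketch of reading such a tree, branch lengths included, is
given below. The function names parse_newick and tip_lengths and the
dictionary layout are assumptions made for the illustration. Since blanks in
names are written as underscores, the sketch can discard all blanks and line
breaks before parsing.

```python
def parse_newick(text):
    """Parse a tree with optional branch lengths into nested dictionaries."""
    s = "".join(text.split())  # join the lines of a tree that was wrapped
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
        start = pos
        while s[pos] not in ":,();":
            pos += 1
        name = s[start:pos].replace("_", " ")
        length = None
        if s[pos] == ":":
            # The length follows the group (or name) it belongs to.
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if s[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return root


def tip_lengths(node, out=None):
    """Collect {species name: length of the branch leading to it}."""
    out = {} if out is None else out
    if not node["children"]:
        out[node["name"]] = node["length"]
    for child in node["children"]:
        tip_lengths(child, out)
    return out


tree = parse_newick("""((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);""")
print(sorted(tip_lengths(tree)))
```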

These representations of trees are a subset of the standard adopted on 
June 24, 1986 at the annual meetings of the Society
for the Study of Evolution at a meeting (the final session in a local 
lobster restaurant) of an informal committee consisting of
Wayne Maddison (MacClade), David Swofford (PAUP), F. James Rohlf 
(NTSYS-PC), Chris Meacham (COMPROB and
plotting programs), James Archie (character coding program), William H.E. 
Day, and me. This standard is a generalization of
PHYLIP's format, itself based on a well-known representation of trees in 
terms of parenthesis patterns which has been
around for almost a century. The standard is now employed by most 
phylogeny computer programs but unfortunately has yet
to be described in a formal published description.