Information on Input and Output formats as given at:
weizmann
Overview of the Input/Output formats
When you run most of these programs, a menu will appear offering you
choices of the various options available for that
program. The data that the program reads should be in an input file
called (in most cases) "infile". If there is no such file the
programs will ask you for the name of the input file. Below we describe
the input file format, and then the menu.
Input File Format
I have tried to adhere to a rather stereotyped input and output format.
For the parsimony, compatibility and maximum
likelihood programs, excluding the distance matrix methods, the simplest
version of the input file looks something like this:
6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
The first line of the input file contains the number of species and the
number of characters, in free format, separated by blanks
(not by commas). The information for each species follows, starting with
a ten-character species name (which can include
punctuation marks and blanks), and continuing with the characters for
that species. In the discrete-character, DNA and
protein sequence programs the characters are each a single letter or
digit, sometimes separated by blanks. In the
continuous-characters programs they are real numbers with decimal points,
separated by blanks:
Latimeria 2.03 3.457 100.2 0.0 -3.7
The conventions about continuing the data beyond one line per species are
different between the molecular sequence
programs and the others. The molecular sequence programs can take the
data in "aligned" or "interleaved" format, with some
lines giving the first part of each of the sequences, then lines giving
the next part of each, and so on. Thus the sequences might
look like this:
6 39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC
TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
Note that in these sequences we have a blank every ten sites to make them
easier to read: any such blanks are allowed. The
blank line which separates the two groups of lines (the ones containing
sites 1-20 and ones containing sites 21-39) may or
may not be present, but if it is, it should be a line of zero length and
not contain any extra blank characters (this is because of a
limitation of the current versions of the programs). It is important that
the number of sites in each group be the same for all
species (i.e., it will not be possible to run the programs successfully
if the first species line contains 20 bases, but the first line
for the second species contains 21 bases).
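As an illustration of the interleaved layout described above, here is a minimal Python sketch of a reader. It is not code from the package: the function name `read_interleaved` and the `EXAMPLE` constant are made up for this example. It is also more lenient than the programs themselves, since it skips any whitespace-only separator line and checks only the total number of sites per species, not the length of each group.

```python
def read_interleaved(text):
    """Parse interleaved molecular sequence data into {name: sequence}."""
    lines = text.splitlines()
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    # The separator line between groups is optional, so drop empty lines.
    body = [ln for ln in lines[1:] if ln.strip()]
    names, seqs = [], []
    for i, line in enumerate(body):
        if i < n_species:
            # First group: ten-character name, then the first stretch of sites.
            names.append(line[:10].rstrip())
            seqs.append("")
            chunk = line[10:]
        else:
            # Later groups carry no names; lines cycle through the species.
            chunk = line
        # Blanks inside the sequence are only for readability.
        seqs[i % n_species] += "".join(chunk.split())
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))


# The 6-species, 39-site example from the text.
EXAMPLE = """\
6 39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
"""

sequences = read_interleaved(EXAMPLE)
```

Slicing the first ten columns, not splitting on blanks, is what lets names such as "B. virgini" keep their internal blank.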
Alternatively, an option can be selected to take the data in "sequential"
format, with all of the data for the first species, then all
of the characters for the next species, and so on. This is also the way
that the discrete characters programs and the gene
frequencies and quantitative characters programs want to read the data.
They do not allow the "interleaved" format.
In the sequential format, the character data can run on to a new line at
any time (except in a species name or in the case of
continuous character and distance matrix programs where you cannot go to
a new line in the middle of a real number). Thus it
is legal to have:
Archaeopt 001100
1101
or even:
Archaeopt
0011001101
though note that the FULL ten characters of the species name MUST then be
present: in the above case there must be a
blank after the "t". In all cases it is possible to put internal blanks
between any of the character values, so that
Archaeopt 0011001101 0111011100
is allowed.
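The sequential rules above can be sketched in Python as follows. This is only an illustration for single-character states (discrete characters or sequences), not the programs' own reader. The name `read_sequential` and the `SAMPLE` data are hypothetical, and the sketch pads a short name line to ten columns, whereas the programs require the full ten characters to be present.

```python
def read_sequential(text):
    """Parse sequential-format data with one-character states."""
    lines = text.splitlines()
    n_species, n_chars = (int(tok) for tok in lines[0].split()[:2])
    data = {}
    pos = 1
    for _ in range(n_species):
        # The name occupies the first ten columns of the species' first line.
        # (Padding here is a leniency of this sketch, not of the programs.)
        line = lines[pos].ljust(10)
        pos += 1
        name = line[:10].rstrip()
        states = "".join(line[10:].split())
        # Characters may continue on following lines until all are read.
        while len(states) < n_chars:
            states += "".join(lines[pos].split())
            pos += 1
        if len(states) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(states)}")
        data[name] = states
    return data


# Two species, ten characters each, using both layouts shown in the text.
SAMPLE = "2 10\nArchaeopt 001100\n1101\nHesperorni\n0011001101\n"

characters = read_sequential(SAMPLE)
```

A reader for the continuous-character programs would split on blanks and convert each token to a real number; that case is not covered here.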
If you make an error in the input file, the programs will often detect
that they have been fed an illegal character or illegal
numerical value and issue an error message such as "BAD CHARACTER
STATE:", often printing out the bad value, and
sometimes the number of the species and character in which it occurred.
The program will then stop shortly after. One of the things which can
lead to a bad value is the omission of something earlier in
the file, or the insertion of something superfluous, which cause the
reading of the file to get out of synchronization. The
program then starts reading things it didn't expect, and concludes that
they are in error. So if you see this error message, you
may also want to look for the earlier problem that may have led to this.
The other major variation on the input data format is the options
information. Many options are selected using the menu, but a
few are selected by including extra information in the input file. Some options
are described below.
The Options Menu
The menu is straightforward. It typically looks like this (this one is
for DNAPARS):
DNA parsimony algorithm, version 3.5c
Setting for this run:
  U  Search for best tree?                 Yes
  J  Randomize input order of sequences?   No. Use input order
  O  Outgroup root?                        No, use as outgroup species 1
  T  Use Threshold parsimony?              No, use ordinary parsimony
  M  Analyze multiple data sets?           No
  I  Input sequences interleaved?          Yes
  0  Terminal type (IBM PC, VT52, ANSI)?   ANSI
  1  Print out the data at start of run    No
  2  Print indications of progress of run  Yes
  3  Print out tree                        Yes
  4  Print out steps in each site          No
  5  Print sequences at all nodes of tree  No
  6  Write out trees onto tree file?       Yes
Are these settings correct? (type Y or the letter for one to change)
If you want to accept the default settings (they are shown in the above
case) you can simply type "Y" followed by a
carriage-return (Enter) character. If you want to change any of the
options, you should type the letter shown to the left of its
entry in the menu. For example, to set a threshold type "T". Lower-case
letters will also work. For many of the options the
program will ask for supplementary information, such as the value of the
threshold.
Note the "Terminal type" entry, which you will find on all menus. It
allows you to specify which type of terminal your screen is.
The options are an IBM PC screen, an ANSI standard terminal (such as a
DEC VT100), a DEC VT52-compatible terminal,
such as a Zenith Z29, or no terminal type. Choosing "0" toggles among
these four options in cyclical order, changing each time
the "0" option is chosen. If one of them is right for your terminal the
screen will be cleared before the menu is displayed. If
none works the "none" option should probably be chosen. Keep in mind that
VT52-compatible terminals can freeze up if
they receive the screen-clearing commands for the ANSI standard terminal!
If this is a problem it may be helpful to recompile
the program, setting the constants near its beginning so that the program
starts up with the VT52 option set.
The other numbered options control which information the program will
display on your screen or on the output files. The
option to "Print indications of progress of run" will show information
such as the names of the species as they are successively
added to the tree, and the progress of global rearrangements. You will
usually want to see these as reassurance that the
program is running and to help you estimate how long it will take. But
if you are running the program "in background" as can
be done on multitasking and multiuser systems such as Unix, and do not
have the program running in its own window, you
may want to turn this option off so that it does not disturb your use of
the computer while the program is running.
The Output File
Most of the programs write their output onto a file called (usually)
"outfile", and a representation of the trees found onto a file
called "treefile".
The exact contents of the output file vary from program to program and
also depend on which menu options you have
selected. For many programs, if you select all possible output
information, the output will consist of (1) the name of the
program and its version number, (2) the input information printed out,
(3) a series of phylogenies, some with associated
information indicating how much change there was in each character or on
each part of the tree. A typical rooted tree looks
like this:
                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine
The interpretation of the tree is fairly straightforward: it "grows" from
left to right. The numbers at the forks are arbitrary and
are used (if present) merely to identify the forks. In some of the
programs asterisks ("*") are used instead of numbers. For
many of the programs the tree produced is unrooted. It is printed out in
nearly the same form, but with a warning message:
remember: this is an unrooted tree!
The warning message ("remember: ...") indicates that this is an unrooted
tree (mathematicians still call this a tree, though some
systematists unfortunately use the term "network". This conflicts with
standard mathematical usage, which reserves the name
"network" for a completely different kind of graph). The root of this
tree could be anywhere, say on the line leading
immediately to Mouse. As an exercise, see if you can tell whether the
following tree is or is not a different one from the above:
             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine
remember: this is an unrooted tree!
(It is NOT different.) It is IMPORTANT also to realize that the lengths
of the segments of the printed tree may not be
significant: some may actually represent branches of zero length, in the
sense that there is no evidence that the branches are
nonzero in length. Some of the diagrams of trees attempt to print
branches approximately proportional to estimated branch
lengths, while in others the lengths are purely conventional and are
presented just to make the topology visible. You will have
to look closely at the documentation that accompanies each program to see
what it presents and what is known about the
lengths of the branches on the tree. The above tree attempts to represent
branch lengths approximately in the diagram. But
even in those cases, some of the smaller branches are likely to be artificially
lengthened to make the tree topology clearer. Here is
what a tree from DNAPARS looks like, when no attempt is made to make the
lengths of branches in the diagram
proportional to estimated branch lengths:
                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine
remember: this is an unrooted tree!
Some of the parsimony programs in the package can print out a table of
the number of steps that different characters (or sites)
require on the tree. This table may not be obvious at first. A typical
example looks like this:
steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
The numbers across the top and down the side indicate which site is being
referred to. Thus site 23 is column "3" of row "20"
and has 1 step in this case.
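The lookup rule can be written as a small piece of arithmetic. The Python sketch below is illustrative only: `locate_site`, `steps_at` and the hand-transcribed `STEPS` table are names invented for this example, and the empty cell for the nonexistent site 0 is stored as None.

```python
def locate_site(site):
    """Return (row label, column heading) for a site number in the table."""
    tens, column = divmod(site, 10)
    return tens * 10, column


# The example table, transcribed by hand. Sites are numbered from 1, so the
# cell at row 0, column 0 is empty in the printout (None here).
STEPS = {
    0: [None, 2, 2, 2, 2, 1, 1, 2, 2, 1],
    10: [1, 2, 3, 1, 1, 1, 1, 1, 1, 2],
    20: [1, 2, 2, 1, 2, 2, 1, 1, 1, 2],
    30: [1, 2, 1, 1, 1, 2, 1, 3, 1, 1],
    40: [1],
}


def steps_at(site):
    """Look up the number of steps recorded for one site."""
    row, column = locate_site(site)
    return STEPS[row][column]
```

Calling `locate_site(23)` gives row 20 and column 3, which is the cell the text refers to.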
The Tree File
In output from most programs, a representation of the tree is also
written into the tree file (usually named "treefile"). The tree is
specified by the nested pairs of parentheses, enclosing names and
separated by commas. If there are any blanks in the names,
these must be replaced by the underscore character "_". Trailing blanks
in the name may be omitted. The pattern of the
parentheses indicates the pattern of the tree by having each pair of
parentheses enclose all the members of a monophyletic
group. The tree file for the above tree would have its first line look
like this:
((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));
In the above tree the first fork separates the lineage leading to Mouse
and Bovine from the lineage leading to the rest. Within
the latter group there is a fork separating Gibbon from the rest, and so
on. The entire tree is enclosed in an outermost pair of
parentheses. The tree ends with a semicolon. In some programs such as
DNAML, FITCH, and CONTML, the tree will be
completely unrooted and specified by a bottommost fork with a three-way
split, with three "monophyletic" groups separated
by two commas:
(A,(B,(C,D)),(E,F));
The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single
three-way split corresponds to one of the interior
nodes of the unrooted tree (it can be any interior node). The remaining
forks are encountered as you move out from that first
node, and each then appears as a two-way split. You should check the
documentation files for the particular programs you
are using to see in which of these forms you can expect the user tree to
be. Note that many of the programs that estimate an
unrooted tree produce trees in the treefile in rooted form! This is done
for reasons of arbitrary internal bookkeeping. The
placement of the root is arbitrary.
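To make the parenthesis notation concrete, here is a minimal Python sketch that turns a tree without branch lengths into nested tuples and reports how many groups meet at the bottommost fork (2 for a rooted two-way split, 3 for the unrooted three-way form). The function name `parse_tree` is hypothetical; this is not how the programs themselves read trees, and it assumes well-formed input.

```python
def parse_tree(text):
    """Parse a parenthesized tree (no branch lengths) into nested tuples."""
    s = "".join(text.split())  # ignore blanks and line breaks
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise this is a species name; underscores stand for blanks.
        start = pos
        while s[pos] not in "(),;":
            pos += 1
        return s[start:pos].replace("_", " ")

    tree = node()
    if s[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


rooted = parse_tree("((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));")
unrooted = parse_tree("(A,(B,(C,D)),(E,F));")
root_fork_sizes = (len(rooted), len(unrooted))
```

Each tuple corresponds to one pair of parentheses, so every tuple holds exactly the members of one "monophyletic" group.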
For programs estimating branch lengths, these are given in the trees in
the tree file as real numbers following a colon, and
placed immediately after the group descended from that branch. Here is a
typical tree with branch lengths:
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
Note that the tree may continue to a new line at any time except in the
middle of a name or the middle of a branch length,
although in trees written to the tree file this will only be done after a
comma.
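As a sketch of how the branch lengths attach to groups, the following Python code parses a tree with lengths and sums the lengths from the bottommost fork out to each tip. The names `parse_newick`, `leaf_depths` and `TREE` are invented for this illustration, the dictionary layout is an arbitrary choice, and well-formed input is assumed.

```python
def parse_newick(text):
    """Parse a tree with optional ':length' suffixes into nested dicts."""
    s = "".join(text.split())  # the tree may be broken across lines
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        # Optional name (empty for interior forks in this format).
        start = pos
        while s[pos] not in ":,();":
            pos += 1
        name = s[start:pos].replace("_", " ")
        # Optional branch length: a real number following a colon.
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if s[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_depths(tree, depth=0.0, out=None):
    """Sum branch lengths from the bottommost fork to every tip."""
    if out is None:
        out = {}
    depth += tree["length"] or 0.0
    if not tree["children"]:
        out[tree["name"]] = depth
    for child in tree["children"]:
        leaf_depths(child, depth, out)
    return out


# The example tree from the text, split across lines after commas.
TREE = """((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);"""

depths = leaf_depths(parse_newick(TREE))
```

Because the length follows the closing parenthesis of its group, the parser reads the children first and the length last.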
These representations of trees are a subset of the standard adopted on
June 24, 1986 at the annual meetings of the Society
for the Study of Evolution at a meeting (the final session in a local
lobster restaurant) of an informal committee consisting of
Wayne Maddison (MacClade), David Swofford (PAUP), F.James Rohlf
(NTSYS-PC), Chris Meacham (COMPROB and
plotting programs), James Archie (character coding program), William H.E.
Day, and me. This standard is a generalization of
PHYLIP's format, itself based on a well-known representation of trees in
terms of parenthesis patterns which has been
around for almost a century. The standard is now employed by most
phylogeny computer programs but unfortunately has yet
to be described in a formal published description.