This web page has links to the software used to create the data sets for our paper on the evolution of duplicate genes (see http://www.csi.uoregon.edu/projects/genetics/duplications).
A package called ntdiffs is a collection of C++ applications and their associated class library. The main piece of software is an application named ntalign, which aligns two nucleotide sequences, counts the number of differences between the sequences, and uses those counts to estimate the number of substitutions that have occurred in the evolution of the sequences from a common ancestor.
Also available is a set of Perl scripts that manage data files used as inputs to or generated as output from ntalign.
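To make the counting step that ntalign performs concrete, here is a minimal Python sketch that counts mismatches between two already-aligned nucleotide sequences and converts the count into an estimated number of substitutions per site. It is not ntalign's code: the function names are hypothetical, and the Jukes-Cantor correction is used only as a simple, well-known stand-in, since the estimators ntalign actually applies are described in its own documentation rather than on this page.

```python
import math


def count_differences(seq_a, seq_b):
    """Return (sites compared, mismatches) for two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    sites = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        sites += 1
        if a != b:
            diffs += 1
    return sites, diffs


def jukes_cantor(sites, diffs):
    """Jukes-Cantor estimate of substitutions per site (stand-in method)."""
    if sites == 0:
        return float("nan")
    p = diffs / sites
    if p >= 0.75:
        # The correction is undefined once the sequences look saturated.
        return float("nan")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    sites, diffs = count_differences("ACGT-A", "ACGA-A")
    print(sites, diffs, jukes_cantor(sites, diffs))
```

The corrected estimate is always at least as large as the raw proportion of differing sites, because some sites may have changed more than once since the common ancestor.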
| Requirements | Copyrights |
| The ntdiffs Package | Perl Scripts |
| Help and Documentation | Suggestions for Using the Programs |
| Sample Data | Status |
The C++ programs in the ntdiffs package were written using the Standard Template Library (STL), which is now part of the C++ Standard Library. If you are going to compile the programs, you need a compiler that supports templates and a locally installed copy of the STL. The STL website at SGI (www.sgi.com/Technology/STL) has user documentation and copies for downloading.
The Perl scripts were written in Perl5. Visit www.perl.org for information about the latest version of Perl.
Although not strictly necessary for running ntalign or most of the Perl scripts described below, the SEALS package (System for Easy Analysis of Lots of Sequences) by Roland Walker at NCBI is extremely useful for building data files and running BLAST. Documentation and source code are available from www.ncbi.nlm.nih.gov/Walker/SEALS.
If you want to run BLAST locally you can find a version for your system at the NCBI FTP Site.
We used the PAML package (Phylogenetic Analysis by Maximum Likelihood) from Ziheng Yang as one of the methods for estimating nucleotide substitution. One of the Perl scripts below is a "scripter" for the PAML program named codeml; it builds the appropriate data and control files, launches codeml, and extracts the data values from the output files. Go to abacus.gene.ucl.ac.uk/software/paml.html to obtain the code and documentation for PAML.
Some of the data files generated by our software
are in the form of Matlab "m-files" that contain data values
plus Matlab commands to display the data as a dot-plot
or bar chart.
Matlab is a commercial
package, available from Mathworks
(www.mathworks.com). If you
don't use Matlab, it will be a simple matter to extract the
data from the m-file or reformat it for another data analysis
package; the m2txt script described below is an example.
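As an illustration of pulling data out of an m-file, the following Python sketch reads simple Matlab vector assignments and rewrites them as tab-delimited text. It is not the m2txt script. The exact layout of ntalign's m-files is not described on this page, so the sketch assumes assignments of the form `name = [v1 v2 ...];`, and the variable names `Ks` and `Ka` in the example are hypothetical.

```python
import re


def read_vectors(text):
    """Collect `name = [ ... ]` assignments as {name: [floats]}.

    Assumes plain numeric vectors; other Matlab commands are ignored.
    """
    vectors = {}
    for name, body in re.findall(r"(\w+)\s*=\s*\[([^\]]*)\]", text):
        tokens = re.split(r"[\s,;]+", body.strip())
        # Drop empty tokens and Matlab line-continuation markers.
        vectors[name] = [float(t) for t in tokens if t and t != "..."]
    return vectors


def to_tab_delimited(vectors):
    """Render the vectors as columns; rows stop at the shortest vector."""
    names = list(vectors)
    lines = ["\t".join(names)]
    for row in zip(*(vectors[n] for n in names)):
        lines.append("\t".join(str(x) for x in row))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Ks = [0.1 0.2 0.3];\nKa = [0.01, 0.02, NaN];\nplot(Ks, Ka, '.');\n"
    print(to_tab_delimited(read_vectors(sample)))
```

Display commands such as `plot(...)` are skipped because they contain no bracketed assignment, so only the data values reach the output.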
The Perl scripts listed below can be freely copied and distributed without any
restrictions.
The C++ programs in the ntdiffs package are copyrighted. They may be
used without fee for research and other noncommercial purposes, but developers who
wish to include all or parts of this software in commercial products should
send e-mail to conery@cs.uoregon.edu.
The table of file extensions in the section Suggestions for Using the
Programs lists some standard extensions. Some were adopted from the SEALS
package; others were introduced for this project.
Copyrights
The ntdiffs Package
Perl Scripts
See the section Suggestions for Using the Programs below for examples of
how these scripts can be used along with scripts in the SEALS package
to create data files and analyze them with the ntalign program.
Help and Documentation
To see a quick synopsis of a program just type the program name
and "-help", e.g.
% ntalign -help
More extensive documentation of the C++ programs can be found
in the doc subdirectory of the ntdiffs package or via one of
these links:
Suggestions for Using the Programs
For the computational experiments described in our paper, we
created a set of data files for several different genomes. A
good way to organize the data is to use the common name of the
species as the root name of the file and a standard extension
to identify the format of the data in the file, e.g. "yeast.nt"
is the set of nucleotide sequences for the yeast genome and
"yeast.align" is the set of BLAST alignments for yeast.
| .gi | List of GI numbers (integers only, one per line). |
| .pt.gi | GI numbers of pseudogenes and transposons. |
| .fa | Amino acid sequences in Fasta format. |
| .nt | Nucleotide sequences in Fasta format. |
| .align | BLAST alignments. |
| .m | Matlab format output from ntalign. |
| .txt | ntalign output in plain text format. |
| .paml | Aligned nucleotide sequences saved by ntalign. |
| .err | Log file output from ntalign. |
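The naming convention in the table above (species root name plus a standard extension) can be captured in a few lines of Python. This is only a sketch for organizing your own files: the helper name `data_file` is hypothetical and is not part of the ntdiffs package or the Perl scripts.

```python
from pathlib import Path

# Extensions and descriptions copied from the table above.
EXTENSIONS = {
    ".gi": "List of GI numbers (integers only, one per line)",
    ".pt.gi": "GI numbers of pseudogenes and transposons",
    ".fa": "Amino acid sequences in Fasta format",
    ".nt": "Nucleotide sequences in Fasta format",
    ".align": "BLAST alignments",
    ".m": "Matlab format output from ntalign",
    ".txt": "ntalign output in plain text format",
    ".paml": "Aligned nucleotide sequences saved by ntalign",
    ".err": "Log file output from ntalign",
}


def data_file(species, ext, directory="."):
    """Build the path for one data file, e.g. ('yeast', '.nt') -> yeast.nt."""
    if ext not in EXTENSIONS:
        raise ValueError(f"unknown extension: {ext}")
    return Path(directory) / f"{species}{ext}"


if __name__ == "__main__":
    for ext, meaning in EXTENSIONS.items():
        print(data_file("yeast", ext), "-", meaning)
```

Rejecting unknown extensions catches typos early, before a pipeline step writes a file that later steps will not find.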
% bert X.gi | gi2genbank | feature2fasta -feature= protein
-defline '>gi|$gi|gb|$accession $definition [$organism]' > X.all.fa
% daffy X.all.fa 200 | gref '^M' > X.fa
If you don't need to filter, just rename X.all.fa to X.fa instead. Otherwise, you can delete X.all.fa at this point.
% bert X.fa | splishpgp -d X.fa -gapped -proc smart | blast2align 1e-10 > X.align
% align2gi X.align | gi2genbank | ~/bin/feature2fasta -feature= cds -use= coded_by > X.nt
NOTE 1: For this step you need to use the updated version of feature2fasta. Replace "~/bin" above with the pathname to where you installed your local copies of the scripts.
% ntalign X.align X -max 5 -expand -paml X.paml -log X.err > X.m
The above command line sets a "gene family size" cutoff of 5 (i.e. ignore any sequence that has more than 5 matches). It also tells ntalign to save the aligned nucleotide sequences for later use by the PAML programs and to keep a record of all steps taken.
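The effect of a gene family size cutoff can be sketched in Python as follows. This is an illustration, not ntalign's implementation: this page does not say exactly how ntalign counts matches, so the sketch assumes that each aligned pair counts once toward both of its members. The function name is hypothetical.

```python
from collections import Counter


def filter_by_family_size(pairs, max_matches=5):
    """Drop every pair that involves a sequence with too many matches.

    pairs: list of (id_a, id_b) tuples, one per alignment.
    Assumption: a pair counts as one match for each of its two members.
    """
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    return [
        (a, b)
        for a, b in pairs
        if counts[a] <= max_matches and counts[b] <= max_matches
    ]


if __name__ == "__main__":
    # One "hub" sequence with six matches, plus an unrelated pair.
    pairs = [("hub", f"s{i}") for i in range(6)] + [("x", "y")]
    print(filter_by_family_size(pairs, max_matches=5))
```

With a cutoff of 5, all six pairs involving the hub sequence are ignored and only the unrelated pair remains.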
% kappa X.paml -n 20 -use 3 > Xk.m
This command sorts the matches into 20 bins according to their Ks values and uses the first 3 bins to count the number of transitions and transversions. The ratio is printed on stdout. Then use the mldiffs script to run codeml for each pair in the PAML file:
% mldiffs X.paml N X.m > X.mldiffs.txt
Here N is the transition/transversion ratio computed by the kappa program. Supplying the name of the ntalign output file is optional; if it is given, the mldiffs script will not launch codeml on a pair if Ka is too big (or NaN), which can save a substantial amount of time.
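The transition/transversion ratio that kappa produces and mldiffs consumes can be illustrated with a short Python sketch. It is not the kappa program: the function names and the `(ks, seq_a, seq_b)` input layout are hypothetical, and the sketch assumes equal-width bins spanning the observed range of Ks values, which may differ from how kappa defines its bins.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue  # identical sites, gaps, and ambiguity codes are skipped
        same_class = (a in PURINES) == (b in PURINES)
        if same_class:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv


def ts_tv_ratio(matches, n_bins=20, use=3):
    """Ratio of transitions to transversions over the lowest `use` Ks bins.

    matches: list of (ks, seq_a, seq_b) tuples.
    Assumption: bins are equal-width over [min Ks, max Ks].
    """
    if not matches:
        return float("nan")
    ks_values = [m[0] for m in matches]
    low, high = min(ks_values), max(ks_values)
    width = (high - low) / n_bins or 1.0  # avoid zero width if all Ks equal
    ts_total = tv_total = 0
    for ks, seq_a, seq_b in matches:
        bin_index = min(int((ks - low) / width), n_bins - 1)
        if bin_index < use:
            ts, tv = count_ts_tv(seq_a, seq_b)
            ts_total += ts
            tv_total += tv
    return ts_total / tv_total if tv_total else float("nan")


if __name__ == "__main__":
    matches = [
        (0.0, "AAAA", "GAAC"),
        (0.06, "CCCC", "TCCC"),
        (1.0, "AAAA", "CCCC"),
    ]
    print(ts_tv_ratio(matches, n_bins=20, use=3))
```

Restricting the count to the lowest Ks bins uses only the most similar pairs, where multiple substitutions at the same site are least likely to distort the observed ratio.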
As of 6/26/00: