Software for Analyzing Nucleotide Substitutions

Michael Lynch
John S. Conery
Computational Science Institute
University of Oregon

This web page has links to the software used to create the data sets for our paper on the evolution of duplicate genes (see http://www.csi.uoregon.edu/projects/genetics/duplications).

A package called ntdiffs is a collection of C++ applications and its associated class library. The main piece of software is an application named ntalign, which aligns two nucleotide sequences, counts the number of differences between the sequences, and uses those counts to estimate the number of substitutions that have occurred in the evolution of the sequences from a common ancestor.

Also available are a set of Perl scripts that manage data files used as inputs to or generated as output from ntalign.

Contents

Requirements Copyrights
The ntdiffs Package Perl Scripts
Help and Documentation Suggestions for Using the Programs
Sample Data Status

Requirements and Related Software

The C++ programs in the ntdiffs package were written using the C++ Standard Library, also known as the Standard Template Library, or STL. If you are going to compile the programs, you need to have a compiler that supports templates and have a locally installed copy of the STL. The STL website at SGI (www.sgi.com/Technology/STL) has user documentation and copies for downloading.

The Perl scripts were written in Perl5. Visit www.perl.org for information about the latest version of Perl.

Although not strictly necessary for running ntalign or most of the Perl scripts described below, the SEALS package (System for Easy Analysis of Lots of Sequences) by Roland Walker at NCBI is extremely useful for building data files and running BLAST. Documentation and source code are available from www.ncbi.nlm.nih.gov/Walker/SEALS.

If you want to run BLAST locally you can find a version for your system at the NCBI FTP Site.

We used the PAML package (Phylogenetic Analysis by Maximum Likelihood) from Ziheng Yang as one of the methods for estimating nucleotide substitution. One of the Perl scripts below is a "scripter" for the PAML program named codeml; it builds the appropriate data and control files, launches codeml, and extracts the data values from the output files. Go to abacus.gene.ucl.ac.uk/software/paml.html to obtain the code and documentation for PAML.

Some of the data files generated by our software are in the form of Matlab "m-files" that contain data values plus Matlab commands to display the data as a dot-plot or bar chart. Matlab is a commercial package, available from Mathworks (www.mathworks.com). If you don't use Matlab, it will be a simple matter to extract the data from the m-file or reformat it for another data analysis package; the m2txt script described below is an example.

Copyrights

The Perl scripts listed below can be freely copied and distributed without any restrictions.

The C++ programs in the ntdiffs package are copyrighted. They may be used without fee for research and other noncommercial purposes, but developers who wish to include all or parts of this software in commercial products should send e-mail to conery@cs.uoregon.edu.

The ntdiffs Package

ntdiffs.tar.Z
Compressed tar file for the ntdiffs package. Uncompressing and untarring creates a new directory named ntdiffs with subdirectories for documentation (doc), the class library (lib), a set of demo programs (demo), and three applications (ntalign, kappa, and cubit). cd to the top level ntdiffs directory and type make to compile all the classes, the demo programs, and all the applications.

Perl Scripts

See the section on
Suggested Uses below for examples of how these scripts can be used along with scripts in the SEALS package to create data files and analyze them with the ntalign program.
blast2align
Extract alignments from a stream of BLAST reports. For each alignment, the script writes the IDs of the two sequences being compared, the expect value of the comparison, and the alignment produced by BLAST. This is the format expected by the ntalign application.
align2gi
Extract the gene IDs from an alignment file. The IDs are sorted and duplicates are removed.
feature2fasta
An updated version of the feature2fasta file of the SEALS package. The format of Genbank records changes since the last release of SEALS; use this version if you are building nucleotide sequence files.
insertml
Merge the results of the codeml program of the PAML package with an existing data file (see mldiffs, below).
m2txt
Extract data from a Matlab m-file produced by ntalign.
matchstats
Process the log file produced by ntalign and do an "audit" that tries to account for every input sequence. Produces a report that lists the number of sequences compared, including the number of data points lost due to formatting errors, missing sequences, etc.
mldiffs
A scripter for the codeml program. If the ntalign program is run with the "-paml" switch it saves the aligned nucleotide sequences in a file. This script reads that file, grabbing two sequences at a time. It uses the sequences to construct a control file, launches codeml, and then extracts the Ks and Ka data values from the codeml output files.
orfify
Add gene ID information from an "ORF file" to an existing data file.

Help and Documentation

To see a quick synopsis of a program just type the program name and "-help", e.g.
% ntalign -help
More extensive documentation of the C++ programs can be found in the doc subdirectory of the ntdiffs package or via one of these links:

Suggestions for Using the Programs

For the computational experiments described in our paper, we created a set of data files for several different genomes. A good way to organize the data is to use the common name of the species as the root name of the file and a standard extension to identify the format of the data in the file, e.g. "yeast.nt" is the set of nucleotide sequences for the yeast genome and "yeast.align" is the set of BLAST alignments for yeast.

The following table lists some standard extensions. Some were adopted from the SEALS package, others were introduced for this project.

.gi List of GI numbers (integers only, one per line).
.pt.gi GI numbers of pseudogenes and transposons.
.fa Amino acid sequences in Fasta format.
.nt Nucleotide sequences in Fasta format.
.align BLAST alignments.
.m Matlab format output from ntalign.
.txt ntalign output in plain text format.
.paml Aligned nucleotide sequences saved by ntalign.
.err Log file output from ntalign.
We used the following steps to create the data files for each genome (where X stands for the species name). Note: some long command lines have been split into two lines for readability.

  1. Obtain a list of all protein sequences for the species. Go to the NCBI Batch Entrez server. Use the form to select "protein sequences", "download list of GIs", "retrieve all sequences for a specific organism", and the organism (either from menu or by name). Save the result in a file named X.gi.

  2. Create an amino acid Fasta file for each sequence using SEALS:
    % bert X.gi | gi2genbank | feature2fasta -feature= protein 
        -defline '>gi|$gi|gb|$accession $definition [$organism]' > X.all.fa
    

  3. There are probably fragments and other junk in the Fasta file. Here is a simple filter that removes all sequences that do not start with M or are shorter than 200 letters, again using SEALS:
    % daffy X.all.fa 200 | gref '^M' > X.fa
    
    If you don't need to filter, just rename X.all.fa to X.fa. You can delete X.all.fa at this point.

  4. Run BLAST on the fasta protein file, comparing each sequence against all others in the file, and pipe the BLAST output to blast2align to have it save all alignments with E < 1e-10:
    % bert X.fa | splishpgp -d X.fa -gapped -proc smart | blast2align 1e-10 > X.align
    

  5. Create the nucleotide sequence file. Download nucleotides for each protein that figures in a significant match:
    % align2gi X.align | gi2genbank | ~/bin/feature2fasta -feature= cds -use= coded_by > X.nt
    
    NOTE 1: For this step you need to use the updated version of feature2fasta. Replace "~/bin" above with the pathname to where you installed your local copies of the scripts.
    NOTE 2: The ntalign program requires amino acid sequences and nucleotide sequences for the same gene to have the same GI number in their deflines. The command shown above will do that; if you build the nucleotide sequences with some other method make sure the deflines start ">gi|N" where N is the GI number of the corresponding amino acid sequence.

  6. Run the ntalign program to get estimates of Ks and Ka:
    % ntalign X.align X -max 5 -expand -paml X.paml -log X.err > X.m
    
    The above command line sets a "gene family size" cutoff of 5 (i.e. ignore any sequence that has more than 5 matches). It also tells ntalign to save the aligned nucleotide sequences for later use by the PAML programs and to keep a record of all steps taken.

  7. If you want to use codeml to give another estimate of Ks and Ka, first use the kappa program to compute the transition/transversion ratio:
    % kappa X.paml -n 20 -use 3 > Xk.m
    
    This command sorts the matches into 20 bins according to their Ks values and uses the first 3 bins to count the number of transitions and transversions. The ratio is printed on stdout. Then use the mldiffs script to run codeml for each pair in the PAML file:
    % mldiffs X.paml N X.m > X.mldiffs.txt
    
    Here N is the transition/transversion ratio computed by the kappa program. Supplying the name of the ntalign output file is optional; if it is given, the mldiffs script will not launch codeml on a pair if Ka is too big (or NaN), which can save a substantial amount of time.

Sample Data

Use the links below to download copies of data files that can be used to test the programs. The sequences are from the zebrafish (Danio rerio) genome.

Status

As of 6/26/00: