Gene Duplication Data File Format

Michael Lynch
John S. Conery
Computational Science Institute
University of Oregon

Line Format

All files are plain text. Each line describes one pair of duplicate genes; the data items on a line are separated by one or more spaces.

The first six columns of each line are:
nsites  Number of nucleotide sites compared.
ndel    Number of sites deleted by the "gap expansion" algorithm.
S       Number of synonymous substitutions per synonymous site, computed using a maximum likelihood method.
R       Number of nonsynonymous (replacement) substitutions per nonsynonymous site, also computed using a maximum likelihood method.
ID1     ID of the first gene in the pair.
ID2     ID of the second gene.

For those species for which we had physical map locations, there are six additional fields on each line:
Chr1    Chromosome ID of the first gene.
Start1  Starting location of the first gene (number of nucleotides from the start of the chromosome).
End1    Ending location of the first gene.
Chr2    Chromosome ID of the second gene.
Start2  Starting location of the second gene.
End2    Ending location of the second gene.

For S. cerevisiae and C. elegans the Chr1 and Chr2 fields are chromosome numbers. For D. melanogaster these fields contain "scaffold" numbers from Celera Genomics.
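
The line format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the original distribution. The names parse_pair, DuplicatePair and MapLocation are hypothetical, and the sketch assumes that nsites, ndel and the start/end locations are written as integers and that S and R are written as decimal numbers. The chromosome fields are kept as strings so that chromosome numbers and scaffold numbers are handled the same way.

```python
from typing import NamedTuple, Optional


class MapLocation(NamedTuple):
    """Physical map location of one gene (hypothetical helper type)."""
    chrom: str   # chromosome or scaffold ID, kept as text
    start: int   # nucleotides from the start of the chromosome
    end: int


class DuplicatePair(NamedTuple):
    """One line of a data file (hypothetical helper type)."""
    nsites: int
    ndel: int
    S: float
    R: float
    id1: str
    id2: str
    loc1: Optional[MapLocation] = None
    loc2: Optional[MapLocation] = None


def parse_pair(line: str) -> DuplicatePair:
    """Parse one whitespace-separated line with 6 or 12 fields."""
    fields = line.split()  # splits on runs of one or more spaces
    if len(fields) not in (6, 12):
        raise ValueError(f"expected 6 or 12 fields, got {len(fields)}")

    loc1 = loc2 = None
    if len(fields) == 12:
        # Assumption: Start/End values are plain integers.
        loc1 = MapLocation(fields[6], int(fields[7]), int(fields[8]))
        loc2 = MapLocation(fields[9], int(fields[10]), int(fields[11]))

    return DuplicatePair(
        nsites=int(fields[0]),   # assumption: integer count
        ndel=int(fields[1]),     # assumption: integer count
        S=float(fields[2]),
        R=float(fields[3]),
        id1=fields[4],
        id2=fields[5],
        loc1=loc1,
        loc2=loc2,
    )
```

Calling parse_pair on each line of a file yields one record per gene pair; loc1 and loc2 are None for species without physical map locations.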

Gene Identifiers

For most data sets the ID field is the NCBI GenBank identifier (GI) number for the amino acid sequence we used in the initial BLAST search.

For C. elegans we used sequences obtained from the Sanger Centre Wormpep Database (Version 24, Sept. 2000). Many of these sequences did not have GenBank GI numbers, so the ID fields in the data file for C. elegans have gene names instead of GI numbers.
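
Because the ID fields may hold either a GI number or a gene name, code that reads the files should treat them as text. As a rough sketch, the hypothetical helper below tells the two kinds apart by assuming that a GI number appears in the file as a string of decimal digits only and that a gene name contains at least one other character. Neither assumption is confirmed by the format description.

```python
def is_gi_number(gene_id: str) -> bool:
    """Return True if the ID consists only of ASCII digits.

    Assumption: GI numbers are written as bare decimal integers,
    and gene names always contain at least one non-digit character.
    """
    return gene_id.isascii() and gene_id.isdigit()
```

Under these assumptions an ID such as 1234567 is classified as a GI number, while a made-up name such as geneA.1 is not.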