All files are plain text. Each line describes one pair of duplicate genes, where data items are separated by one or more spaces.
The first six columns of each line are:
| nsites | Number of nucleotide sites compared. |
| ndel | Number of sites deleted by the "gap expansion" algorithm. |
| S | Number of synonymous substitutions per synonymous site, computed using a maximum likelihood method. |
| R | Number of nonsynonymous (replacement) substitutions per nonsynonymous site, also computed using a maximum likelihood method. |
| ID1 | ID of the first gene in the pair. |
| ID2 | ID of the second gene. |
For those species for which we had physical map locations, there are six additional fields on each line:
| Chr1 | Chromosome ID for the first gene. |
| Start1 | Starting location of the first gene (number of nucleotides from the start of the chromosome). |
| End1 | Ending location of the first gene. |
| Chr2 | Chromosome ID of the second gene. |
| Start2 | Starting location of the second gene. |
| End2 | Ending location of the second gene. |
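The field layout above can be read with a few lines of Python. This is a minimal sketch, not a tool distributed with the data: the names `GenePair`, `Location`, and `parse_pair` are hypothetical, and the sketch assumes that nsites, ndel, and the start/end positions are written as integers and that S and R are written as decimal numbers. Chromosome and gene IDs are kept as strings, since they may be numbers, scaffold identifiers, or gene names depending on the species.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """Physical map location of one gene (hypothetical helper type)."""
    chrom: str   # chromosome or scaffold ID, kept as text
    start: int   # nucleotides from the start of the chromosome
    end: int


class GenePair(NamedTuple):
    """One line of a data file (hypothetical helper type)."""
    nsites: int
    ndel: int
    S: float
    R: float
    id1: str
    id2: str
    loc1: Optional[Location] = None  # present only in 12-field lines
    loc2: Optional[Location] = None


def parse_pair(line: str) -> GenePair:
    """Parse one whitespace-separated line with 6 or 12 fields."""
    fields = line.split()  # splits on runs of one or more spaces
    if len(fields) not in (6, 12):
        raise ValueError(f"expected 6 or 12 fields, got {len(fields)}")

    # Assumption: counts are integers, S and R are decimal numbers.
    nsites, ndel = int(fields[0]), int(fields[1])
    s, r = float(fields[2]), float(fields[3])
    id1, id2 = fields[4], fields[5]

    loc1 = loc2 = None
    if len(fields) == 12:
        # Assumption: start and end positions are integers.
        loc1 = Location(fields[6], int(fields[7]), int(fields[8]))
        loc2 = Location(fields[9], int(fields[10]), int(fields[11]))

    return GenePair(nsites, ndel, s, r, id1, id2, loc1, loc2)
```

For example, `parse_pair("900 12 0.25 0.03 geneA geneB")` (a made-up line, not taken from the data files) yields a record whose `loc1` and `loc2` are `None`, while a 12-field line fills in both locations.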
For S. cerevisiae and C. elegans the Chr1 and Chr2 fields are chromosome numbers. For D. melanogaster these fields contain "scaffold" numbers from Celera Genomics.
For most data sets the ID fields contain the NCBI GenBank identifier (GI) number of the amino acid sequence we used in the initial BLAST search.
For C. elegans we used sequences obtained from the Sanger Centre Wormpep Database (Version 24, Sept. 2000). Many of these sequences did not have GenBank GI numbers, so the ID fields in the data file for C. elegans contain gene names instead of GI numbers.