"Adaptive evolution of non-coding DNA in Drosophila".
Peter Andolfatto 1
1 Section of Ecology, Behavior and Evolution, Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, USA
1 Correspondence and requests for materials should be
addressed to P.A.
( Email: pandolfatto@ucsd.edu
)
A large fraction of eukaryotic genomes consists of DNA that is not
translated into protein sequence, and little is known about its functional
significance. Here I show that several classes of non-coding DNA in Drosophila
are evolving considerably slower than synonymous sites, and yet show an
excess of between-species divergence relative to polymorphism when compared
with synonymous sites. The former is a hallmark of selective constraint,
but the latter is a signature of adaptive evolution, resembling general
patterns of protein evolution in Drosophila [1, 2].
I estimate that about 40–70% of nucleotides in intergenic regions, untranslated
portions of mature mRNAs (UTRs) and most intronic DNA are evolutionarily
constrained relative to synonymous sites. However, I also use an extension
to the McDonald–Kreitman test [3] to show that a substantial
fraction of the nucleotide divergence in these regions was driven to fixation
by positive selection (about 20% for most intronic and intergenic DNA,
and 60% for UTRs). On the basis of these observations, I suggest
that a large fraction of the non-translated genome is functionally important
and subject to both purifying selection and adaptive evolution. These results
imply that, although positive selection is clearly an important facet of
protein evolution, adaptive changes to non-coding DNA might have
been considerably
more common in the evolution of D. melanogaster.
There are some limitations to the comparative genomics approach. First, a given genomic region might be conserved owing simply to a lower mutation rate [12]. Second, known regulatory elements do not seem to be particularly well conserved as a class, at least in Drosophila [10]. This finding suggests that taking an approach based on sequence conservation alone may lead to a biased view of regulatory evolution. Functionality of DNA sequences implies that they can be subject to both negative and positive selection. If a significant fraction of divergence between species observed in non-coding DNA is positively selected rather than selectively neutral or constrained, this could lead to underestimates of the functional importance of non-coding DNA and cause researchers to overlook the contribution of arguably the most interesting class of mutations in genome evolution—those reflecting adaptive differences between populations and species.
These limitations can be overcome by combining comparative genomic analyses with population-level variability data [1, 2, 3, 13]. To assess the mode of selection acting on non-coding DNA, I have analysed new and previously published polymorphism data for 35 coding fragments (average length 667 base pairs (bp)) and 153 non-coding fragments (average length 426 bp) scattered across the X chromosome of D. melanogaster (see Supplementary Materials 1). To estimate levels of between-species divergence, I have compared D. melanogaster with its closely related sibling species, D. simulans.
On the basis of the current Drosophila genome annotation (release
4), I separated the surveyed fragments into several categories that are
likely to differ in the intensity and mode of selection acting on them
(see Table 1). It is apparent that most non-coding DNA
evolves considerably slower than synonymous sites (that is, sites in protein-coding
sequences at which mutations do not result in amino acid substitutions;
Table
1). This is the case for introns and UTRs (see also refs 14,
15–16),
as well as intergenic DNA, much of which is far from the closest known
gene (see Supplementary Materials 1).
I estimate levels of constraint in Drosophila non-coding DNA to
be 40% for introns, 50% for intergenic regions (IGRs), and 60% for
UTRs (Table 2). These are all considerably higher
than previous estimates from a variety of species comparisons [11,
15,
16, 17, 18. The non-coding DNA surveyed is also generally less polymorphic
than synonymous sites in D. melanogaster (Table 1;
p
< 10-10, Wilcoxon two-sample test for UTRs and introns +
IGRs versus synonymous sites). Thus, both polymorphism and divergence in
non-coding DNA are significantly reduced relative to synonymous sites in
D. melanogaster.
Table 1: Polymorphism and divergence in coding and non-coding
DNA of D. melanogaster.
Table 1. Polymorphism and divergence in coding and non-coding DNA of D. melanogaster.
Table 2 - Functionally relevant nucleotides in non-coding DNA.
Reduced levels of polymorphism and divergence in non-coding DNA resemble
general patterns of protein evolution [19] and suggest
that non-coding DNA is either functionally constrained or is subject to
a lower mutation rate than synonymous sites. One way to distinguish between
these two models is to consider the distribution of polymorphism frequencies.
Negative selection acting on polymorphic variants will keep them at lower
frequencies in a population than expected if they were neutral [20].
Consistent with this prediction, the distribution of polymorphism frequencies
at both non-coding DNA and amino acid sites is skewed towards rare frequencies
relative to synonymous polymorphisms (as indicated by a more negative Tajima's
D
value [20], Fig. 1). The distribution
of Tajima's D values for non-synonymous sites among loci is negatively
skewed relative to synonymous sites, suggesting that amino acid polymorphisms
are subject to purifying selection (Fig. 1; p
= 0.002, Wilcoxon two-sample test versus synonymous sites). Here I show
that this same pattern extends to polymorphisms in non-coding DNA (Fig.
1; Wilcoxon test versus synonymous sites: pooled non-coding, p
= 0.0001; UTRs, p < 0.0001; introns, p = 0.001; IGRs,
p
= 0.005). This finding, together with the observed reduction in polymorphism
and divergence, implies that mutations in non-coding DNA are subject, on
average, to stronger negative selection than synonymous sites (see also
Supplementary Materials 2).
Figure 1: Mean Tajima's D values for coding and non-coding
DNA.
FIGURE 1. Mean Tajima's D values for coding and non-coding DNA.
Means across loci are given with bars indicating two standard errors. The expectation of D under the neutral model is shown as a dotted line. Syn, synonymous sites; NonSyn, non-synonymous sites; NonCod, pooled non-coding DNA.
Does selective constraint alone account for patterns of non-coding DNA evolution? McDonald and Kreitman [3] have proposed a framework to distinguish neutrality (and variation in mutation rate) from negative and positive selection in the genome. Their approach compares levels of polymorphism within and divergence between species for a putatively selected class of sites in the genome to a neutral standard. If reduced levels of polymorphism and divergence in non-coding DNA can be explained by a lower mutation rate, the ratio of polymorphism to divergence should be similar to that for synonymous sites. Positive selection will increase divergence relative to polymorphism at selected sites, whereas negative selection is expected to result in the opposite pattern [21]. Although this framework was originally designed to detect selection within protein-coding genes, it can be generalized to consider arbitrary classes of putatively selected sites sampled from multiple genomic regions, including non-coding DNA (see Supplementary Materials 2). Using all polymorphisms, there is a significant excess of divergence for amino acid replacement sites (p = 5 x 10-7) and for UTRs (p = 3 x 10-6, two-tailed Fisher's exact test) but not at other subclasses of non-coding DNA (Table 1). This preliminary analysis suggests that, similar to the pattern observed for amino acid substitutions ]1, 2], a significant proportion of nucleotide divergence at UTRs was also driven to fixation by positive selection.
The presence of weakly negatively selected variants in polymorphism can mask the signature of adaptive evolution in the genome [1, 22], making the McDonald–Kreitman test very conservative. As I have shown above that polymorphic variants in non-coding DNA are subject to stronger selective constraint than synonymous sites (Table 1 and Fig. 1), negatively selected variants contributing to polymorphism in non-coding DNA are likely to be a factor limiting power to detect positive selection. This problem can be partially overcome by considering only those mutations that are not rare in a sample from both the neutral and putatively selected classes (see ref. 23 and Supplementary Materials 2). Applying this approach reveals a significant excess of divergence in UTRs and in most other classes of non-coding DNA relative to synonymous sites (Table 1; UTRs, p = 5 x 10-12; introns, p = 0.01; dIGRs, p = 0.04; introns + IGRs, p = 0.01). A Hudson–Kreitman–Aguadé (HKA) test [24] also provides statistical support for a reduced ratio of polymorphism to divergence for non-coding DNA relative to synonymous sites (UTRs, p < 10-3; pooled introns and IGRs, p = 0.02; see Supplementary Materials 2). Together, these results show that a significant fraction of the divergence in UTRs, introns and intergenic DNA was probably driven to fixation by positive selection.
To quantify the intensity and the relative importance of positive
selection in shaping the evolution of non-coding DNA, I apply two extensions
of the McDonald–Kreitman approach [2, 13].
First I estimate a, defined as the proportion
of the divergence between species that was driven by positive selection
[2]. I estimate that about 20% of the nucleotide divergence
in introns and intergenic DNA was driven to fixation by positive selection,
and about 60% for UTRs (Fig. 2a and Table
2). Using a hierarchical bayesian framework [13],
I estimate the selection intensity on non-coding DNA (including UTRs, introns
and IGRs) to be positive and significantly different from zero in most
cases (Fig. 2b; Supplementary
Materials 3). As this bayesian approach assumes that segregating and
fixed variants are subject to the same direction and intensity of selection,
it is likely to underestimate the magnitude of 2Nes (the intensity
of selection) for nucleotide substitutions fixed by positive selection
(see Supplementary Materials 2).
Figure 2: Quantifying adaptive divergence and selection intensity.
FIGURE 2. Quantifying adaptive divergence and selection intensity.
a, Estimates of a, the fraction of nucleotide divergence driven by positive selection. Error bars indicate 90% confidence limits determined by a non-parametric bootstrapping. Estimated probabilities that a>/= 0 corrected for partial linkage are given in Table 2.
b, Estimates of the intensity of selection (2Nes) acting on non-synonymous and non-coding DNA sites. Error bars indicate 90% confidence limits determined by simulation (see Methods). Singleton polymorphisms were excluded in estimates of a and 2Nes (see Supplementary Materials 3). Abbreviations as in Fig. 1.
Evidence that a significant fraction of non-coding DNA is functionally important is emerging from a variety of comparative genomic studies. However, my finding of a large fraction of positively selected divergence implies that 'evolutionary constraint' will substantially underestimate the fraction of functionally relevant nucleotides because it ignores the contribution of positively selected mutations to divergence. For the example of UTRs, I estimate evolutionary constraint to be 60%. However, as 58% of the observed divergence was positively selected, this implies that 83% of nucleotides in UTRs are in fact functionally relevant. Likewise, the fraction of functionally relevant nucleotides in introns and IGRs is likely to be about 10–20% higher than suggested by levels of constraint alone (Table 2).
How frequent is adaptation in the Drosophila genome? Rough
calculations (see Supplementary Materials
4) suggest that there has been about one adaptive amino acid substitution
every 20 years since the split of D. melanogaster and D. simulans
(see also ref. 2). Although this is substantial, consider
that the total number of sites contained in introns, intergenic regions
and UTRs far outweighs the number of codons in the Drosophila genome
[25]. I estimate that UTRs alone contribute as much
to adaptive divergence between species as do amino acid changes, and the
summed contribution of non-coding DNA to adaptive divergence could easily
be an order of magnitude larger. These findings support previous intuitions
[4, 5] about the great importance of regulatory changes
in evolution.
Methods:
Data:
All loci used in this study, previously published or newly collected, are X-linked genomic fragments, with a sample size of 12 D. melanogaster alleles sampled from a population in Zimbabwe, and a single D. simulans sequence. For coding DNA (synonymous and non-synonymous sites), I collected polymorphism and divergence in 31 coding regions selected randomly with respect to gene function, and 51 non-coding regions (27 intergenic and 24 untranslated transcribed regions). Information about these 82 loci and primers used can be found in Supplementary Materials 1. I used polymerase chain reaction (PCR) to amplify 700–800-bp regions from genomic DNA extracted from single male flies, removed primers and nucleotides using exonuclease I and shrimp alkaline phosphatase, and sequenced the cleaned product on both strands using Big-Dye (Version 3, Applied Biosystems). Sequences were collected on an ABI 3730 capillary sequencer and were aligned and edited using the program Sequencher (Gene Codes).
To the 82 regions surveyed above, I added previously published data
for loci that had the same sample size (n = 12 flies) and were surveyed
in similar samples from Zimbabwe [26, 27]. A number
of the previously published loci [26] had to be functionally
reassigned when compared to Release 4 of the annotated D. melanogaster
genome ( http://flybase.bio.indiana.edu/annot/dmel-release4.html
). I excluded any loci in regions of reduced recombination (see below).
Previously published loci fitting these requirements were processed into
106 fragments (4 coding, 7 UTR, 23 intergenic and 72 intron). Thus, the
total number of regions surveyed in this analysis is 188. Alignments for
each locus are available upon request. A reciprocal best-hit BLAST protocol
was used to confirm that the regions compared between D. melanogaster
and D. simulans are indeed orthologous. Extra gaps were introduced
into some alignments in regions that were particularly difficult to align.
This procedure is likely to upwardly bias estimates of constraint, but
is conservative with respect to detecting positive selection.
Analyses:
The estimated number of synonymous sites, non-synonymous sites, average pairwise diversity (p), average pairwise divergence (Dxy), as well as counts of the number of polymorphic sites (P) were performed using DnaSP software (version 4; http://www.ub.es/dnasp/ ) and Perl code written by P.A. The number of divergent sites (D) was estimated as Dxy - p using a Jukes–Cantor correction for multiple hits. Multiply hit sites were included in the analysis but insertion–deletion polymorphisms and mutations overlapping alignment gaps were excluded. Derived mutations were polarized using a single D. simulans sequence and assuming standard parsimony criteria. Tajima's D value [20] was estimated from the number of polymorphisms and p.
In this study, I assume that synonymous sites are more neutral than putatively selected classes of sites (see Supplementary Materials 2.2). I separated non-coding DNA into subclasses that I expected a priori to experience different selection pressures: 5' and 3' untranslated transcribed regions (UTRs), introns, intergenic regions within 2 kilobases (kb) of a gene (proximal intergenic regions, pIGRs), and intergenic regions further than 4 kb from the nearest gene (distal intergenic regions, dIGRs). My sample of intron fragments is biased towards introns larger than the median intron size (86 bp) (ref. 28), making estimates of constraint higher than expected with a random sample of introns [14]. However, 95% of intronic DNA is contained within introns longer than the median size [28], and thus my estimate reflects levels of constraint for most intronic DNA in the genome.
For comparisons of polymorphism and divergence between synonymous sites and non-coding DNA, it was necessary to pool sites in each class. I estimate evolutionary constraint relative to fourfold degenerate synonymous sites using the approach in ref. 15, except that I pooled classes of sites and used a Jukes–Cantor correction for multiple hits [19]. Given differences in base composition between coding and non-coding regions, I investigated possible differences in mutations rates owing to the 16 possible adjacent-base contexts of nucleotides (suggested by A. Kondrashov). There was no significant effect of adjacent-base context on rates of divergence (see Supplementary Materials 5).
I estimate the proportion of divergence driven by positive selection
[1, 2] as a = 1–(DSPX/DXPS),
where S denotes synonymous (that is, putatively neutral) sites,
X denotes putatively selected sites, and D=SUMnj=1Di
and P=SUMnj=1Pi,
where Di and Pi are the number of divergent
and polymorphic variants at locus i, respectively, and n
is the number of loci of class S or X. Confidence limits on a
were estimated using a standard non-parametric bootstrapping procedure,
assuming sites are independent. The issue of non-independence of sites
within surveyed fragments is addressed in Supplementary
Materials 2.5. For consistency, a was estimated
for non-synonymous sites in the same way. The intensity of selection (2Nes)
was estimated on putatively selected classes (pooling sites as above) using
a hierarchical bayesian method ( http://cbsuapps.tc.cornell.edu
) [13]. To avoid problems associated with large-scale
variation in recombination rates, I restricted my survey of loci to regions
of the X chromosome that have the highest rates of recombination [29]
(see Supplementary Fig. 1.1).
Acknowledgments:
The author thanks D. Bachtrog for extensive comments on the manuscript and help with data quality issues, C. Bustamante and K. Thornton for providing code, and B. Ballard for Zimbabwe fly lines. P. Haddrill and K. Thornton assisted in designing primers for distal intergenic and coding regions, respectively. Thanks to B. Fischman for technical help, A. Betancourt, A. Kondrashov, A. Poon, D. Presgraves, M. Przeworski and S. Wright for critical comments on the manuscript, and L. Chao and J. Huelsenbeck for advice. Thanks also to the Washington University Genome Sequencing Center for providing unpublished D. simulans sequences. This work was funded in part by a research grant from the Biotechnology and Biological Sciences Research Council (UK) to P.A. The author is supported by an Alfred P. Sloan Fellowship in Molecular and Computational Biology.
Competing interests statement: The author declared no competing interests.
http://www.nature.com/nature/journal/v437/n7062/suppinfo/nature04107.html
1. Kapranov P, Drenkow J, Cheng J, Long J, Helt G, Dike S, and Gingeras TR, "Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays".
2. Lee S, Bao J, Zhou G, Shapiro J, Xu J, Shi RZ, Lu X, Clark T, Johnson D, Kim YC, Wing C, Tseng C, Sun M, Lin W, Wang J, Yang H, Wang J, Du W, Wu C-I, Zhang X, and Wang SM, "Detecting novel low-abundant transcripts in Drosophila".
3. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, and Gingeras TR, "Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution".
4.
5. Storz G, Altuvia S, and Wassarman KM, "An Abundance of RNA Regulators".
6. Kuwabara T, Hsieh J, Nakashima K, Taira K, and Gage FH, "A Small Modulatory dsRNA Specifies the Fate of Adult Neural Stem Cells".
7. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, and Haussler D, "Ultraconserved Elements in the Human Genome".
8. Tiedge H, "RNA Reigns in Neurons".
Links to RNA and Biological Causality:
Links to
Euchromatin Activator RNA Reviews:
Links to
Euchromatin Activator RNA Research:
Links to Ultrastructural
Probes of DNase I-Sensitive Sites:
Links to
RNA as a Therapeutic Agent:
Links to Hodgkin Lymphoma
Immuno-Pathology:
Links to Activated
T-Lymphocyte Immunotherapy:
Links to Medical
Systems Biology:
Links to Selective
Gene Transcription:
Links to RNA-Induced
Epigenetics:
Links to RNA-Induced
Embryogenesis:
Links to RNA and
Biological Causality:
Links to Reprogramming
and Neoplasia:
A Brief History of Activator RNA:
"Ultrastructural
Probes of Active DNA Sites, and the RNA Activators of DNA". (PowerPoint
Presentation).