“It’s likely that 80 percent [estimate of functional human DNA] will go to 100 percent. We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful." — ENCODE lead researcher Ewan Birney.3
"If the human genome is indeed devoid of junk DNA as implied by the ENCODE project, then a long, undirected evolutionary process cannot explain the human genome... If ENCODE is right, evolution is wrong." — evolutionary biologist and atheist Dan Graur4
Humans and other mammals have about 3 billion "letters" of DNA. If evolution produced complex mammals like humans, we should expect most DNA to have no function, because:
- 3 billion DNA letters is much more function than evolution could produce, given even best-case scenarios.5 6
- Humans also get about 100 mutations per generation.7 Most mutations that have any effect on an organism are harmful.8 If even 10% of DNA has a specific sequence, then humans and other mammals would receive almost 10 new harmful mutations each generation, causing all offspring to have more harmful mutations than their parents.9 10 11 Function would perpetually decline with each generation. If evolution can't even preserve the function in our DNA, it could not have created it.
However it wouldn't make sense for a designer to create large amounts of non-functional DNA. The only junk DNA would be that which has been destroyed by mutations since the original design.
Most human DNA has not yet been tested to see if it has a function.12 But we can still extrapolate from known data:
- At least 85% of DNA is copied (transcribed) into RNA.13
- When and where DNA is copied to RNA occurs in specific patterns that depend on the cell type and the stage of development.14 12 13
- Among DNA copied to RNA transcripts in the human brain, at least 80% are taken to specific locations within their cells.14
- Enough RNA has been tested for function that we can "draw broader conclusions about the likely functionality of the rest."12
If at least 85% of DNA is copied to RNA, and at least 80% of those RNAs are taken to specific locations within cells, then 85% * 80% = at least 68% of human DNA is used in a functional way. The true figure is likely much higher, because both numbers are lower-bound estimates. Since the amount of known function keeps increasing as more DNA is studied, it is reasonable to think that perhaps even 99%+ of DNA is in use.3 This is not to say that every DNA letter within these sequences requires a specific sequence.
Within a functional element of DNA, not all letters have to have a specific sequence. But how many do?
- If 85% of DNA is copied to RNA transcripts, 80% of those RNA transcripts are functional (transported to specific locations),14 and 66% of that RNA requires a specific sequence,15 then these three numbers multiplied together suggest at least 45% of DNA requires a specific sequence.
- At least 20% of DNA consists of either specific sequences where proteins bind to it, or instructions for making proteins (exons),16 and much known function exists outside of protein binding spots and exons.
- About 95% of mutations that cause noticeable effects are outside of the 1-3% of DNA that creates proteins.17 18 From this (calculated below) we can extrapolate that at least 30% of DNA has a specific sequence.
- About 16% to 30% of either human DNA, or RNA copied from it, is shared (conserved) with distantly related mammals.19 These mammals are distant enough that, if we assume they evolved from a common ancestor, mutations would have had enough time to scramble this DNA if it were non-functional.
Since these are each under-estimates, and partially non-overlapping, the true amount of sequence-specific DNA is likely greater.
The data on the amount of functional DNA nullifies the argument that almost all DNA is junk and is therefore not designed.
Furthermore, at least 16-45% of DNA has a specific sequence, and likely much more. This implies humans get at least 16 to 45 harmful mutations per generation--far too many for evolution to even prevent declining function.
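The arithmetic behind these per-generation figures can be sketched in a few lines of Python. This is only an illustration of the reasoning as stated: the function name is hypothetical, and it assumes that mutations fall uniformly across the genome and that every mutation landing in sequence-specific DNA counts as harmful.

```python
def expected_harmful_mutations(mutations_per_generation: float,
                               sequence_specific_fraction: float) -> float:
    """Expected number of new mutations landing in sequence-specific DNA.

    Assumes mutations fall uniformly across the genome and that every hit in
    sequence-specific DNA counts as harmful (the simplification used in the text).
    """
    return mutations_per_generation * sequence_specific_fraction


MUTATIONS_PER_GENERATION = 100  # figure quoted in the text

# 10% is the illustrative case from the opening list; 16% and 45% are the
# low and high ends of the estimated sequence-specific range.
for fraction in (0.10, 0.16, 0.45):
    harmful = expected_harmful_mutations(MUTATIONS_PER_GENERATION, fraction)
    print(f"{fraction:.0%} sequence-specific -> about {harmful:.0f} harmful mutations")
```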
Evidence is what is expected in one view and not expected in competing views. Therefore the large amount of functional DNA is evidence that organisms were designed.
The following sections outline this data in greater detail.
Rather than using only two categories of function and junk, this article takes a more nuanced approach:
- Sequence-Specific Function: These are nucleotides where a substitution mutation will alter the function or efficiency of a functional element, even if only very slightly. Sequence-specific function is a subset of the DNA that is in functional elements.
- Functional Elements: Functional elements are sequences of DNA that perform a specific function. Some of their nucleotides are sequence-specific functional while others can be swapped with no detectable effect on function. Or in more technical terms: a sequence of nucleotides that "affects the production, processing, or biological activity of a particular nucleic acid or protein, or the binding of other molecules."20
- Junk DNA: These nucleotides can be removed without degrading any functional elements.
Other definitions are often used in the debate over how much DNA is functional, but most are less precise. Because most genome studies focus on the human genome, this article does the same, as the most data is available there.
As defined above, functional elements have DNA where substitution mutations may or may not degrade function, but their removal would break function. Although "most elements in the human genome have not been subject to functional analysis,"12 we have four reasons to think the majority of human DNA is part of functional elements:
DNA is transcribed when it is copied to RNA. In 2013 genome researchers noted:
We found evidence that 85.2% of the genome is transcribed. This result closely agrees with [ENCODE's estimate of] transcription of 83.7% of the genome... we observe an increase in genomic coverage at each lower read threshold implying that even more read depth may reveal yet higher genomic coverage.13
ENCODE is an ongoing project by the United States National Institute of Health (NIH) to find function of the various sequences of human DNA. The project involves hundreds of scientists and hundreds of millions in funding. Prior to the more recent estimate of 85.2% transcription, ENCODE found in 2012:
The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.16
The following chart21 summarizes the results of various functional tests performed by ENCODE in 2012:
The third bar above (all RNA) shows that in 2012, at least 75% of DNA was found being copied to RNA. The tan regions in the first three bars correspond to regions of DNA where about "one transcript copy per cell"21 was detected. Bars four through fourteen show the amount of DNA seen to be participating in other types of activity that implies function.
Many tests for function do not overlap other tests. This chart21 illustrates the intersections of some of the tests. Note that almost all DNA is colored by at least one test of function:
Biochemical evidence: These blue regions show DNA that participates in the activities in the previous chart.
Genetic evidence: This green region shows DNA that causes a noticeable change in function if modified.
Evolutionary evidence: This red circle shows DNA that is the same in both humans and some distantly related mammals.
Protein-coding: This purple circle shows DNA that is directly used to make proteins.
However, ENCODE critics often counter-argue that DNA being copied to RNA and participating in the other biological activities is not enough evidence that DNA is functional. Genomicist Dan Graur is among these critics:
ENCODE ignores the fact that transcription is fundamentally a stochastic process. Some studies even indicate that 90% of the transcripts generated by RNA polymerase II may represent transcriptional noise. In fact, many transcripts generated by transcriptional noise exhibit extensive association with ribosomes and some are even translated
Dan Graur goes on to discuss why ENCODE's tests for histone modification, open chromatin, transcription factor binding, and DNA methylation alone are not necessarily indicators of function. Fair enough. But several additional evidences suggest transcribed DNA is likely functional:
Human development involves a single fertilized egg cell dividing into trillions of cells to eventually produce an adult human. These resulting cells come in many different types such as the various types of bone, muscle, and nerve cells. We see different cell types use different sections of DNA at different stages of development, in precise and reproducible patterns.
For example, Genome researcher John Mattick describes his observations of DNA being copied to RNA in the human brain and elsewhere:
Some are only expressed in the dentate gyrus of the hippocampus, others in particular layers of the cortex, and others in Purkinje cells in the cerebellum.14
[T]he vast majority of the mammalian genome is differentially transcribed in precise cell-specific patterns to produce large numbers of intergenic, interlacing, antisense and intronic non-protein-coding RNAs, which show dynamic regulation in embryonal development, tissue differentiation and disease with even regions superficially described as "gene deserts" expressing specific transcripts in particular cells... Assertions that the observed transcription represents random noise (tacitly or explicitly justified by reference to stochastic ("noisy") firing of known, legitimate promoters in bacteria and yeast), is more opinion than fact and difficult to reconcile with the exquisite precision of differential cell- and tissue-specific transcription in human cells.12
Other genome researchers have noted the same:
[T]he lincRNAs we identified have many characteristics that are inconsistent with noise, including specific regulation of their expression, the presence of conserved sequence and evidence for regulated processing. Furthermore, these lincRNAs are strongly enriched with intergenic sequences that were previously known to be functional in human traits and diseases.13
Independently, the developmental and tissue-specific expression of most ncRNAs provides perhaps the most compelling case for their widespread functionality. A study of ncRNAs expressed in mouse brain by in situ hybridization showed that the majority (623 out of 849) are selectively expressed in discrete functional regions of the brain, sometimes with evidence of specific subcellular localization. Moreover, expression signatures and dynamic regulation of hundreds of ncRNAs has been observed across tissue types and in various developmental systems, from Drosophila embryogenesis to differentiation of mammalian ES cells, T-cells and muscle cells 22 [Mattick is a co-author of this paper]
If "transcription is fundamentally a stochastic process" as Graur argues, we should not expect transcription to be so precisely regulated.
John Mattick observed that among non-coding RNA produced in the human brain, 80% of it was taken to specific locations within cells:
[I]n 80% of the cases where we had sufficient resolution to tell, these RNAs [in the human brain] are trafficked to specific subcellular locations. So this is not some fuzzy random signal: their expression is extremely precise, both in terms of the cell specificity and in terms of subcellular localization.14
John Mattick describes his experience:
In fact almost every time you functionally test a non-coding RNA that looks interesting because it's differentially expressed in one system or another, you get functionally indicative data coming out.14
[W]here tested, these noncoding RNAs usually show evidence of biological function in different developmental and disease contexts, with, by our estimate, hundreds of validated cases already published and many more en route, which is a big enough subset to draw broader conclusions about the likely functionality of the rest.12
But is Mattick merely assuming this DNA is functional without proper investigation? Or perhaps mutations are merely activating nonfunctional DNA, causing the "disease contexts"? No. Rather, disabling this DNA causes loss of function, as can be seen in Mattick's numerous citations:
- "Knockdown of lincRNAs has major consequences on gene expression patterns, comparable to knockdown of well-known ES [embryonic stem] cell regulators."23
- "Decreased capacity for cell migration was also observed for SPRY4-IT1 knockdown." (although knocking it out also negatively affected melanoma cells)24
- "We identified lncRNAs required for neurogenesis. Knockdown studies indicated that loss of any of these lncRNAs blocked neurogenesis"25
- "Knockdown of MEN ε/β expression results in the disruption of nuclear paraspeckles."26
- "Knockdown of Zfas1 in a mammary epithelial cell line resulted in increased cellular proliferation and differentiation."27
It's also notable for Mattick to find this much function because "loss-of-function tests can also be buffered by functional redundancy, such that double or triple disruptions are required for a phenotypic consequence."21 This means that many genes will not produce a noticeable effect when disabled, because when one gene is disabled, cells often automatically enable other genes to do the same job.
Estimating Minimum DNA in Functional Elements
If at least 85% of DNA is copied to RNA, and at least 80% of RNA transcripts are taken to specific locations within a cell, then we could estimate that perhaps 85% * 80% = at least 68% of DNA is within functional elements. The true figure is likely much higher, because these are both lower-bound estimates, not all DNA must be transcribed to have function, and not all RNA must be taken to a specific sub-cellular location to be functional.
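As a minimal sketch, the estimate above is a single multiplication. The variable and function names below are illustrative only, and the calculation assumes the 80% localization rate observed for brain transcripts applies to transcribed DNA generally.

```python
# Lower-bound estimate of the share of DNA inside functional elements,
# multiplying the two fractions quoted in the text.
TRANSCRIBED_FRACTION = 0.85  # DNA copied to RNA (figure from the text)
LOCALIZED_FRACTION = 0.80    # transcripts taken to specific locations (figure from the text)


def functional_element_lower_bound(transcribed: float, localized: float) -> float:
    """Product of the two fractions, treating the second as a rate within the first."""
    return transcribed * localized


estimate = functional_element_lower_bound(TRANSCRIBED_FRACTION, LOCALIZED_FRACTION)
print(f"At least {estimate:.0%} of DNA is within functional elements")
```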
As defined above, DNA that is sequence-specific has nucleotides where a substitution mutation will alter the function or efficiency of a functional element, even if only very slightly. We have several criteria by which to estimate how much DNA is sequence-specific:
We also know that "the nucleic acids that make up RNA connect to each other in very specific ways, which force RNA molecules to twist and loop into a variety of complicated 3D structures."28 This places many of the nucleotides transcribed to RNA in the sequence-specific category.
Below is a diagram15 with 13 examples of various known functional RNAs:
This list shows the number of cross-linked RNA nucleotides (connections where the RNA folds back on itself) per total nucleotides for each of the functional RNAs:
- MAT2A: 26 / 41
- CTLA4: 12 / 15
- FOS: 12 / 15
- GRIA1: 26 / 34
- CACNAD2D1: 22 / 37
- COL5A2: 14 / 21
- UBE2W: 14 / 18
- CLCN5: 28 / 47
- BCL11B: 20 / 24
- MALAT1: 36 / 56
- POP1: 40 / 62
- long int. 19931: 60 / 86
- long int. 16685: 62 / 110
Adding these gives 372 cross-linked nucleotides out of 566 total nucleotides, or 66%. From this we can make a lower-bound estimate of the percentage of nucleotides in functional RNAs that are sequence-specific: since cross-linked RNA nucleotides require a specific sequence, at least 66% of the nucleotides within functional RNAs are sequence-specific, and some of the remaining nucleotides will be sequence-specific as well.
Above we estimated at least 68% of DNA is within functional elements. Multiplying this number by the 66% gives us at least 45% of DNA being sequence-specific. This should be treated as only a very rough estimate; a placeholder until better data is available.
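The tally and the multiplication can be checked with a short Python sketch. The counts are copied from the list above; the dictionary and variable names are illustrative, and the 68% figure is the functional-element estimate from earlier in the article.

```python
# (cross-linked, total) nucleotide counts for each functional RNA in the list
rna_counts = {
    "MAT2A": (26, 41),
    "CTLA4": (12, 15),
    "FOS": (12, 15),
    "GRIA1": (26, 34),
    "CACNAD2D1": (22, 37),
    "COL5A2": (14, 21),
    "UBE2W": (14, 18),
    "CLCN5": (28, 47),
    "BCL11B": (20, 24),
    "MALAT1": (36, 56),
    "POP1": (40, 62),
    "long int. 19931": (60, 86),
    "long int. 16685": (62, 110),
}

cross_linked = sum(linked for linked, _ in rna_counts.values())
total = sum(length for _, length in rna_counts.values())
cross_linked_fraction = cross_linked / total

FUNCTIONAL_ELEMENT_FRACTION = 0.68  # lower-bound estimate from earlier in the article
sequence_specific = FUNCTIONAL_ELEMENT_FRACTION * cross_linked_fraction

print(f"{cross_linked} of {total} nucleotides cross-linked ({cross_linked_fraction:.0%})")
print(f"Rough lower bound on sequence-specific DNA: {sequence_specific:.0%}")
```

Using the unrounded ratio instead of the rounded 66% changes the result only in the first decimal place.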
The diagram above comes from a study looking at conserved DNA to find functional RNAs. Some may argue this introduces a bias, because RNAs with less specific sequences would be less likely to show up in a conservation study. However, compare those above to a functional RNA such as MiRNA-95, where "its conservation is limited to the primate lineage and a few other higher mammals"29 but 60 out of its 79 nucleotides (76%) are cross-linked.30
ENCODE estimated in 2012 that 20% of the genome is made of exons or has a specific sequence that proteins use to determine where to latch on to DNA:
[E]ven with our most conservative estimate of functional elements (8.5% of putative DNA/protein binding regions) and assuming that we have already sampled half of the elements from our transcription factor and cell-type diversity, one would estimate that at a minimum 20% (17% from protein binding and 2.9% protein coding gene exons) of the genome participates in these specific functions, with the likely figure significantly higher.16
Because this paper previously mentions primate-specific DNA, some critics claim the 20% only applies to the regions of human DNA shared with other primates. However, leading ENCODE researcher Ewan Birney references the 20% in regard to the entire human genome:
A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20%.31
A review in 2012 looked at 920 studies involving the genomes of thousands of people17. They found that only 4.9% of function altering mutations occurred within protein-coding DNA (red slice in figure B):
Likewise another broad research review found only 4% of function altering mutations (105/2593) were within coding regions.18
From this we can estimate what percentage of DNA is subject to deleterious mutations.
Within coding regions, a 2011 study18 found 79 non-synonymous and 26 synonymous function-altering mutations. About 30% of protein-coding mutations are synonymous. If synonymous nucleotides were as functional as non-synonymous nucleotides, we should expect 30% * (26 + 79) = 31.5 of the function-altering mutations to be synonymous. But only 26 of them are, suggesting that synonymous sites are only 26 / 31.5 = 83% as functional as non-synonymous sites.
At least 75%32 of amino-acid altering mutations within exons are deleterious, which implies that 75% * 83% = 62.25% of mutations at synonymous sites are deleterious. Combining the synonymous and non-synonymous rates gives 75% * 70% + 62% * 30% = 71.1% of mutations within exons being deleterious, although one bacterial study estimated that as many as 95% of mutations within exons are deleterious.8
If 2% of DNA is protein coding, and only 4.9% of harmful mutations fall within protein coding DNA, that means non-coding DNA is 2% / 4.9% = 41% as specific in sequence as coding DNA. Coding DNA is 71% sequence-specific, so about 71% * 41% = 29% of non-coding DNA nucleotides are sequence-specific.
Putting it all together: 2% * 71% + 98% * 29% = 30%. That means genome-wide association studies (GWAS) suggest at least 30% of DNA is sequence-specific.
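The chain of steps in this calculation can be laid out as a short Python sketch. It follows the article's own method step by step, using only figures quoted in the text; the variable names are illustrative, and intermediate values are carried unrounded, so they differ slightly from the rounded percentages in the prose.

```python
# Step 1: how functional are synonymous sites relative to non-synonymous sites?
nonsyn_hits = 79   # non-synonymous function-altering mutations
syn_hits = 26      # synonymous function-altering mutations
syn_share = 0.30   # share of protein-coding mutations that are synonymous

expected_syn_hits = syn_share * (nonsyn_hits + syn_hits)  # about 31.5
syn_relative = syn_hits / expected_syn_hits               # about 0.83

# Step 2: deleterious fraction within exons
nonsyn_deleterious = 0.75
syn_deleterious = nonsyn_deleterious * syn_relative
coding_specific = (nonsyn_deleterious * (1 - syn_share)
                   + syn_deleterious * syn_share)          # about 0.71

# Step 3: scale down for non-coding DNA, as the text does
coding_fraction = 0.02     # share of the genome that is protein coding
coding_hit_share = 0.049   # share of function-altering mutations in coding DNA
noncoding_relative = coding_fraction / coding_hit_share    # about 0.41
noncoding_specific = coding_specific * noncoding_relative  # about 0.29

# Step 4: genome-wide weighted average
genome_specific = (coding_fraction * coding_specific
                   + (1 - coding_fraction) * noncoding_specific)

print(f"Coding DNA sequence-specific: {coding_specific:.1%}")
print(f"Non-coding DNA sequence-specific: {noncoding_specific:.1%}")
print(f"Genome-wide lower bound: {genome_specific:.1%}")
```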
This is an under-estimate because mutations within non-coding DNA are likely to have a smaller effect than mutations in coding DNA, and are thus harder to detect.33 Thus it is likely that even fewer than 4-5% of function-altering mutations are within coding DNA.
Some argue these numbers are invalid because genome-wide association studies may be showing false positives due to statistical noise. However, the mutations identified as function altering in the 2012 study above usually map to known regulatory features:
Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS or are in complete linkage disequilibrium with SNPs in a nearby DHS.17
Other researchers have noted the same:
Of the 9184 GASs, 8733 (95.09%) and 5853 (63.73%) were mapped to at least one regulatory features, including promoter or enhancer, regulatory motifs, DNase footprinting sites, expression quantitative trait loci (eQTL) and conserved sequences, using the HaploReg and RegulomeDBdatabases, respectively. Moreover, many of these GASs were predicted to be associated with multiple regulatory features, suggesting their dynamic regulatory functions.34
Why do only a tiny fraction of nucleotides map to function discovered in genome-wide association studies?
- Human genomes are nearly identical to one another. Genome-wide association studies can only probe function where variation exists, and not enough genomes are yet available.
- Where variation does exist, multiple people with the same variation must be surveyed to rule out effects caused by environment, diet, or exercise.
- Most known traits have not yet been tested in genome-wide association studies.
- Many non-coding alleles exist in high copy numbers. A mutation disrupting only one of them would likely not show a phenotypic effect that would show up in a GWAS.
As ENCODE researchers explain: "At present, significantly associated loci explain only a small fraction of the estimated trait heritability, suggesting that a vast number of additional loci with smaller effects remain to be discovered"21
Assuming common descent is true: If two distantly related organisms have some of the same DNA, it is likely functional. Because otherwise over tens of millions of years, random mutations in each species would have changed the DNA enough that it is no longer recognizable as similar. In other words: whatever mutated that DNA didn't survive. Such DNA is called conserved DNA, or DNA that is under selection.
However, if organisms are designed and common descent is false, having sequences of DNA in common could also imply common design and therefore function. But without the assumption of common ancestry, shared sequences between not just distantly related, but all organisms would be used. Thus even the 95-96% of DNA shared between humans and chimpanzees35 would imply that 95-96% of DNA is sequence-specific.
In 2014, ENCODE researchers estimated that 16 to 26% of DNA (conserved+lineage specific) is preserved by natural selection:
[E]stimates that incorporate alternate references, shape-based constraint, evolutionary turnover, or lineage-specific constraint each suggests roughly two to three times more constraint than previously (12–15%), and their union might be even larger as they each correct different aspects of alignment-based excess constraint... Although still weakly powered, human population studies suggest that an additional 4 - 11% of the genome may be under lineage-specific constraint after specifically excluding protein-coding regions, and these numbers may also increase as our ability to detect human constraint increases with additional human genomes...21
Another study estimated at least 15.6-30% of DNA is conserved when DNA's resulting RNA structure is also taken into account:
Our findings provide an additional layer of support for previous reports advancing that >20% of the human genome is subjected to evolutionary selection...
the RNA structure predictions we report using conservative thresholds are likely to span >13.6% of the human genome we report. This number is probably a substantial underestimate... A less conservative estimate would place this ratio somewhere above 20% from the reported sensitivities measured from native RFAM alignments and over 30% from the observed sensitivities derived from sequence-based realignment of RFAM data...
the majority (87.8%) of the ECS predictions reported herein lie outside annotated sequence-constrained elements...
we can postulate a revised lower bound of functional sequence in the human genome at ~15.6%.19
Some studies of conserved mammal DNA estimate lower amounts, but they do not take shape and RNA structure conservation into account,36 do not include lineage-specific conservation, or they were done before more mammal genomes were available for study.
Furthermore, this is not to say that natural selection can actually maintain 16-30% of the DNA in a mammal. With this much function, harmful mutations would arrive faster than natural selection can remove them, as is described below in the Genetic Load section.
The above sections use four different criteria to very roughly estimate the lower-bound amount of sequence-specific DNA:
- From RNA structure + DNA in functional elements: At least 45%
- From protein binding + exons: At least 20%
- From disease + trait association: At least 30%
- From conservation: At least 16 to 30%
Since these are each under-estimates, and partially non-overlapping, the true amount of sequence-specific DNA is likely greater. It is therefore very unlikely that all four are wrong and the amount of sequence-specific DNA is less than these numbers.
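One way to make the combination of these four figures explicit is the sketch below. If each figure is a valid lower bound, the true amount is at least the largest of them; if the four categories did not overlap at all, the total would be their sum, capped at 100%. The labels and variable names are illustrative, and the low end (16%) is used for the conservation range.

```python
# Lower-bound estimates of sequence-specific DNA from the four criteria
lower_bounds = {
    "RNA structure + functional elements": 0.45,
    "protein binding + exons": 0.20,
    "disease + trait association": 0.30,
    "conservation (low end of 16-30%)": 0.16,
}

# If every estimate is a valid lower bound, the largest one is the floor.
floor = max(lower_bounds.values())

# If the categories were completely non-overlapping, they would simply add up
# (capped at 100%). Partial overlap puts the combined figure between the two.
ceiling_if_disjoint = min(1.0, sum(lower_bounds.values()))

print(f"Combined lower bound is between {floor:.0%} and {ceiling_if_disjoint:.0%}")
```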
Intelligent Design (ID) proponents predicted that most DNA would turn out to be functional. However evolutionary theory both predicts and requires that most DNA in complex organisms (with large genomes and low reproductive rates) will be junk. Two main reasons:
The neutral theory of evolution claims that most of the DNA differences between complex organisms arose by chance, and not because natural selection favored organisms having those mutations. The reasoning is as follows:
- In typical populations of complex organisms, natural selection can only help spread or remove mutations that have a strong effect. Otherwise, random chance is the dominant factor in deciding which organisms reproduce and pass on their genes.
- Since most mutations have very little effect on an organism,37 8 neutral theory is a mathematical reality.38 "[T]he revolution is over. Neutral and nearly neutral theory won," as biologist PZ Myers described.39
Only about 5 to 10% of DNA is shared between humans and more distantly related mammals such as horses, dogs, and mice.40 Therefore if these animals all evolved from a common ancestor, given neutral theory, the large majority of DNA not shared by these mammals would have come about by chance and not natural selection. Since DNA that exists only by chance will have a random sequence, it is therefore highly unlikely to be functional. Evolutionary biologist and intelligent design critic Dan Graur explains:
...there exists a misconception among functional genomicists that the evolutionary process can produce a genome that is mostly functional.41
If the human genome is indeed devoid of junk DNA as implied by the ENCODE project, then a long, undirected evolutionary process cannot explain the human genome.4
This is not to say that mutations in almost all DNA would have no effect. The whole purpose of this article is to argue the opposite.
If evolutionary theory is true, then most DNA in complex organisms must have no function because a highly neutral evolutionary process could create nothing better. Therefore large amounts of functional DNA are good evidence against evolutionary theory.
Genetic load (also called mutational load) is the average number of deleterious (harmful) mutations per organism in a population. If the genetic load is too high, the population will not survive.
Organisms with more DNA generally have more mutations, since the number of errors increases as more is copied. If most of the DNA in large genome organisms (e.g. mammals) is functional, then these mutations will usually break important functions. Humans get about 100 mutations per generation.7
But if most DNA does nothing, or does not have specific information, then most mutations will be harmless. Therefore because of this genetic load problem, evolutionary theory both predicts and requires that only a small amount of DNA in large-genome organisms will be functional. Writing in 1972, Susumu Ohno was among the early geneticists who recognized this problem:
The moment we acquire 10^5 gene loci, the overall deleterious mutation rate per generation becomes 1.0 which appears to represent an unbearably heavy genetic load... Even if an allowance is made for the existence in multiplicates of certain genes, it is still concluded that at the most, only 6% of our DNA base sequences is utilized as genes9
By "genes," Ohno is referring to functional elements, not only protein coding genes.
The fact that a high deleterious mutation rate leads to decline has been tested rigorously using computer simulations. In one study with strong natural selection and only 10 harmful mutations per generation, fitness continued to decline perpetually.10
If evolution cannot even maintain these large amounts of functional DNA in complex organisms, then it could not have created them. Therefore large amounts of functional DNA are good evidence that evolutionary theory is incorrect.
This section addresses various objections to the idea that most of the human genome is functional.
This objection is addressed in a companion article:
The C-Value Paradox
Some organisms have many times more DNA than others of similar complexity. Some argue that therefore most DNA in most complex organisms must be junk, and thus are not designed. This article provides data suggesting otherwise.
In 2014, when a team led by geneticist Jef Boeke built the first synthetic yeast chromosome, they left out 20% of the DNA. The yeast still functioned just fine:
[T]here are over 50,000 base pairs that were either deleted, inserted or changed in that chromosome of 250,000 base pairs, and it works.42
However, like many other organisms, the yeast genome is known to be full of redundant fallback systems that kick in when primary systems fail. Among yeast protein coding genes, 80% are redundant. Physiologist Dennis Noble explains in a recent lecture, at 16:27:
"Simply by knocking genes out we don't necessarily reveal function, because the network may buffer what is happening. So you may need to do two knockouts or even three before you finally get through to the phenotype. ... If one network doesn't succeed in producing a component necessary to the functioning of the cell and the organism, then another network is used instead. So most knockouts and mutations are buffered by the network.... [at 19:40] Is this an unusual result, ... or is it general? This study went through all 6000 genes in the organism yeast. knocking them out one by one. 80% of the knockouts were silent. So this physiological process of buffering against gene change is general. It's usual in fact. Now that doesn't mean to say that these proteins that are made as a consequence of gene templates for them don't have a function. Of course they do. If you stress the organism you can reveal the function. .. If the organism can't make product X by mechanism A, it makes it by mechanism B."43
ENCODE researchers also remarked in 2014 on how redundancy can allow functional DNA to be disabled or removed without a noticeable effect:
Loss-of-function tests can also be buffered by functional redundancy, such that double or triple disruptions are required for a phenotypic consequence.21
Therefore removing a functional but redundant section of DNA will not have an effect on the phenotype.
Another possibility is that the sequences removed were once functional but had previously been disabled by mutation in wild-type yeast.
This objection depends on the assumption that all organisms were created by an unguided process of mutation and selection. In other words, if evolution is false then this objection is meaningless. An example of such reasoning:
- Humans and mice have about 10% of their DNA in common (conserved).
- Humans and mice evolved from a common ancestor that lived about 80 million years ago.
- Since evolution is a very slow process,5 6 it could not have created much additional function during those 80 million years.
- Therefore not much more than 10% of human and mouse DNA is functional.
Functional genome researchers recognize the fault of this reasoning:
[R]elative conservation imputes function, but lack of (discernable) conservation imputes nothing... differential expression (including extensive alternative splicing) of RNAs is a far more accurate guide to the functional content of the human genome than logically circular assessments of sequence conservation, or lack thereof.12
Functional sequences include but are not limited to sequences under purifying selection at the nucleotide level44
Since several known functional long ncRNAs, such as Xist and Air, are poorly conserved, it is evident that relative lack of conservation does not necessarily signify lack of function.22
A study in 2013 showed that it's very common for transcription factor proteins to bind to random strings of DNA45, suggesting it was careless of ENCODE to use protein binding to detect functional DNA. Biochemist and ID critic Larry Moran similarly argued: "We also expect that a lot more of the genome will be transcribed on rare occasions just because of spurious (accidental) transcription initiation."46
However a study in 2017 looked at places in DNA where proteins latch on, across 75 organisms including humans, mice, fruit flies, and yeast:
Using in vitro measurements of binding affinities for a large collection of DNA binding proteins, in multiple species, we detect a significant global avoidance of weak binding sites in genomes.47
This is significant because:
Most DNA binding proteins recognize degenerate patterns; i.e., they can bind strongly to tens or hundreds of different possible words and weakly to thousands or more.47
If proteins bind to DNA largely at random then we would expect to see mostly weak binding. But we see mostly strong binding. Therefore most DNA-protein binding indicates function.
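The first premise above can be illustrated with a small calculation. This is only a sketch: the word counts are hypothetical values picked to match the quoted orders of magnitude (they are not taken from the study), the function name is made up for illustration, and it assumes every DNA word is equally likely to occur by chance.

```python
# Illustrative only: the counts below are assumptions chosen to match the quoted
# orders of magnitude ("tens or hundreds" strong, "thousands or more" weak).
# They are not measurements from the cited study.

def expected_weak_fraction(strong_words: int, weak_words: int) -> float:
    """Share of chance matches expected to be weak sites, assuming every
    word is equally likely to occur in a random sequence."""
    total = strong_words + weak_words
    if total <= 0:
        raise ValueError("need at least one binding word")
    return weak_words / total


# Hypothetical protein: 100 strong words and 2,000 weak words.
fraction = expected_weak_fraction(strong_words=100, weak_words=2000)
print(f"Expected share of weak sites under random binding: {fraction:.1%}")
```

Under these assumed counts, weak sites would dominate any binding that occurs purely by chance, which is the expectation the paragraph above compares against.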
However, protein binding and the transcription of at least 85% of DNA into RNA do not stand alone as evidence of function. As mentioned above:
- DNA is usually copied to RNA in precise patterns depending on cell type and developmental stage.
- These RNAs are usually transported to specific locations within cells.
- When tested, these RNAs usually affect development or disease.
- Genome-wide association studies (GWAS) find ~95% of disease/trait associated alleles are outside protein coding DNA.
These four points are incompatible with transcription being the result of random, promiscuous binding. Likewise, non-coding DNA researcher John Mattick comments:
Assertions that the observed transcription represents random noise... is more opinion than fact and difficult to reconcile with the exquisite precision of differential cell- and tissue-specific transcription in human cells.12
Some argue that in Kellis et al. 2014,21 the ENCODE team recanted their 2012 assertion that 80% of DNA is functional. However, the paper merely lists various, often non-overlapping, lower-bound tests for function, some of which give estimates below 80%. For example, they write:
Genome-wide biochemical studies, including recent reports from ENCODE, have revealed pervasive activity over an unexpectedly large fraction of the genome, including noncoding and nonconserved regions and repeat elements. Such results greatly increase upper bound estimates of candidate functional sequences.21
Likewise figure 2 in the paper shows the large majority of DNA covered with some evidence of function. The authors then contrast this with the evidence that if evolutionary theory is true, then most DNA cannot be functional. But they do not try to reconcile these two contradictory facts. What does the ENCODE team actually believe about function? Biochemist and junk DNA proponent Larry Moran commented on a 2015 ENCODE workshop:
The overwhelming impression you get from looking at the presentation is that all the researchers believe all their data is real and reflects biological function in some way or another.48
About 44% of human DNA is made up of transposons and transposon-like repetitive DNA elements.49 Some have argued that hundreds of millions of years ago, the genomes of mammal ancestors were much smaller. Over time, transposons duplicated themselves throughout mammal genomes. Those that created copies of themselves within frequently used regions of DNA copied themselves faster than those that didn't. On this view, transposons are transcribed in developmental-stage and cell-type specific patterns only because the DNA regions where they inserted themselves were already specifically regulated.
Yet this idea fails because transcription often begins within transposons themselves. The transposons are often providing the transcription start sites, not hijacking them:
up to 30% of human and mouse transcription start sites (TSSs) are located in transposable elements and that they exhibit clear tissue-specific and developmental stage–restricted expression patterns50
TEs, and in particular ERVs, have contributed hundreds of thousands of novel regulatory elements to the primate lineage.51
Moreover, evolutionarily successful transposons would only need to integrate themselves into regions actively transcribed in germline cells (since that DNA is inherited), while integrations into actively transcribed regions in other cell types wouldn't matter. Yet we see these sequences being transcribed not primarily in regions active in germline cells, but in all cell types. Thus this is also inconsistent with most transposon-like elements in our genomes originating from selfishly copying themselves.
This idea of transposon co-option is also incompatible with other observations discussed above: transcripts are usually taken to specific sub-cellular locations, and when tested they usually affect development or disease.
A future article will give an overview of the known functions of various types of DNA elements. For now:
The full regulatory potential of transposon-derived sequences is unknown, but appears to include a range of functions. They are largely transcribed in a regulated manner, and feature promoters to drive specific expression. Important roles have also been demonstrated for Alu-derived RNAs in the regulation of RNA polymerase II during heat shock and the regulation of alternative splicing, translation and mRNA stability. Similarly, it has been shown that transcripts derived from retrotransposons can regulate chromatin structure in transposon-rich regions such as centromeres and neocentromeres, that LINE L1 retrotransposition can mediate somatic mosaicism in neuronal precursor cells, and that the transcription of inverted repeats that serve as boundary elements can influence gene expression. Moreover, a recent study has shown that 6–30% of cap-selected mouse and human RNA transcripts initiate within repetitive elements, that approximately 250,000 of these transcripts are generally tissue specific, and that transposon-derived sequences located immediately upstream of protein-coding loci frequently function as alternative promoters and/or express non-coding RNAs, identifying some 23,000 candidate regulatory regions derived from retrotransposons. In addition, repetitive sequences may be included as parts of larger transcripts, including many ncRNAs whose functions have been demonstrated, such as Xist, Air, Kcnq1ot1, BORG, DISC2, NTT and Xlisrts, suggesting that these elements may be functional modules common among ncRNAs as well as mRNAs.22
These objections are less common, so they are covered more briefly:
"Genome size is negatively correlated with effective population size:"
Dan Graur raised this objection in 2013 on slide 46 of a presentation,4 citing a 2005 study of ray-finned fish.52 This correlation would be predicted by selfish gene theory because "evolution can only produce a genome devoid of 'junk' if and only if the effective population size is huge and the deleterious effects of increasing genome size are considerable."41 However, this argument had already been addressed by T. Ryan Gregory in 2008, who showed that fish genome sizes would have been set long before fish were divided into their current population sizes:
it is apparent that genomes reached their current sizes in most fishes long before contemporary microsatellite heterozygosities were shaped53
"ENCODE used transcriptionally abnormal cell lines:"
In 2013 Dan Graur argued "ENCODE used almost exclusively pluripotent stem cells and cancer cells, which are known as transcriptionally permissive environments."41 But this misconception was corrected by an ENCODE researcher:
The vast majority of the RNAseq was performed in GM12878 or K562 (but we looked at a lot of different cell types, see here... That assertion is completely false and I don't want that misinformation being spread... We used K562 because it is transcriptionally normal.54
Because of such problems, one researcher remarked of Graur's 2013 paper:
Graur wrote such a negative paper that it was hard to read... Graur’s criticism is so over-the-top that it’s not worthy.55
"The Stat3 transcription factor indicates promiscuous protein binding:"
Dan Graur also made this argument in 2013.41 The problem here is that Graur merely assumes the remaining possible binding sites were non-functional, even though they were never tested. Among the 14 that were tested, 12 (86% of them) were functional. This is particularly notable since Stat3 is less specific than other transcription factors in what it binds to:
we scanned the whole mouse genome sequence finding a total of 1,355,858 putative binding sites. Such a high number is not unexpected taking into account the loose sequence requirements for Stat3 binding.56
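The 86% figure above is simply the share of tested sites that were found functional. A minimal check of that arithmetic, using only the two counts given in the text (the variable names are illustrative):

```python
# Counts taken from the paragraph above: 14 sites tested, 12 found functional.
tested_sites = 14
functional_sites = 12

# Share of the tested sites that were functional.
functional_fraction = functional_sites / tested_sites
print(f"{functional_sites} of {tested_sites} tested sites functional: "
      f"{functional_fraction:.0%}")
```

The result describes the 14 tested sites only; the remaining putative sites were not tested either way.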
In Why repetitive DNA is essential to genome function (Biol Rev, 2005), James Shapiro and Richard Sternberg identify over 80 roles for repetitive sequences.
Micromanagers: new classes of RNAs emerge as key players in the brain (Science News, 2008) provides an overview of many known functions of non-coding RNA.
Reddit AskScience interview with the ENCODE Team:
One researcher wrote:
That said, I can’t help but notice a trend: over time, 'junk DNA' is disappearing. Good riddance: this is just a term for DNA that we don’t have any guesses about its function. The more we learn about the genome, the more functions we uncover, thus fewer unknowns and a more seemingly 'useful' genome. Where will it end? I have no idea, but many people are looking (though more are always needed!).
Another researcher wrote:
It's been my experience that anytime we have to quibble over a soft definition, we are missing a larger point. A better question would be, what percentage of the DNA sequence could be removed with ABSOLUTELY NO effect? I would guess that it's very, very small.
That comment was later deleted. Biochemist and prominent junk DNA proponent Larry Moran also became involved in the discussion.
In this thread from 2013, an ENCODE scientist defends the ENCODE conclusions in a debate with an evolutionary biologist.
Five Things You Should Know if you Want to Participate in the Junk DNA Debate, 2013. Biochemist and ID critic Larry Moran outlines why he thinks most DNA is true junk. Most of his reasons are based on the idea that evolution couldn't have produced anything better.
In December of 2014, well known genomicist and ID critic Dan Graur (cited frequently here) wrote a blog post about an earlier version of this article.
Fine, Steven. "Pufferfish at the Audubon Aquarium of the Americas." Wikipedia. 2016. Wikipedia says the image is licensed under the Creative Commons Attribution-Share Alike 4.0 International license. ↩ ↩
Dodson, Edward O. "Note on the Cost of Natural Selection." The American Naturalist. 1962. JBS Haldane was an evolutionary biologist well known for his work in developing the modern evolutionary synthesis. Calculating the implications of Haldane's model, Dodson explains: "Haldane (1957) has published calculations which indicate that it takes no less than 300 generations to replace a gene by ordinary selection pressures, and that this evolutionary process cannot be speeded up by simultaneous selection for more than one gene....we arrive at a maximum of something over 200 gene substitutions over the past million years for the genus Homo." Since on average the mutation rate is the fixation rate, 300 generations with about 100 mutations per generation would give 30,000 neutral mutations per 1 beneficial mutation that fixes. 29,999 / 30,000 is 99.997% of fixed mutations being neutral.
Mirrors: Local screenshot ↩ ↩ ↩
Moran, Larry. Comment on "Breaking news: Creationist Vincent Torley lies and moves goalposts." Sandwalk Blog. 2014. Joe Felsenstein is a well known population geneticist who has published a criticism of Haldane's limit. When he and biochemist Larry Moran (both Intelligent Design critics) were asked to estimate the number of beneficial versus neutral differences between humans and chimps, they replied: "Updated numbers suggest 44 million point mutations and something like 2 million insertions/deletions for a grand total of 46 million mutations. We don't know how many of those were beneficial (adaptive) leading to ways in which modern chimps are better adapted than the common ancestor. (Same for humans.) My guess would be only a few thousand in each lineage." With about 23 million mutations per lineage (half of 46 million), 1 - 3,000 / 23,000,000 is 99.987% of fixed mutations being neutral. ↩ ↩ ↩
Lind, Peter A. et al. "Mutational Robustness of Ribosomal Protein Genes." Science. 2010. The authors tested two ribosomal proteins in Salmonella typhimurium and found: "most mutations (120 out of 126) are weakly deleterious and the remaining ones are potentially neutral." The authors did not detect any beneficial mutations. They suggest their study is more sensitive than others because "Mutagenesis studies of single proteins rarely include the use of high-sensitivity assays of fitness and analysis of synonymous substitutions (SOM references)... it is conceivable that the relatively high frequencies of apparently neutral mutations observed in certain experimental systems are mainly a consequence of the limited sensitivity of the assays and that the proportion of deleterious mutations is very high even when synonymous substitutions are included." ↩ ↩ ↩
Gibson et al. "Can Purifying Natural Selection Preserve Biological Information?" World Scientific. 2011. ↩ ↩
Hangauer et al. "Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs." PLoS Genetics. 2013. Mirrors: Archive.is | Local excerpt with notes ↩ ↩ ↩ ↩
"New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes." Genome Research. 2011. See the diagrams of functional RNAs in figure 5. 66% of nucleotides are cross linked and thus at least that much has a specific sequence.
Mirrors: Archive.is ↩ ↩
The ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in the human genome." Nature. 2012. ↩ ↩ ↩
Maurano, Matthew T. "Supplementary Materials for Systematic Localization of Common Disease Associated Variation in Regulatory DNA." Science. 2012. See figure S1 for the pie chart of SNP locations. The caption states "only 4.9% of GWAS SNPs lie in coding sequence." The study says "coding regions were defined by the CCDS Project (downloaded from the UCSC genome browser at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ccdsGene.txt.gz on March 5, 2011)" ↩ ↩ ↩
Freedman, Matthew L. "Principles for the post-GWAS functional characterization of cancer risk loci." Nature Genetics. 2011. See table 1 in the paper. Note that a pre-print version of this study listing Alvaro N.A. Monteiro as the lead author does not include the table.
Mirrors: local screenshot | Archive.is ↩ ↩ ↩
Behe, Michael J. "Experimental evolution, loss-of-function mutations, and 'The first rule of adaptive evolution.'" Quarterly Review of Biology. 2010. Michael Behe defines a "functional coding element" as "a discrete but not necessarily contiguous region of a gene that, by means of its nucleotide sequence, influences the production, processing, or biological activity of a particular nucleic acid or protein, or its specific binding to another molecule."
Mirrors: Michael Behe's Lehigh U. Faculty Page | Scribd | Archive.is ↩
Kellis et al. "Defining functional DNA elements in the human genome." PNAS. 2014. Figures #1 and #2 show detected functional DNA by different methodologies.
Mirrors: Archive.is | Local excerpt with notes ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩
Marcel E. Dinger et al. "Pervasive transcription of the eukaryotic genome: functional indices and conceptual implications." Briefings in Functional Genomics. 2009. ↩ ↩ ↩
Mitchell Guttman et al. "lincRNAs act in the circuitry controlling pluripotency and differentiation." Nature. 2011. ↩
Divya Khaitan et al. "The Melanoma‐Upregulated Long Noncoding RNA SPRY4-IT1 Modulates Apoptosis and Invasion." Molecular and Cellular Pathobiology. 2011. ↩
Hongjae Sunwoo et al. "MEN ε/β nuclear-retained non-coding RNAs are up-regulated upon muscle differentiation and are essential components of paraspeckles." Genome Research. 2009. ↩
Marjan E. Askarian-Amiri et al. "SNORD-host RNA Zfas1 is a regulator of mammary development and a potential marker for breast cancer." RNA. 2011. ↩
"New insight into the human genome through the lens of evolution." Garvan Institute. 2013. Mirrors: Archive.is | Archive.org ↩
Frankel, Lisa B et al. "A non-conserved miRNA regulates lysosomal function and impacts on a human lysosomal storage disorder." Nature. 2014. Mirrors: Archive.is ↩
Axe, Doug. "Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds." J. Mol Biol. 2004. Doug Axe (an intelligent design proponent) tested a protein with degraded function to see how many amino acids could be replaced: "about one in four random single-residue [amino acid] changes are functionally neutral. The proportion would be somewhat lower under conditions requiring a higher level of function." ↩
Mattick, John S. "The Genetic Signatures of Noncoding RNAs." PLOS Genetics. 2009. The author states: "Most variations in regulatory sequences produce relatively subtle phenotypic changes, in contrast to mutations in protein-coding sequences that frequently cause catastrophic component failure."
Mirrors: Archive.is ↩
Chen, Geng. "Re-annotation of presumed noncoding disease/trait-associated genetic variants by integrative analyses." Scientific Reports. 2015. Mirrors: Archive.is ↩
Preuss, Todd M. "Human brain evolution: From gene discovery to phenotype discovery." PNAS. 2012. ↩
Eyre-Walker, Adam et al. "The distribution of fitness effects of new mutations." Nature. 2007. The authors state: "relatively few amino-acid-changing mutations have effects of greater than 10% in humans, and that most have effects in the range of 10⁻³ and 10⁻¹." Mutations in synonymous regions and non-protein coding genes would have even less of an effect.
Mirrors: Archive.org | Archive.is ↩
Lynch, Michael. "The Origins of Eukaryotic Gene Structure." Mol Bio Evol. 2006. Michael Lynch is a highly respected population geneticist. He writes: "The neutral (or nearly neutral) theory that emerged from this work still enjoys a central place in the field of molecular evolution" and "it is difficult to reject the hypothesis that the basic embellishments of the eukaryotic gene originated largely as a consequence of nonadaptive processes operating contrary to the expected direction of natural selection."
Mirrors: Archive.is ↩
Myers, PZ. "The state of modern evolutionary theory may not be what you think it is." Pharyngula Blog. 2014. PZ Myers is a developmental biologist well known for his criticisms of Intelligent Design. Myers writes: "the revolution is over. Neutral and nearly neutral theory won."
Mirrors: Archive.is ↩
Meader et al. "Massive turnover of functional sequence in human and other mammalian genomes." Genome Res. 2010. Figure 2 shows how much DNA (of 3 billion base pairs total) is shared between humans (homo), macaca monkeys, mice (mus), rats (rattus), horses (equus), cows (bos), and dogs (canis).
Mirrors: Archive.is ↩
Graur, Dan. "On the Immortality of Television Sets: 'Function' in the human genome according to the evolution-free gospel of ENCODE." Genome Biology and Evolution. 2013. Mirrors: Archive.org | Local excerpt with notes ↩ ↩ ↩ ↩
Niu, Deng-ke. "Can ENCODE tell us how much junk DNA we carry in our genome?" Biochemical and Biophysical Research Communications. 2013. ↩
White, Michael A. "Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks." PNAS. 2013. The lead author summarizes the paper on his blog. (Archive.is). He also tweeted: "One of our Nature reviewers said this paper should not be published in any journal, ever." ↩
Qian, Long and Edo Kussel. "Genome-Wide Motif Statistics are Shaped by DNA Binding Proteins over Evolutionary Time Scales." Physical Review X. 2016. ↩ ↩
Mills, Ryan E. et al. "Which transposable elements are active in the human genome?" Trends in Genetics. 2007. ↩
Fort, Alexandre et al. "Deep transcriptome profiling of mammalian stem cells supports a regulatory role for retrotransposons in pluripotency maintenance." Nature Genetics. 2013. Mirrors: Ohio.edu | Local screenshot ↩
Jacques, Pierre-Étienne et al. "The Majority of Primate-Specific Regulatory Sequences Are Derived from Transposable Elements." PLOS Genetics. 2013. ↩
Yi, S. "Genome size is negatively correlated with effective population size in ray-finned fish." Trends Genet. 2005. ↩
Gregory, T. Ryan. "Population size and genome size in fishes: a closer look." Genome. 2008. ↩
Reddit user west_of_everywhere. "Comment on How come there's a Amoeba with 200 times larger gene set than humans?" Reddit. 2013. ↩
Vallania, Francesco et al. "Genome-wide discovery of functional transcription factor binding sites by comparative genomics: The case of Stat3." PNAS. 2009. ↩