When I was young, it was subtly noticeable that my brother’s eye colour, hair colour and height differed from mine and from our parents’. I used to joke that he was adopted, because from an early age, long before we understand any genetics, we intuitively expect to look like our parents and share traits with them. Since our first glimpse of the human genome in 2001, we have assumed that genotype (DNA) drives phenotype (observable traits), and that because we inherit our parents’ DNA, we must inherit their traits as well. How strongly this holds for a given trait is measured by its heritability: the proportion of the variation in a trait across a population that can be explained by genetic variation in that population. This proportion can be estimated without understanding the mechanism of inheritance (i.e. without knowing what the underlying genetic variation is). For example, in twin cohorts, how closely identical twins resemble each other in a trait, compared with fraternal twins, gives an estimate of that trait’s heritability; for height the estimate is 80-90%. But the question remains: if we sequenced the human genome almost 20 years ago, and we know which traits and, more importantly, which diseases are highly heritable, such as schizophrenia (H = 80%), why can’t we yet explain the genetic cause of these diseases?
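
To make the definition of heritability above concrete, here is a minimal sketch in Python. It assumes the simplest possible model, in which a trait is just a genetic value plus independent environmental noise, so that heritability is the genetic variance divided by the total trait variance. The function name, the standard deviations and the population size are all made up for illustration and are not taken from any real study.

```python
import random
import statistics


def simulate_heritability(n_people=20000, genetic_sd=2.0, environment_sd=1.0, seed=0):
    """Estimate heritability as Var(genetic) / Var(phenotype) in a toy population.

    Assumption: phenotype = genetic value + independent environmental noise.
    """
    rng = random.Random(seed)
    genetic = [rng.gauss(0.0, genetic_sd) for _ in range(n_people)]
    environment = [rng.gauss(0.0, environment_sd) for _ in range(n_people)]
    phenotype = [g + e for g, e in zip(genetic, environment)]
    return statistics.pvariance(genetic) / statistics.pvariance(phenotype)


# With these hypothetical standard deviations the true value is 4 / (4 + 1) = 0.8.
expected = 2.0 ** 2 / (2.0 ** 2 + 1.0 ** 2)
estimate = simulate_heritability()
print(f"expected {expected:.2f}, simulated estimate {estimate:.2f}")
```

Note that nothing in this calculation requires knowing which genes are involved. That is exactly why a trait can be known to be highly heritable while its genetic cause stays unexplained.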

Studies to identify the genetic causes of a disease typically take a large group of people with the disease and a large group of healthy people and sequence all of their genomes. You then look for genetic variation in the sequencing data that is associated with the diseased group but absent, or much rarer, in the healthy group. These are known as genome-wide association studies (GWAS). They most often turn up point mutations in a single nucleotide of DNA, and to date such mutations have accounted for only roughly 5% of the heritability of known traits and diseases. This failure to uncover the bulk of heritable, disease-causing genetic variation can be explained by two phenomena.
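
As a rough illustration of what “associated with the diseased group” means for a single point mutation, the sketch below runs a chi-square test on a 2x2 table of allele counts in cases and controls. This is only one simple way to test one variant, not a description of how any particular study was run; the function name and the counts are hypothetical, and the p-value uses the fact that a chi-square statistic with one degree of freedom has tail probability erfc(sqrt(x / 2)).

```python
import math


def allele_association_test(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Assumption: every row and column total is non-zero.
    Returns (chi_square_statistic, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the variant allele is more common in cases than controls.
chi2, p_value = allele_association_test(300, 700, 200, 800)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}")
```

A real study repeats a test like this for every variant it measures, which is part of why such large groups of people are needed.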

Firstly, many complex traits and diseases arise from the combined action of hundreds, sometimes thousands, of genes. In such a highly dynamic, functionally redundant gene network, two different mutations in two different genes can produce the same functional outcome, so hundreds of mutations in one individual may have the same functional consequence as an entirely different set of hundreds of mutations in someone else. This turns association studies into statistical conundrums: the number of people needed to observe rare combinations of mutations often enough for statistical inference is extremely high.

The second phenomenon stems from one of the most puzzling problems in modern biology: sequencing the genome. Although genetic sequencing has outpaced Moore’s law, halving in cost and doubling in capacity each year, some parts of the genome remain elusive to sequencing, and these parts likely hold most of the undiscovered genetic variation that explains heritability. Genetic variation exists in many forms. Sometimes point mutations occur in single nucleotides, but far more “mutated” DNA is accounted for by structural changes, in which large stretches of DNA are inverted, duplicated or moved elsewhere in the genome. These mutations, although highly prevalent in all biological systems, are largely undetectable by conventional sequencing methods for a few reasons:

  1. Conventional sequencing is biased by the reference genome that serves as its template: sequencing information is matched against that reference, so large structural rearrangements go undetected.

  2. Large structural mutations are often flanked by highly repetitive DNA. In sequencing terms these regions are essentially “deserts”: sequencing information cannot be mapped to them unambiguously. They remain unmapped, and the rearranged segment they flank ends up misoriented.

  3. As diploid organisms, humans carry two copies of their DNA, one maternally inherited and one paternally inherited. Conventional sequencing methods neglect the differences between the maternal and paternal copies and collapse both into a single consensus sequence, which may miss important genetic variation.
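
The first of these reasons can be illustrated with a toy example. The sketch below builds a random “reference”, inverts a stretch of it to create a “sample” genome, cuts the sample into short reads, and tries to place each read on the reference by exact matching on either strand. Everything here is a simplifying assumption made for illustration: the sequence is random, the reads are error-free, and the exact-match “mapper” is far cruder than real alignment software.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(genome, start, end):
    """Simulate an inversion: genome[start:end] is flipped onto the other strand."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


def count_unmapped_reads(reference, sample, read_length):
    """Count reads from `sample` with no exact match on either strand of `reference`."""
    unmapped = 0
    for i in range(len(sample) - read_length + 1):
        read = sample[i:i + read_length]
        if read not in reference and reverse_complement(read) not in reference:
            unmapped += 1
    return unmapped


rng = random.Random(1)
reference = "".join(rng.choice("ACGT") for _ in range(300))  # toy reference genome
sample = invert_segment(reference, 100, 200)                 # hypothetical inversion
read_length = 25
total_reads = len(sample) - read_length + 1
unmapped = count_unmapped_reads(reference, sample, read_length)
print(f"{unmapped} of {total_reads} reads fail to map to the reference")
```

In this toy setting, every read that lies entirely inside the inverted stretch still matches the reference on the opposite strand, so it looks like ordinary sequence. Only reads that straddle one of the two breakpoints can fail to map, and those are exactly the reads that carry the evidence of the rearrangement.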

This is why I think genetic sequencing remains the most important biological problem of our time. When sequencing technology can fully capture the genetic sequence of living organisms, with all of their genetic variation, it will finally be possible to identify the genetic variation that accounts for most of the heritability of different traits and diseases. The ability to predict height from someone’s genes would be a pretty neat trick, but what it represents is a level of mastery over the language of life that would undoubtedly spill over into many areas of science.