Modern approaches to genome sequencing

Jack Marsden
Oct 21, 2020

Updated: Oct 27, 2020

The emergence of PacBio and ONT DNA sequencing and what it means for the science

The following is based on the article 'Long-read human genome sequencing and its applications' by G. A. Logsdon, M. R. Vollger and E. E. Eichler, first published in Nature Reviews Genetics on June 5 2020. Links to that article and other relevant reads can be found in the 'Read more' section below this post.

The ability to sequence the full genome of an organism (that is, to discover and then write out the full genetic sequence of that organism in As, Gs, Ts, and Cs) is one of the most celebrated achievements of science in the last 20 years. It culminated in the completion of the Human Genome Project in 2003, probably the most famous work in the field since the structure of DNA was discovered in 1953. This was a monumental achievement, requiring the labour of thousands of scientists from around the world at enormous cost, and it resulted in a draft sequence of the full human genome (more than 3 billion base pairs). It was completed two years ahead of schedule for 2.7 billion USD (in 1991 dollars, from when it was initially budgeted). And yet, it had more than 145,000 gaps and regions of uncertainty. Before you throw your hands in the air and lose all faith in the abilities of the scientific community like a seven-year-old who just learnt why Santa is exactly the same height and build as grandpa, it’s worth seeing why this is, and where we are now, 17 years later.

Traditional Sequencing Techniques – A whirlwind tour

To visualise the sequencing of DNA, it helps to think of it as a puzzle. The entire genome, which in humans consists of 23 pairs of chromosomes, each a linear sequence of DNA between 48 and 249 million base pairs in length, cannot at this stage be sequenced in one go. The technology does not exist just yet, but more on that later. What we must do instead is break the DNA into smaller pieces and then sequence those smaller pieces. Historically, the smaller these pieces are, the more accurate the read is likely to be. The common sequencing technique known as Illumina sequencing* (named after the Illumina machines on which it runs) reads sections of DNA just under 300 base pairs in length with 99.9% accuracy. So far, so good.

But this is where the puzzle analogy kicks in. Now we have to assemble those pieces back together and see the order they go in to create each chromosome. Now if you’ve ever done a puzzle, you know that the more puzzle pieces you have, the harder the puzzle is. To further complicate things, your puzzle in this scenario is one dimensional (only goes in a straight line), and the pieces usually overlap by varying amounts. You might even be missing a few pieces here and there. The good news is, supercomputers can fit most of the puzzle pieces together for you. Once you have fit a lot of the puzzle pieces together using your Illumina machine and supercomputer you are left with…more jigsaw pieces. They are substantially larger, but they certainly aren’t the length of an entire chromosome, so within any given chromosome you are still left with gaps and uncertainty. These larger jigsaw pieces are called contigs, and the longer and fewer contigs you have, the better you can reconstruct your chromosomes. For reference, the Human Genome Project had almost 150,000 contigs across its 23 chromosomes.
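To make the jigsaw a little more concrete, here is a minimal Python sketch of the idea: keep merging the two pieces that overlap the most until nothing overlaps any more. The function names and the toy reads are made up for illustration, and real assemblers are far more sophisticated than this (they cope with sequencing errors, repeats and millions of reads), so treat it as a cartoon rather than how the supercomputers actually do it.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of pieces with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            # Nothing overlaps any more: whatever is left are our contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy reads: the first three overlap, the last two overlap,
# but no read bridges the two groups.
toy_reads = ["ATGGCG", "GCGTAC", "TACCTT", "GGATCA", "TCAAGT"]
toy_contigs = greedy_assemble(toy_reads)
print(sorted(toy_contigs))
```

Five small pieces go in and two larger pieces come out, with a gap between them that no read covers. That is the situation described above, just at a vastly smaller scale.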

Perhaps the biggest reason that there is so much uncertainty in piecing the puzzle pieces together is the existence in the genetic sequence of sections referred to as ‘CpG islands’. These are long sections of the code which exist as dinucleotide repeats of CG (that is, the code in that region reads CGCGCGCG…), which extend from 300-3000 base pairs in length. Remember our initial puzzle pieces are up to 300 base pairs in length. So to extend our analogy, CpG islands create those puzzle pieces that are just blank white, and go somewhere in that section of clouds, or maybe that other section of clouds, but really who knows where and at this stage you just wish you hadn’t started this puzzle to begin with. CpG islands are very important in our genomes for reasons I’ll go over at another time, but they create huge challenges in whole genome sequencing (WGS). There are other regions of the genome containing long repeated sections which pose challenges, often in specific positions in the chromosome, particularly the telomeres (the ends of the chromosome) and the centromere (the region of the chromosome where it joins to its partner during cell division, usually near the middle). In fact, over 15% of the human genome is rendered inaccessible to genomic sequencing due to these repeats.
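The 'blank white puzzle piece' problem can be shown with a few lines of Python. This is only a sketch: the reference sequence, the reads and the function name are all invented for illustration, and the repeat here is much shorter than anything in a real genome.

```python
def placements(read: str, reference: str) -> list[int]:
    """Every start position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference: unique sequence on either side of a CG repeat.
reference = "ATTGCA" + "CG" * 10 + "TGACCT"

unique_read = "TTGCAC"   # spans the edge of the repeat, so it fits in one place
repeat_read = "CGCGCG"   # lies entirely inside the repeat

print(placements(unique_read, reference))
print(len(placements(repeat_read, reference)))
```

The read that touches unique sequence has exactly one home, while the read from inside the repeat fits equally well in several places. A read long enough to span the whole repeat and reach the unique sequence on both sides would not have this problem.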

Long Read Sequencing technologies

A simple and somewhat obvious fix to our problem is to make the puzzle pieces bigger. This is very easy to say in our letter of complaint to the puzzle company, but this particular puzzle company has historically had a problem with larger puzzle pieces. The problem is this: where our Illumina machine had over 99.9% accuracy, technologies which sequence longer reads have traditionally had lower, often unacceptable, levels of accuracy. That might now be changing.

Two emerging long-read sequencing technologies are referred to as PacBio and ONT sequencing. Both are shorthand for the organisations which developed them: PacBio is the method used by Pacific Biosciences, and ONT stands for Oxford Nanopore Technologies. I’ll briefly talk about how each of them works, and then about how they represent the new horizons of genomic sequencing. The explanations of the mechanisms of these techniques necessarily involve some introductory technical language, so they can be skipped if you desire, or you can back yourself to make it through (I believe in you!). If you’d like to deepen your understanding of some of the more molecular concepts in genetics, I am working on plain English primers on molecular biology, with plenty of colourful pictures (the best part of biology).

PacBio, or SMRT sequencing

PacBio, also called SMRT (single molecule, real-time) sequencing, can sequence DNA strands between 1 and 100kb in length (kb stands for kilobases, 1kb = 1000 bases). Each strand is connected to a circular DNA molecule template called a SMRTbell, which is then attached to an enzyme called DNA polymerase and put on to a ‘SMRT cell’, which contains fluorescently labelled free nucleotides. The DNA polymerase enzyme, the same kind of enzyme the body uses in DNA replication, attaches the free nucleotides to the inserted strand. Each time a new nucleotide is added, the fluorescent label is noted and then cleaved before the following nucleotide is added. Knowing the nucleotides which have been added, and their order, gives us the sequence of the original insert, as each nucleotide will only pair with one other nucleotide (A with T, G with C). So if we see that the added nucleotides have been AATTCGTA, we can use the pairing rule to surmise that our original sequence was (at that section) TTAAGCAT.
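The pairing rule is simple enough to write out in Python. This is a small sketch using the example above; the function and variable names are my own, and it follows the simplified base-by-base pairing used in this post, ignoring the direction in which each strand is read.

```python
# Each nucleotide pairs with exactly one partner.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_from_added(added: str) -> str:
    """Work out the original strand from the nucleotides added opposite it."""
    return "".join(PAIRS[base] for base in added)


print(template_from_added("AATTCGTA"))  # TTAAGCAT
```

Applying the rule twice gets you back to where you started, which is a handy sanity check.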

PacBio sequencing generates reads called continuous long reads (CLRs), which are estimated to be 85-92% accurate. Importantly, because the errors in these reads are random, computational tools known as polishing tools can be used to recover the true sequence of a region. In addition to this, PacBio can also generate HiFi sequence reads for target DNA 10-30kb in length, by utilising a technique called circular consensus sequencing (CCS). This essentially takes advantage of the strand’s shorter length to read the same molecule a number of times, creating multiple reads and then generating a ‘consensus’ read from the data.
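Why does it matter that the errors are random? Because random errors rarely land in the same spot twice, so a simple vote across several reads of the same molecule tends to cancel them out. Here is a minimal Python sketch of that voting idea. It is not PacBio's actual CCS algorithm: the function name and the toy reads are hypothetical, and it assumes the passes are already lined up, are all the same length, and only contain substituted letters (real reads also have inserted and missing letters).

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote at each position across equal-length passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this toy version needs passes of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Hypothetical toy data: five passes over the same molecule,
# four of which contain one random error each.
true_sequence = "ACGTTAGC"
toy_passes = ["ACGTTAGC", "ACCTTAGC", "ACGTTGGC", "TCGTTAGC", "ACGTAAGC"]

print(consensus(toy_passes))
```

Four of the five passes are wrong somewhere, yet the vote recovers the true sequence because no two passes are wrong in the same place. If the errors were systematic (the same mistake at the same position every time), no amount of voting would help.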

ONT sequencing

Where PacBio sequencing uses a circular SMRTbell, ONT sequencing attaches the target DNA segment to a linear strand called a sequencing adapter, which is attached to a motor protein (a motor protein is just a protein which assists in moving things). The motor protein is then attached to a tiny hole (called a nanopore) on a surface called a flow cell. The motor protein serves to unwind the DNA, and the DNA is driven through the pore using an electrical current (DNA is negatively charged, an attribute regularly taken advantage of by gene technology). The DNA, moving through the nanopore, disrupts the electrical current, and the exact disruption caused to the current by each individual nucleotide is specific to that nucleotide. Because of this, observing the disruptions to the current as the DNA passes through the nanopore allows scientists to read off the order of the nucleotides in the DNA sequence.

ONT produces ultra-long reads ranging from 10kb in length to several Mb in length (Mb: megabase. 1Mb = 1 million bases). The length of the reads in ONT is really only limited by the difficulty in preparing and treating DNA molecules of that size to be analysed. The accuracy of a single read is estimated at 87-92%, but can also be improved by polishing tools as in PacBio sequencing.
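To get a feel for what these read lengths mean for the puzzle, here is a rough back-of-the-envelope calculation in Python. It asks: if the reads lined up perfectly end to end with no overlap, how many would it take to cover the largest human chromosome? The chromosome and read lengths are the figures quoted in this post (with 1Mb chosen as a round ONT example), the names are my own, and real sequencing needs many times more reads than this because reads overlap and each region is read repeatedly.

```python
LARGEST_CHROMOSOME_BP = 249_000_000  # upper chromosome length quoted in this post

# Illustrative read lengths, in base pairs.
read_lengths_bp = {
    "Illumina (300 bases)": 300,
    "PacBio (100kb, upper end)": 100_000,
    "ONT (1Mb, illustrative)": 1_000_000,
}


def min_pieces(total_bp: int, read_bp: int) -> int:
    """Fewest end-to-end reads needed to cover total_bp (rounded up)."""
    return (total_bp + read_bp - 1) // read_bp


pieces = {
    name: min_pieces(LARGEST_CHROMOSOME_BP, read_bp)
    for name, read_bp in read_lengths_bp.items()
}

for name, count in pieces.items():
    print(f"{name}: at least {count:,} pieces")
```

Even in this idealised case, the puzzle shrinks from hundreds of thousands of pieces to a few hundred as the reads get longer.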

Implications of Long-Read sequencing

Recall the contigs, those larger jigsaw pieces constructed out of our initial smaller jigsaw pieces. Now that our initial jigsaw pieces are bigger, how much bigger are our contigs? Initial numbers show that contigs from PacBio and ONT sequencing efforts are bigger than Illumina contigs by a factor of over 100. Suddenly, our jigsaw gets a whole lot easier. The ultimate goal is to get a contig the length of a whole chromosome, telomere to telomere, and not only is this not too far away, it’s already been done. Miga and colleagues, in a study published in 2020, completed a sequence of the human genome comprising 448 contigs (recall the original Human Genome Project had almost 150,000), including a single contig for the entire X chromosome. This study initially used ONT sequencing, but also used PacBio and Illumina sequencing methods to further polish the assembly of the genome. The final assembly was 2.94Gb (Gb: Gigabase, 1Gb = 1 billion base pairs) in length and is perhaps the best human genome template we have.

With long-read sequencing technologies we are able to gain a more accurate view of the human genome and all the nuanced structural variation it contains. The more accurate and detailed understanding of the human genome allows a greater understanding of complex inheritance patterns for countless diseases with genetic components, as well as the complex interactions between genes which create certain physical characteristics, including disability and disease. It will also allow us to gain a better understanding of the full range of genetic diversity in humans. Long-read sequencing is not limited to DNA, either. It can be used to analyse the structure of mRNAs (RNA transcribed from the DNA to create proteins – read more on this in the molecular biology primer) and other characteristics such as epigenetic modifications (molecular markers attached to the DNA).

Overall, long-read sequencing technologies promise to provide us with clearer, more accurate pictures of our genomes, providing countless benefits to our understanding of how the molecules of life work to make us who we are.

I hope you learnt something new!

Thanks for reading,

Jack

Read more

https://www.nature.com/articles/s41586-020-2547-7 - article in Nature outlining the recent assembly of a complete X chromosome by Karen Miga and colleagues

https://www.edx.org/course/introduction-to-biology-the-secret-of-life-3 - online introductory biology course run by Eric Lander, one of the main academics associated with the Human Genome Project, in which he devotes a fair bit of time to discussing primitive sequencing methods and the process of the Human Genome Project

Genetics: Genes, Genomes, and Evolution (2017) by Philip Meneely, Rachel Dawes Hoang, Iruka Okeke, and Katherine Heston. DNA sequencing methods are outlined in chapter 3.

https://www.youtube.com/watch?v=RcP85JHLmnI - ONT's video on how their nanopore sequencing technology works

https://www.youtube.com/watch?v=_lD8JyAbwEo - PacBio's video on how their sequencing technology works

*The Illumina sequencing method was actually developed after the completion of the Human Genome Project, which largely used a method known as Sanger sequencing. The Illumina sequencing method, how it works, and breakdowns of other sequencing methods such as Sanger sequencing will be the topic of a future article (or articles).
