It took us 13 years to map the human genome. Today we can sequence DNA in six hours. How did we get there?
Whether you are a small clinical lab or a massive biotech research and development center, genome sequencing has never been faster or more economical than it is now. At its most efficient, sequencing an entire genome can now be performed in about five hours for patients in intensive care units.1 It was only 33 years ago that the Human Genome Project was launched with the goal of sequencing the entire human genome, an endeavor that took 13 years.2
In this article, we explore the expansive history of DNA sequencing methodologies and technology that brought our understanding of life to its current status.
From 1871 to 1929, scientists such as Friedrich Miescher, Walther Flemming, Albrecht Kossel, and Phoebus Levene laid the groundwork in the chemistry of cells and nucleic acids that paved the way for DNA sequencing. By 1929, it was established that nucleic acids were composed of five nitrogen-containing bases: adenine (A), cytosine (C), guanine (G), thymine (T), and uracil (U). The backbone of DNA consisted of alternating sugar and phosphate groups with a nitrogen-containing base, either A, T, C, or G, attached to each sugar molecule.3 These findings are the foundation of the “Genetic Code” or “Code of Life”, but at the time DNA itself was considered an unimportant structural molecule, and proteins were thought to be the genetic material because of their association with chromosomes and their complex structure.
In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty published an article demonstrating that DNA, not protein, can transform cellular properties. Once this new view was accepted, there was a call to expand the research being performed on DNA. By 1952, Erwin Chargaff and Linus Pauling had determined much about the chemical properties of DNA and the nature of the chemical bonds that hold molecules together. Chargaff determined that in the DNA of any organism the amount of G should equal the amount of C, while the amount of A should equal the amount of T.
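Chargaff's observation follows directly from base pairing, which the short Python sketch below illustrates. It assumes an idealized double-stranded molecule built from one strand and its reverse complement; the example sequence and the helper names (`reverse_complement`, `duplex_base_counts`) are made up for illustration and do not come from Chargaff's work.

```python
from collections import Counter

# Watson-Crick pairing partners for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand: str) -> Counter:
    """Count the bases on both strands of an idealized DNA duplex."""
    return Counter(strand) + Counter(reverse_complement(strand))


# Hypothetical single strand; any A/C/G/T string would do.
counts = duplex_base_counts("ATGCGGATTACA")
print(counts["A"], counts["T"])  # equal by construction
print(counts["G"], counts["C"])  # equal by construction
```

Because every A on one strand is paired with a T on the other, and every G with a C, the totals across the duplex must match in pairs, whatever the sequence.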
As a chemist, Pauling established himself as a founding father of molecular biology by detailing the structures of proteins at the atomic level, including elucidating the structure of the most basic form of a protein chain, the alpha helix. By understanding the individual elements that make up larger protein molecules, Pauling could apply the rules of chemistry and physics to determine how the pieces could be linked together, and then test his structural hypotheses with model kits. In 1952, Rosalind Franklin photographed her 51st X-ray diffraction pattern of DNA, providing the clearest image of the molecule yet. In turn, James Watson and Francis Crick were able to publish “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid” the following year.4
The technology to automatically read DNA sequences would not come for quite some time, but in 1965 the groundwork for a DNA sequencing method was laid with the publication of “Structure of a Ribonucleic Acid” by Robert Holley and “A Two-dimensional Fractionation Procedure for Radioactive Nucleotides” by Frederick Sanger.5,6 They focused on sequencing pieces of RNA, since it was considered less complex and easier to work with than DNA. The genomes of RNA bacteriophages, for example, are not complicated by a complementary strand and are more readily bulk-produced in culture. Using enzymes to cut RNA chains at specific locations allowed fully and partially degraded RNA fragments to be compared to determine the order of the bases present.7
A decade later, Sanger and Harvard scientists Allan Maxam and Walter Gilbert revolutionized genomics as they independently introduced their own methods of sequencing DNA. Sanger refined his method in 1977, and it became the most widely adopted approach to DNA sequencing, so much so that Applied Biosystems had commercialized and automated the process by 1987. The Sanger method follows three main steps using a double-stranded DNA fragment as a template. First, target DNA is linearly amplified by incorporation of deoxynucleotides (dNTPs, a mixture of A, T, C, and G). The nucleotide mix also contains fluorescently labeled dideoxynucleotides (each ddNTP base has its own specific label) that terminate the growing chain, so that across the reaction chains end at every base position. Second, the DNA fragments of varying lengths are separated by gel electrophoresis.
Lastly, the order of nucleotides in the target DNA fragment is determined by a sequencing instrument reading the fluorescent tag (which corresponds to a specific base) on each DNA fragment. Initially the fragments were radiolabeled with 32P and four separate reactions were run, one per dideoxy terminator, but now the modified bases carry one of four fluorescent tags. The instruments made to perform this method of DNA sequencing produce reads of slightly less than one kilobase. To analyze longer fragments of DNA, researchers employed a technique known as shotgun sequencing. This technique centers on breaking the DNA into many fragments, sequencing all the fragments, and then using computer software developed by Rodger Staden to align the resulting set of gel readings into one contiguous sequence, also known as a contig. This method allowed Sanger and colleagues to complete the whole genome of bacteriophage lambda, providing the 48,502 base pair (bp) DNA sequence in 1982, the same year that GenBank, the U.S. National Institutes of Health’s sequence database, opened to the public.
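The idea behind assembling shotgun reads into a contig can be sketched in a few lines of Python. This is a toy greedy overlap merger under strong assumptions (error-free reads, all from the same strand, no repeats, no read contained inside another); it is not Staden's software, and the function names and example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads from a 15-base fragment.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

Real assemblers also have to cope with sequencing errors, reads from both strands, and repeated regions, which is why a greedy merge like this is only a conceptual starting point.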
In 1984, the Department of Energy and the International Commission for Protection Against Environmental Mutagens and Carcinogens sponsored the Alta Summit Conference. The conference was initiated by David Smith and Mortimer Mendelsohn, who asked attendees the following question: “Could new DNA sequencing methods permit direct detection of mutations, and could any increase in the mutation rate among survivors of the Hiroshima and Nagasaki bombings be detected?” Within the following year, Charles DeLisi, Robert Sinsheimer, and Renato Dulbecco separately proposed a concerted program to sequence the human genome in order to identify genes associated with certain diseases. The Human Genome Project was launched in 1990 and concluded in 2003. The International Human Genome Sequencing Consortium published the euchromatic sequence of a human genome in 2004.8
The second generation of DNA sequencing emerged following the introduction of pyrosequencers in 1996.9 The process was based on sequencing-by-synthesis technology, which measured the luminescence generated by the enzymatic conversion of released pyrophosphate into light.10 The sequencing-by-synthesis (SBS) method refers to the sequential addition and detection of individual nucleotides. Some of the major post-Sanger sequencing techniques include Roche/454, SOLiD, and Solexa sequencing technologies. In all these processes, the DNA is sheared prior to attachment of adaptors to the ends. During 454 sequencing, each fragment is attached to a bead, and the DNA on each bead is PCR amplified.11
For clonal amplification, primers anneal to single-stranded DNA templates and a DNA polymerase initiates a reaction generating multiple copies of the same sequence. During the sequencing reaction, each nucleotide is added sequentially, and the signal emitted by an individual nucleotide is quantified and classified. SBS was subsequently adopted by several sequencing platforms such as Illumina, Ion Torrent, and Pacific Biosciences. The signal measured varies by instrument and includes release of pyrophosphate, fluorescence emission, or a change in electric current.
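The sequential add-and-detect cycle can be illustrated with a small Python sketch of a pyrosequencing-style readout. It is a simplification under stated assumptions: signals are ideal integers equal to the number of bases incorporated per flow (real light intensities are noisy), the input is treated as the strand being synthesized rather than its template, and the flow order `TACG`, the function names, and the example sequence are illustrative choices.

```python
def flowgram(synthesized: str, flow_order: str = "TACG", max_cycles: int = 50):
    """Simulate one nucleotide being offered at a time.

    Returns a list of (nucleotide, signal) pairs, where the signal is the
    number of consecutive bases incorporated during that flow.
    """
    pos = 0
    flows = []
    for _ in range(max_cycles):
        for nt in flow_order:
            run = 0
            # A homopolymer run is incorporated within a single flow.
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))
            if pos == len(synthesized):
                return flows
    return flows  # sequence not finished within max_cycles


def call_bases(flows) -> str:
    """Turn per-flow signals back into a base sequence."""
    return "".join(nt * signal for nt, signal in flows)


flows = flowgram("TTACGGA")
print(flows)
print(call_bases(flows))
```

A flow with signal 0 means the offered nucleotide did not match the next position, and a signal of 2 or more marks a homopolymer run, which is where noisy real-world signals make base calling harder.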
Roche/454 Life Sciences released the first high-throughput sequencer in 2005, enabling large-scale genome sequencing projects such as the Neanderthal Genome Project, with results published in 2009.12 “Jim,” the sequencing of the genome of James Watson, co-discoverer of the Watson-Crick structure of DNA, was completed in 2007.13 Solexa introduced another SBS method, in which a single-stranded DNA library is constructed and washed across a flow cell containing oligonucleotides complementary to one of the two adapter sequences on the DNA fragments.14
The process culminates in clusters of clonal populations from each original DNA strand after a process known as “bridge amplification”, in which replicating DNA arches over to prime the next round of polymerization from neighboring surface-bound oligonucleotides. Solexa sequenced the same bacteriophage, ΦX174, that was first sequenced by Sanger and colleagues.15 A single run of the Solexa SBS technology provided over 3 million bases. Solexa launched its sequencer, the Genome Analyzer, in 2006, revolutionizing sequencing power with the ability to generate 1 gigabase of data in a single run. Illumina acquired Solexa in 2007, which allowed Illumina to become a leader in short-read (as the fragment-based methods are called) sequencing technology. Overall, the introduction of these second-generation (also called next-generation) sequencers significantly reduced the cost of sequencing, propelling scientific research and industrial applications of sequencing into a new era.
The third- and fourth-generation sequencing methods comprise single-molecule technologies that can generate reads of over 10,000 bp. Called long-read technologies, they significantly ease the limitations that short reads impose on de novo genome assembly and structural analysis. Third-generation sequencing was conceptualized circa 2008-2009, when researchers gathered to share information about their latest technologies.16 Several commercial platforms using third-generation DNA sequencing technologies are currently available, such as Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT) sequencing and Illumina TruSeq Synthetic Long-Read technology. Fourth-generation technology, as designated by some, is represented by the Oxford Nanopore sequencing platform.17 Read lengths from each of these methods can average 5-15 kbp and can exceed 100,000 bp.18 The PacBio SMRT cell technology sequences DNA using an SBS method that detects fluorescent nucleotides as they are added to the template molecules. The instrument has produced reads of ~100,000 bp, with about 8 gigabases (Gb) of output per day. Although the error rate of individual reads is high, several algorithms and the depth of sequencing of multiple template copies improve the accuracy geometrically to over 99.99%.19
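Why depth compensates for a high per-read error rate can be seen with a simple binomial calculation, sketched here in Python. The model is deliberately crude and rests on assumptions that are not from the cited work: errors are independent between reads, each read is simply right or wrong at a position, the consensus is a majority vote, and the 15% per-read error rate is an illustrative figure. Production consensus algorithms are considerably more sophisticated.

```python
from math import comb


def majority_error(per_read_error: float, depth: int) -> float:
    """Probability that more than half of `depth` reads are wrong at a position.

    Assumes independent errors and a simple majority vote; use odd depths
    so that ties cannot occur.
    """
    need = depth // 2 + 1  # wrong reads needed to outvote the correct ones
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(need, depth + 1)
    )


# Illustrative per-read error rate (an assumption, not a measured value).
for depth in (1, 5, 15, 31):
    print(depth, majority_error(0.15, depth))
```

Under these assumptions the consensus error shrinks rapidly as depth grows, which is the intuition behind combining many noisy reads of the same template into one accurate sequence.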
Oxford Nanopore Technologies (ONT) utilizes a nanopore inserted in an electrically resistant membrane. The characteristic disruption in the current as bases pass through the pore is measured, allowing the sequence of a single molecule to be determined.20 As the sequencing platforms have evolved, the DNA library preparation method has also changed. For ONT sequencing, a hairpin structure is ligated to the double-stranded DNA, and the system can read both strands in one continuous read. ONT sequencing can provide ultra-long reads over 300 kb in length.21 The technology also revolutionized the size of sequencers with the pocket-sized MinION, which is highly portable and can be used without a sophisticated lab. This approach has provided a significant boost in identifying and screening for pathogens such as the Ebola and Lassa viruses.22 The MinION was introduced to early-access users in 2014 and made commercially available in 2015.
The identification and characterization of microorganisms has changed over time with the advancement of different tools and techniques. Not long ago, bacteria were classified using the DNA-DNA hybridization method.23 The method was laborious and extremely limited, with no possibility of creating a central database. The availability of sequencing technologies and the use of ribosomal genes, like the 16S rRNA gene, allowed the creation of a central database for accurate microbial identification. With the increasing affordability of sequencing and of algorithmic tools to analyze microbial databases, several organism-specific databases have been introduced and implemented that can even provide strain-level identification.
Sanger sequencing has been instrumental in providing comparison data. With recent advancements in next-generation sequencing, researchers have been able to describe and study species and organisms that are unculturable in a standard lab setting. Whole genome sequencing using short and long reads, and rapid sequencing using ONT, have made significant impacts in microbiology research and development. We are on the cusp of implementing these technological advancements in an industrial setting, where scientific tools are being used to better understand microorganisms and provide rapid diagnostic and screening methods. Accugenix® is excited to see what the future holds with implementation of these newer techniques in microbial identification.
1. Armitage, Hanae. “Fastest DNA Sequencing Technique Helps Undiagnosed Patients Find Answers in Mere Hours.” News Center, 12 Jan. 2022. Stanford Medicine News.
2. Gibbs, Richard A. “The human genome project changed everything.” Nature Reviews Genetics 21.10 (2020): 575-576.
3. Singer, Maxine F. “1968 Nobel Laureate in Medicine or Physiology.” Science 162.3852 (1968): 433-436.
4. Watson, James D., and Crick, Francis H.C. “Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid.” Nature 171.4356 (1953): 737-738.
5. Holley, Robert W., et al. “Structure of a ribonucleic acid.” Science 147.3664 (1965): 1462-1465.
6. Sanger, Frederick, George G. Brownlee, and Bart G. Barrell. “A two-dimensional fractionation procedure for radioactive nucleotides.” Journal of molecular biology 13.2 (1965): 373-IN4.
7. Heather, James M., and Benjamin Chain. “The sequence of sequencers: the history of sequencing DNA.” Genomics 107.1 (2016): 1-8.
8. International Human Genome Sequencing Consortium. “Finishing the euchromatic sequence of the human genome.” Nature 431 (2004): 931-945.
9. Ari, Şule, and Muzaffer Arikan. “Next-generation sequencing: advantages, disadvantages, and future.” Plant omics: Trends and applications (2016): 109-135.
10. Tost, J., and Gut, I. “DNA methylation analysis by pyrosequencing.” Nat Protoc 2 (2007): 2265-2275.
11. Rothberg, Jonathan M., and John H. Leamon. “The development and impact of 454 sequencing.” Nature biotechnology 26.10 (2008): 1117-1124.
12. Green, Richard E., et al. “The Neandertal genome and ancient DNA authenticity.” The EMBO journal 28.17 (2009): 2494-2502.
13. Wadman, Meredith. “James Watson’s genome sequenced at high speed.” Nature 452.7189 (2008): 788-789.
14. Ansorge, Wilhelm J. “Next-generation DNA sequencing techniques.” New biotechnology 25.4 (2009): 195-203.
15. Kircher, M., Stenzel, U., and Kelso, J. “Improved base calling for the Illumina Genome Analyzer using machine learning strategies.” Genome Biol 10 (2009): R83.
16. Check Hayden, E. “Genome sequencing: the third generation.” Nature (2009).
17. Feng, Yanxiao, et al. “Nanopore-based fourth-generation DNA sequencing technology.” Genomics, proteomics & bioinformatics 13.1 (2015): 4-16.
18. Amarasinghe, S.L., Su, S., Dong, X., et al. “Opportunities and challenges in long-read sequencing data analysis.” Genome Biol 21 (2020): 30.
19. Chin, C.S., Alexander, D., Marks, P., et al. “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.” Nat Methods 10 (2013): 563-569.
20. Lu, Hengyun, Francesca Giordano, and Zemin Ning. “Oxford Nanopore MinION sequencing and genome assembly.” Genomics, proteomics & bioinformatics 14.5 (2016): 265-279.
21. Midha, M.K., Wu, M., and Chiu, K.P. “Long-read sequencing in deciphering human genetics to a greater depth.” Hum Genet 138 (2019): 1201-1215.
22. Cao, Ying, et al. “Nanopore sequencing: a rapid solution for infectious disease epidemics.” Science China Life Sciences 62 (2019): 1101-1103.
23. Goris, Johan, et al. “DNA–DNA hybridization values and their relationship to whole-genome sequence similarities.” International journal of systematic and evolutionary microbiology 57.1 (2007): 81-91.
Ryan Cox works with Accugenix’s Microbial Databases and Technology Transfer team as an associate scientist. He assists in the ongoing curation and maintenance of the MALDI-TOF Library, developing strain typing assays and other various projects to help improve the workflow of the laboratories.
Sujan Timilsina is a part of the Accugenix Microbial Databases and Technology Transfer team, implementing development and curation of microbial databases across AccuGENX-ID platforms. He has over ten years of experience with microbial detection and identification, high throughput sequencing, and the development and application of bioinformatic pipelines for phylogenetically analyzing microbial evolution and movement across a wide range of microorganisms.