DNA Sequencing

From Nucleowiki
Revision as of 16:56, 6 January 2025 by Richert

DNA sequencing is the process of determining the order of nucleotides in deoxyribonucleic acid (DNA). It has applications in many fields of science, for example in the development of gene therapies and in research on extinct species (Mathews, 2024).

History of DNA Sequencing

After the structure of the DNA double helix had been uncovered in 1953 (Watson & Crick, 1953), researchers turned their attention to determining the nucleotide sequence of DNA. Early efforts included the sequencing of yeast transfer ribonucleic acid (tRNA) and the analysis of the single-stranded ends of bacteriophage DNA (Booth, 2022; Mathews, 2024). In 1977, the identification of longer sequences of DNA became practicable with the advent of two new methods: the chemical cleavage procedure, developed by Allan Maxam and Walter Gilbert (Maxam & Gilbert, 1977), and the chain termination procedure, developed by Frederick Sanger (Sanger et al., 1977). These two methods are often referred to as 'first-generation sequencing' methods.

First-Generation Sequencing

The Maxam-Gilbert procedure involves a set of four chemical reactions, each able to cleave the glycosidic bond of a specific nucleobase. Guanine can be methylated using dimethyl sulfate, followed by cleavage of the glycosidic bond under heating. Adenine can also be methylated using dimethyl sulfate, but its glycosidic bond is cleaved by incubation with dilute acid. Thymine can be cleaved via treatment with hydrazine, whereas cytosine must be treated with hydrazine at high salt concentrations. Treating the resulting abasic site in the backbone with the base piperidine leads to strand cleavage at this site. The resulting DNA fragments differ in length and can be separated via polyacrylamide gel electrophoresis (PAGE). The order of the bands in the gel corresponds to the nucleotide sequence (Booth, 2022).

In contrast, Sanger sequencing utilizes a DNA polymerase to form a complementary strand to the template region of interest. The polymerase is provided with all four 2’-deoxynucleotides (dNTPs) and one type of 2’,3’-dideoxynucleotide (ddNTP). This ddNTP is a chain terminator: once a ddNTP has been incorporated into the growing DNA strand, extension cannot continue (Mathews, 2024). Thus, DNA strands of different lengths are formed, each terminating at a position where the ddNTP was incorporated. Separation via PAGE, with a separate lane for each ddNTP-containing reaction, yields the order of nucleobases in the strand.

Over the following decade, two innovations further increased the efficiency of the Sanger procedure. First, by marking the four ddNTPs with different fluorescent labels, electrophoresis of all four reactions could be carried out in a single capillary. Second, so-called shotgun sequencing enabled the analysis of long sequences. Breaking up DNA via sonication or endonucleases creates multiple overlapping fragments, whose sequences are called 'reads'. The reads were sequenced separately, and the full sequence was assembled by analysing their overlaps (Booth, 2022; Heather & Chain, 2016). This enabled the sequencing of complete genomes, culminating in the publication of a draft human genome in 2001 and of its finished euchromatic sequence in 2004 (International Human Genome Sequencing Consortium, 2004).

As noted by Shendure et al. (2017), the increasing amount of genetic data also necessitated the development of algorithms and databases. Noteworthy innovations include the Smith-Waterman algorithm (Smith & Waterman, 1981), BLAST (Altschul et al., 1990) and GenBank (Benson et al., 1993).
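
The overlap-based assembly of shotgun reads described above can be illustrated with a small sketch. It is a toy greedy merger, not the algorithm used by any particular assembler: the function names, the `min_len` threshold and the example reads are all assumptions made for illustration, and real reads contain errors and repeats that this sketch does not handle.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len` (hypothetical cutoff).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three error-free, overlapping reads from a made-up 14-base sequence.
example_reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
contigs = greedy_assemble(example_reads)
print(contigs)
```

With error-free reads and no repeats, the greedy merge recovers a single contig. Repeats longer than the reads make the overlaps ambiguous, which is one reason read length matters for assembly.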

Second-Generation Sequencing

From the 1990s onward, multiple technologies further increased the achievable throughput. They are generally referred to as 'second-generation sequencing technologies' or 'next-generation sequencing technologies' (Liu et al., 2012). Many of these were also commercialized as stand-alone platforms. Notable platforms include the pyrosequencing platform of 454 Life Sciences, the Illumina platform (initially Solexa), the SOLiD system from Applied Biosystems and the Ion Torrent technology of Life Technologies (Heather & Chain, 2016). The following is a brief description of each platform, highlighting its advantages and disadvantages.

The pyrosequencing workflow begins by attaching single-stranded (ss) DNA fragments to beads and amplifying them via emulsion PCR (emPCR) (Heather & Chain, 2016). Then, one type of dNTP at a time is washed over the beads, together with a DNA polymerase. If the dNTP is incorporated into the strand, pyrophosphate (PPi) is released. ATP sulfurylase uses the pyrophosphate to produce ATP, which in turn is consumed by luciferase to produce light. Lastly, apyrase degrades remaining ATP and dNTPs between the additions of bases. Thus, a detected flash of light indicates that a dNTP has been incorporated into the sequence (Ahmadian et al., 2006). Pyrosequencing can be used for de novo gene assembly, boasting long read lengths and high sequencing speeds. However, the length of homopolymeric regions (that is, regions which contain repeats of just one nucleotide) is difficult to determine using pyrosequencing (Mathews, 2024). Also, the cost per base is comparatively high (Liu et al., 2012).

The acronym SOLiD is shorthand for 'Sequencing by Oligonucleotide Ligation and Detection' (Heather & Chain, 2016). Again, ssDNA fragments are amplified in emPCR and attached to a solid support. Then, they are exposed to different degenerate DNA oligomers. Each oligomer carries a fluorophore, the colour of which corresponds to one base of the oligomer. The other bases are degenerate and can bind to any other base. A T4 DNA ligase ligates the oligomer to the primer, and the identity of the ligated base can be determined by detecting the fluorescence (Shendure et al., 2005). The accuracy of the SOLiD method is very high, making it useful for the resequencing of genomes, for example to detect single nucleotide polymorphisms. However, read lengths are short, making the assembly of longer sequences difficult (Liu et al., 2012).

The Ion Torrent procedure also begins by amplifying ssDNA fragments on a bead via emPCR. Beads are sequentially exposed to dNTPs, one type at a time, and a DNA polymerase extends the strands. Every time a triphosphate group is hydrolyzed, a proton is released, which changes the pH around the bead. This change is detected and digitized by a sensor, indicating incorporation of a certain type of dNTP (Rothberg et al., 2011). This method is fast, has a low cost per base and is easy to scale, but its accuracy in homopolymeric regions is low (Mathews, 2024).

In Illumina sequencing, DNA fragments are immobilized and amplified on a solid surface in a process called 'bridge amplification'. Then, the fragments are exposed to dNTPs, each base marked by a different fluorescent molecule. The 3’-OH group of the dNTPs is protected by an azidomethyl group, leading to chain termination upon incorporation, so only a single dNTP can be incorporated at a time. The type of dNTP is then identified by four-colour imaging. Lastly, the fluorescent groups and the 3’ protecting groups are cleaved, so that a new dNTP can be incorporated. This process can be repeated for read lengths of up to 300 bases (Booth, 2022). The Illumina protocol allows for accurate, high-throughput sequencing of DNA. Its read lengths are short compared to those of the other platforms (Liu et al., 2012). Still, as of September 2024, the Illumina method is the most common sequencing method in the world (Mathews, 2024).
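
The flow-by-flow readout of pyrosequencing, and the reason homopolymers are hard to call, can be sketched in a few lines. This is an idealized model under stated assumptions: the flow order `TACG`, the function names and the example sequence are made up for illustration, and the signal is treated as an exact integer count, whereas a real instrument measures a noisy light intensity that is only roughly proportional to the number of bases incorporated.

```python
def flowgram(synthesized, flow_order="TACG"):
    """Simulate which dNTP flows produce a signal while a strand is synthesized.

    `synthesized` is the strand the polymerase builds, written 5'->3'.
    Returns a list of (nucleotide, bases_incorporated) pairs, one per flow.
    """
    if set(synthesized) - set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")

    flows = []
    pos = 0
    i = 0
    while pos < len(synthesized):
        nuc = flow_order[i % len(flow_order)]
        count = 0
        # A homopolymer run is incorporated in a single flow, so the
        # run length must be inferred from the signal strength alone.
        while pos < len(synthesized) and synthesized[pos] == nuc:
            count += 1
            pos += 1
        flows.append((nuc, count))
        i += 1
    return flows


def decode(flows):
    """Rebuild the sequence from an ideal (noise-free) flowgram."""
    return "".join(nuc * count for nuc, count in flows)


flows = flowgram("TTAGGGC")
print(flows)
print(decode(flows))
```

In this noise-free model the run `GGG` appears as a single flow with count 3. In practice, distinguishing a count of 3 from 4 (or 7 from 8) depends on resolving small relative differences in signal, which is where homopolymer errors arise. The same flow-based logic applies to Ion Torrent, with a pH change in place of light.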

Third-Generation Sequencing

Although second-generation sequencing methods improve on first-generation methods in many ways, they still have some shortcomings. Firstly, all second-generation platforms rely on DNA amplification. This not only adds a layer of complexity but can also introduce errors and biases (Pinard et al., 2006). Secondly, the read lengths of all platforms are less than 1000 bp (Mathews, 2024). The human genome contains repeating regions of up to 10^4 bp, meaning these regions cannot be sequenced with second-generation methods (Li & Freudenberg, 2014). Third-generation sequencing technologies try to address these problems. They rely on single-molecule sequencing, meaning the sequencing of long reads without previous amplification (Heather & Chain, 2016). Notable technologies include the single-molecule real-time sequencing (SMRT) method from Pacific Biosciences (PacBio) and nanopore sequencing from Oxford Nanopore Technologies (ONT) (Booth, 2022).

PacBio’s SMRT utilizes so-called zero-mode waveguides (ZMWs), which are essentially wells in a metal film with a diameter of less than 100 nm. Light with a wavelength much longer than 100 nm decays exponentially in the well, so that only fluorophores in a small part of the well are illuminated and can be detected (Heather & Chain, 2016). This is exploited by attaching a DNA polymerase to the bottom of the well and providing it with a single, long DNA fragment. Then, a mixture of dNTPs marked with different fluorophores is added. As a dNTP is incorporated into the strand, a fluorescence signal can be detected at the bottom of the well, before the fluorescent label is cleaved off and diffuses into the bulk of the solution (Booth, 2022). This not only provides information about the type of nucleotide added, but also gives real-time data. The time-course data can contain additional information about modifications of DNA, such as epigenetic methylations (Liu et al., 2012). Overall, SMRT sequencing provides ultra-long reads of up to 10^5 bases, with accuracies from repeated reads similar to those of Illumina sequencing. However, raw error rates are still high (Mathews, 2024).

In ONT’s nanopore sequencing, a motor protein pulls ssDNA through a nanopore in a membrane. Applying a constant voltage drives an ionic current through the pore. The conductivity of the pore changes depending on the nucleobases that pass through it, so the measured current changes over time. Different currents are characteristic of different DNA sequences (Deamer et al., 2019). Nucleobase sequences as well as epigenetic modifications may be detected. Using ONT’s nanopore sequencing, read lengths of up to 2.273⋅10^6 bp have been reported. ONT’s smallest sequencers, the “MinIONs”, are portable, allowing for decentralized real-time sequencing. This has been utilized in the real-time surveillance of disease outbreaks around the world (Wang et al., 2021). However, output data analysis is complicated, and error rates are still comparatively high (Mathews, 2024).

Both PacBio’s SMRT and ONT’s nanopore sequencing are still being optimized, and the introduction of new computational workflows and improved chemistry is to be expected, further reducing error rates and complexity. Third-generation sequencing methods will probably replace second-generation sequencing in most applications (Athanasopoulou et al., 2022).
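
The contrast drawn above between a high raw error rate and a high accuracy from repeated reads can be made concrete with a short calculation. The sketch below is a simplified model and not PacBio’s actual consensus procedure: it assumes that every pass over the same molecule miscalls a given base independently with the same probability `p`, and, pessimistically, that all miscalls agree with each other, so the consensus is wrong whenever more than half of the passes are wrong. The value `p = 0.1` is a hypothetical number chosen for illustration and is not taken from the sources cited here.

```python
from math import comb


def majority_error(p, n):
    """Probability that a majority vote over n passes miscalls a base.

    p: per-pass error probability (assumed independent and identical).
    n: number of passes; must be odd so that the vote cannot tie.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    need = n // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


p_raw = 0.1  # hypothetical raw per-base error rate, for illustration only
for passes in (1, 3, 5, 9, 15):
    print(passes, majority_error(p_raw, passes))
```

Under these assumptions the consensus error falls rapidly as the number of passes grows, which is why a method with error-prone single passes can still deliver accurate reads when the same molecule is read repeatedly. Systematic errors that recur in every pass are not reduced by this kind of voting.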

References

Ahmadian, A. et al. (2006). Pyrosequencing: history, biochemistry and future. Clin. Chim. Acta 363(1-2), 83-94. doi.org/10.1016/j.cccn.2005.04.038.
Altschul, S. F. et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215(3), 403-410. doi.org/10.1016/S0022-2836(05)80360-2.
Athanasopoulou, K. et al. (2022). Third-Generation Sequencing: The Spearhead towards the Radical Transformation of Modern Genomics. Life 12(1), 30. doi.org/10.3390/life12010030.
Benson, D. et al. (1993). GenBank. Nucleic Acids Res. 21(13), 2963-2965. doi.org/10.1093/nar/21.13.2963.
Booth, M. J. (2022). DNA and RNA sequencing. In: Blackburn, M. G. et al. (eds.). Nucleic Acids in Chemistry and Biology. 4th Edition, The Royal Society of Chemistry, London.
Deamer, D. et al. (2019). Three decades of nanopore sequencing. Nat. Biotechnol. 34, 518-524. doi.org/10.1038/nbt.3424.
Heather, J. M., Chain, B. (2016). The sequence of sequencers: The history of sequencing DNA. Genomics 107(1), 1-8. doi.org/10.1016/j.ygeno.2015.11.003.
International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931-945. doi.org/10.1038/nature03001.
Li, W., Freudenberg, J. (2014). Mappability and read length. Front. Genet. 5, 381. doi.org/10.3389/fgene.2014.00381.
Liu, L. et al. (2012). Comparison of Next-Generation Sequencing Systems. J. Biomed. Biotechnol. 1, 251364. doi.org/10.1155/2012/251364.
Mathews, A. (2024). DNA Sequencing: A Brief History. In: Abdurakhmonov, I. Y. (ed.). DNA Sequencing – History, Present and Future. Intech Open, Rijeka. doi.org/10.5772/intechopen.1007844.
Maxam, A. M., Gilbert, W. (1977). A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74(2), 560-564. doi.org/10.1073/pnas.74.2.560.
Pinard, R. et al. (2006). Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 7, 216. doi.org/10.1186/1471-2164-7-216.
Rothberg, J. M. et al. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348-352. doi.org/10.1038/nature10242.
Sanger, F. et al. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74(12), 5463-5467. doi.org/10.1073/pnas.74.12.5463.
Shendure, J. et al. (2005). Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science 309(5741), 1728-1732.
Shendure, J. et al. (2017). DNA sequencing at 40: past, present and future. Nature 550, 345-353. doi.org/10.1038/nature24286.
Smith, T. F., Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195-197. doi.org/10.1016/0022-2836(81)90087-5.
Wang, Y. et al. (2021). Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348-1365. doi.org/10.1038/s41587-021-01108-x.
Watson, J., Crick, F. (1953). Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 171, 737-738. doi.org/10.1038/171737a0.