What is long read sequencing?

Long-read sequencing, also called third-generation sequencing, is a DNA sequencing technique, still under active research, that can determine the nucleotide sequence of long stretches of DNA, typically between 10,000 and 100,000 base pairs at a time. This removes the need to cut up and then amplify the DNA, as other DNA sequencing techniques normally require.

History of DNA sequencing

One of the most basic forms of DNA sequencing is Sanger sequencing. This method can sequence relatively small fragments of DNA, up to about 900 base pairs. The DNA is copied many times to produce fragments of varying lengths, each carrying a fluorescent tag on one end. Sorting these tagged fragments by length and reading the tags in order reveals the exact sequence of the original DNA.
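
As a rough illustration of this idea, the Python sketch below simulates one tagged fragment for every possible length, then recovers the sequence by sorting the fragments by length and reading each tag in turn. The function names and the example strand are invented for this sketch, and real Sanger chemistry and base calling are considerably more involved.

```python
import random


def sanger_fragments(strand, seed=0):
    """Toy model: one copied fragment per length, tagged with its final base."""
    fragments = [(length, strand[length - 1]) for length in range(1, len(strand) + 1)]
    # Fragments come out of the reaction in no particular order.
    random.Random(seed).shuffle(fragments)
    return fragments


def read_sequence(fragments):
    """Sort fragments by length and read the end tags in that order."""
    return "".join(tag for _, tag in sorted(fragments))


strand = "GATTACAGGC"  # hypothetical example strand
fragments = sanger_fragments(strand)
recovered = read_sequence(fragments)
print(recovered)
```

Because each fragment length occurs exactly once in this toy model, sorting by length is enough to put the tags back in their original order.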

The more modern forms of DNA sequencing are called next-generation sequencing. These techniques are faster and cheaper than Sanger sequencing, and can determine long DNA sequences much more efficiently. This is achieved through high-throughput analysis of many different DNA fragments at once.

These DNA fragments tend to range from 50-700 base pairs in length, but the techniques used can determine DNA sequences made up of millions of base pairs.

Long-read sequencing, sometimes also called third-generation sequencing, is a very recent DNA sequencing technique that can read much longer DNA fragments at a time. Reads normally range from 10,000 to 100,000 base pairs, but reads of 1-2 million base pairs have also been demonstrated.

How does long-read sequencing work?

Long-read sequencing has been described as solving a jigsaw puzzle with large pieces. The DNA fragments produced in this technique are easier to assemble into a complete DNA sequence than in other sequencing techniques.

There are two main technologies within scientific research which utilize long-read sequencing: Oxford Nanopore sequencing, and PacBio single-molecule real-time (SMRT) sequencing. These techniques implement different methodologies, but are both capable of sequencing long lengths of DNA.

Nanopore sequencing measures changes in ionic current as single-stranded DNA fragments are moved through nanopores, which are very small pore-forming proteins embedded within a membrane. Different DNA sequences produce different levels of resistance as they pass through these pores, so the exact nucleotide sequence can be determined.

SMRT sequencing works by detecting the different levels of fluorescence generated when a target DNA sequence is replicated with modified nucleotides. This occurs in a series of wells and is limited by the quality of the DNA polymerase in use.

Advantages of long-read sequencing

Long-read sequencing has several distinct advantages compared to next-generation sequencing technologies.

One of the major advantages is that long-read sequencing can much more accurately sequence DNA containing repeats, where the same section of DNA is repeated within the genome. Sanger sequencing and next-generation sequencing often struggle with these repeats when assembling their DNA fragments.
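
A minimal Python sketch can show why repeats cause trouble for short fragments. It assumes a made-up toy reference containing two copies of a repeat and uses exact string matching in place of a real aligner. The sequences and the `find_all` helper are hypothetical and exist only for this illustration.

```python
def find_all(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference: two copies of a repeat separated by unique sequence.
repeat = "ACGTACGTAC"
reference = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

short_read = repeat[:8]             # lies entirely inside the repeat
long_read = "GCA" + repeat + "GGA"  # spans the repeat plus unique flanks

short_hits = find_all(reference, short_read)
long_hits = find_all(reference, long_read)
print("short read fits at:", short_hits)
print("long read fits at:", long_hits)
```

In this toy case the short read fits equally well in both copies of the repeat, while the long read carries enough flanking sequence to fit in only one place.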

These repeats, or copy number variations, are much easier to detect with long-read sequencing, which matters clinically. For example, in Huntington’s disease, the copy number of the DNA sequence ‘CAG’ dictates whether a person is likely to develop the disease. Determining this copy number can have large implications for the diagnosis or prediction of genetic disease.
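
Once a single read spans the whole repeat region, counting the copies is simple. The Python sketch below assumes an error-free read and an invented flanking sequence. The `longest_repeat_run` helper is hypothetical, and real repeat-sizing tools also have to cope with sequencing errors and interrupted repeats.

```python
import re


def longest_repeat_run(sequence, unit="CAG"):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Made-up read: 21 consecutive CAG copies between arbitrary flanking bases.
read = "GGCCTTC" + "CAG" * 21 + "CAACAGCCGCC"
count = longest_repeat_run(read)
print(count)
```

A read shorter than the repeat region could only give a lower bound on the copy number, which is why read length matters here.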

This sequencing technology can also more accurately detect larger-scale mutations, where long sections of DNA are deleted or moved. These structural variants often have roles in genetic disorders but have not been extensively studied in the past due to the lack of technology available.

What has been achieved with long-read sequencing?

In 2018, Jain and other researchers from the University of California used long-read sequencing to accurately map the human Y chromosome centromere. The centromere is a very important section of every chromosome, with a vital role in cell division, and its dysregulation has been linked to cancer formation and to several genetic syndromes, such as Down’s Syndrome and Turner Syndrome.

Nanopore sequencing has been used to detect and identify pathogens within clinical environments in as little as 6 hours from when the samples were taken.

Nanopore sequencing was also used during the Ebola outbreak to rapidly and efficiently test blood samples for the presence of the virus. The equipment was flown into West Africa and used directly on-site to monitor the epidemic.

Sources

Heather, J. M., & Chain, B. (2016). The sequence of sequencers: The history of sequencing DNA. Genomics. https://doi.org/10.1016/j.ygeno.2015.11.003

PHG Foundation. Long read sequencing technologies. (2018). www.phgfoundation.org/.../long-read-sequencing-ready-for-implementation

Koren, S., & Phillippy, A. M. (2015). One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current Opinion in Microbiology. https://doi.org/10.1016/j.mib.2014.11.014

Amarasinghe, S. L., et al. (2020). Opportunities and challenges in long-read sequencing data analysis. Genome Biology. https://doi.org/10.1186/s13059-020-1935-5

Eid, J., et al. (2009). Real-time DNA sequencing from single polymerase molecules. Science. https://doi.org/10.1126/science.1162986

Jain, M., et al. (2016). The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biology. https://doi.org/10.1186/s13059-016-1103-0

What is long read sequencing?
October 2018
Emma Johnson
emma.johnson@phgfoundation.org
Sobia Raza
sobia.raza@phgfoundation.org

DNA sequencing, the process of reading part or all of the DNA of an organism, is helping to improve clinical care across different areas of medicine, from rare diseases and cancers to the management of infectious diseases.

Progress has been accelerated by the advancement of high-throughput next-generation sequencing (NGS) technologies, which are capable of reading the code of millions of small fragments of DNA in parallel. These have enabled faster sequencing with increased throughput, at falling costs. In recent years, new technologies capable of sequencing longer strands of DNA by reading single DNA molecules have advanced and become more prominent. This briefing explains what long-read sequencing (LRS) is, and how it differs from established short-read sequencing (SRS). The second, accompanying briefing, Long-Read Sequencing: Ready for the Clinic?, describes the potential of these technologies for diagnostic sequencing in a clinical setting and, in this context, the challenges of implementing the technology.

The essentials
Single-molecule, true long-read sequencers enable the production of reads that are considerably longer than those resulting from SRS. This has several inherent advantages.
LRS can sequence parts of the genome that cannot easily be sequenced by short-read sequencing. Longer reads are more likely to look distinct than shorter reads, allowing them to be assembled together with less ambiguity.

The two dominant producers of true long-read sequencing technologies are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (Nanopore).

What is long-read sequencing?
The genome of most organisms (including humans) is too long to be sequenced as one continuous string.
Using next-generation ‘short-read’ sequencing, DNA is broken into short fragments that are amplified (copied) and then sequenced to produce ‘reads’.
Bioinformatic techniques are then used to piece together the reads like a jigsaw, into a continuous genomic sequence.
True LRS technologies – sometimes referred to as third generation sequencers – directly sequence single molecules of DNA in real time, often without the need for amplification.
This direct sequencing approach enables the production of reads that are considerably longer than those resulting from SRS.
Other, ‘synthetic’ long-read sequencing approaches utilise modified sample processing and conventional SRS to computationally reconstruct long reads from shorter sequencing reads.
True LRS represents the greatest departure from widely used short-read systems.
Currently, the two dominant producers of ‘true’ long-read sequencing technologies are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (Nanopore).
Both have developed platforms for ‘real-time’ sequencing of nucleic acids (DNA and RNA) that is faster than current short-read technologies.

The benefits
There are several inherent benefits in using longer reads for the examination of genomic data; these can have advantages for clinical genome analysis.
• Genome assembly:
The human genome is over 3 billion DNA base pairs in length and contains many repetitive stretches of genetic code.
Like a complex jigsaw, reassembling the genome from short reads can be challenging, as many fragments look highly similar without additional context.
Long-read data can make this task simpler as the reads are more likely to look distinct, allowing them to be assembled together with less ambiguity and error.
Improvements in genome assembly are helping to close gaps in our knowledge of the genome and allow for a better understanding of the genetic causes of disease.
• Variant detection:
Some features of individual genomes are particularly difficult to detect and quantify with SRS technologies, for example:
large and complex rearrangements, large insertions or deletions of DNA, repetitive regions, highly polymorphic regions, or regions with low DNA nucleotide diversity.
Long reads can span across larger parts of these regions, so are able to detect more of these variants, which may be clinically relevant.
LRS may also enhance the ‘genome-wide’ detection of certain variants.
• Haplotype phasing:
In areas such as reproductive medicine it can be useful to know whether genetic variants exist on the same copy of the chromosome.
This can be determined using a process known as haplotype phasing.
Long reads are able to provide the long-range information for resolving haplotypes without the additional statistical inference, maternal/paternal sequencing, or sample preparation required for an approximation of phasing using SRS.
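
The haplotype phasing point above can be sketched in a few lines of Python. The example assumes two made-up chromosome copies that differ at two sites 40 bases apart, error-free reads, and known read start positions. All names, positions and sequences are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter


def phase_two_sites(reads, pos_a, pos_b):
    """Count allele pairs seen together on single reads covering both sites.

    `reads` is a list of (start, sequence) tuples, where `start` is the
    0-based reference position of the first base of the read.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs


def with_alleles(base, a, b, pos_a, pos_b):
    """Return `base` with allele `a` at pos_a and allele `b` at pos_b."""
    s = list(base)
    s[pos_a], s[pos_b] = a, b
    return "".join(s)


def sample_reads(haplotype, length, step):
    """Cut evenly spaced reads of a fixed length from one chromosome copy."""
    return [(i, haplotype[i:i + length])
            for i in range(0, len(haplotype) - length + 1, step)]


base = "ACGT" * 15                            # 60-base toy reference
hap1 = with_alleles(base, "A", "G", 10, 50)   # first chromosome copy
hap2 = with_alleles(base, "T", "C", 10, 50)   # second chromosome copy

short_reads = sample_reads(hap1, 20, 5) + sample_reads(hap2, 20, 5)
long_reads = sample_reads(hap1, 50, 5) + sample_reads(hap2, 50, 5)

short_pairs = phase_two_sites(short_reads, 10, 50)
long_pairs = phase_two_sites(long_reads, 10, 50)
print("short reads:", dict(short_pairs))
print("long reads:", dict(long_pairs))
```

No 20-base read can cover both sites, so the short reads say nothing about which alleles travel together. The 50-base reads that span both sites show directly that A pairs with G on one copy and T pairs with C on the other.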
Beyond producing long reads, true LRS technologies have other features that present new opportunities.
Amongst these are:
• Portability:
In contrast to other sequencing platforms, Nanopore’s devices rely on detecting electronic rather than optical signals.
This allows them to design devices as small as a memory (USB) stick, making them highly portable.
Many other sequencers, including the vast majority of SRS systems, are large desktop or free-standing machines.
Nanopore’s MinION device has been used to sequence samples in the field during the Ebola and Zika virus outbreaks and has even been used in space.
• Real-time sequencing and speed:
Compared to the fixed run times of SRS systems, both PacBio and Oxford Nanopore offer faster sequencing runs.
PacBio provides options for rapid sequencing that can be completed in under 24 hours, from sample preparation to analysis.
Nanopore technologies permit real-time analyses and allow experimental run time to be determined by the user, giving the user the ability to track data collection and begin analyses as desired.
This provides additional flexibility and speed, and removes the need for batch sequencing of multiple samples which is currently required for cost-effective SRS.
It is particularly useful when examining small genomes (such as those of many pathogens) or specific genomic regions.
• Other ‘omics:
Long-read technologies have been used to directly sequence RNA.
They may also allow simultaneous detection of epigenetic modifications (chemical modifications to DNA/RNA that affect how genes are expressed), although additional bioinformatic interpretation is required.
Separate sequencing runs need to be performed to retrieve this information using current SRS systems.

Conclusion
The inherent benefits of utilising longer reads for genome reconstruction and analysis, alongside the additional potential advantages true LRS systems present for genome analysis, could be beneficial for the diagnosis of several diseases and disorders.
However, LRS systems also present their own challenges and come with some limitations; these, and the potential of LRS systems for use in clinical sequencing, are discussed in the accompanying briefing.
