Algorithms for structural variation discovery using hybrid sequencing technologies and library preparation protocols.

Hibrit dizileme teknolojileri ve kütüphane hazırlama protokolleri kullanarak yapısal varyasyonların bulunması için algoritmalar.

Scientific and Technical Research Council of Turkey (TÜBİTAK-1001-215E172), 2016-2018

Abstract

Genomic structural variation (SV) is defined by the 1000 Genomes Project as variation that affects more than 50 base pairs. SVs take different forms, including deletions, insertions, inversions, translocations, retrotranspositions, and interspersed or tandem duplications. Although there are far fewer SVs than single nucleotide polymorphisms (SNPs) (3.5 million SNPs vs. 10-15 thousand SVs), the total number of base pairs affected by SVs is substantially higher (3.5 Mbp for SNPs vs. 15-20 Mbp for SVs).

The widespread occurrence of SVs in non-cancer genomes was first shown in 2004 by Iafrate et al. It was later understood that SVs also cause several complex diseases such as Crohn’s disease, schizophrenia, and autism. Array comparative genomic hybridization (array CGH) was the dominant technology for discovering copy number variation (CNV) in particular; however, high-throughput sequencing (HTS) became more popular for such studies after its introduction in 2007. Still, as demonstrated in the 1000 Genomes Project, HTS platforms produce either short reads (Illumina, Complete Genomics, Ion Torrent, SOLiD) or reads with high error rates (Pacific Biosciences, Oxford Nanopore). As a result, although CNV discovery is relatively successful, reliable algorithms for characterizing complex SVs such as inversions, translocations, and novel sequence insertions are still lacking. Such complex variation usually occurs in highly repetitive regions of the genome, which makes HTS reads harder to align. This negatively affects our ability to understand the genetic causes of several complex diseases and therefore limits progress on the missing heritability problem.

Although all sequencing technologies have problems in read length, base calling accuracy, or error profiles, a weakness of one technology may be a strength of another. For example, Illumina reads are short but highly accurate (>99.9%), whereas Pacific Biosciences produces long reads with a high error rate (>15%). In addition, new library preparation techniques that are independent of the sequencing technology have recently been developed, such as Illumina TSLR, 10X Genomics, Dovetail Genomics, and pooled clone sequencing. These methods make it possible to obtain long-range contiguity information without changing the sequencing technology itself.

In this project, we propose to use different sequencing techniques and library preparation protocols in an integrated fashion to reliably characterize structural variation. This will allow us to combine the complementary strengths of the different technologies and correct for their biases. The resulting algorithms will enable better characterization of complex structural variation such as inversions and translocations, and help solve the missing heritability problem.