Computer Scientists Develop Program to Find ‘Low-Frequency’ Variants in Sequence Data

An illustration shows what differentiates intrahost single nucleotide variants (iSNVs), which arise within a single host, from single nucleotide polymorphisms that spread from host to host. Computer scientists at Rice University introduced Variabel, which uses sequencing data to identify low-frequency within-host variants of SARS-CoV-2 from public datasets. Credit: Treangen Lab

Details about the variants hidden in the deluge of SARS-CoV-2 genetic sequences would be good to know, if only researchers could access them.

A new program developed at Rice University’s George R. Brown School of Engineering will make that possible, at least for “within-host variants,” those that appear in genomic data from the same COVID-19-positive person.

A Rice team led by computer scientist Todd Treangen and graduate student Yunxi Liu developed Variabel, which accurately identifies “low-frequency variants” of the virus that causes COVID-19.

Finding these clues could be key to identifying potentially devastating variants before they have a chance to spread, Treangen said.

The data is freely available, but there is a lot of it. The research enables the extraction of low-frequency variants from approximately half a million SARS-CoV-2 genomes sequenced on devices from Oxford Nanopore Technologies (ONT), which offers an affordable platform for rapid sequencing of long single molecules of DNA or RNA.

“Variabel directly enables the use of affordable nanopore sequencing technology for the identification of within-host variation after viral infection,” said Treangen, whose work focused on infectious disease surveillance long before the COVID-19 pandemic.

The lab had similar success testing Variabel on sequence data from patients infected with Ebola and norovirus.

The open-source program, detailed in Nature Communications, is available for download.

The researchers say the key to Variabel is its ability to distinguish true variants from sequencing errors in the ONT process.
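To make the term "low-frequency variant" concrete, here is a minimal Python sketch that computes how often each base appears among the reads covering one genome position and lists the non-reference bases that fall inside a frequency window. It is not Variabel's method: the function names, the toy read counts and the 2% and 50% cutoffs are all assumptions chosen for illustration. A plain cutoff like this cannot tell a real variant from a sequencing error that occurs at a similar rate, which is the problem the researchers say Variabel addresses.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return {base: fraction of reads} for the reads covering one position."""
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return {}  # no reads cover this position
    return {base: n / depth for base, n in counts.items()}


def low_frequency_candidates(bases, reference, min_freq=0.02, max_freq=0.5):
    """List non-reference bases with min_freq <= frequency < max_freq.

    The cutoffs are illustrative assumptions, not values from the paper.
    """
    freqs = allele_frequencies(bases)
    return {
        base: freq
        for base, freq in freqs.items()
        if base != reference and min_freq <= freq < max_freq
    }


# Toy data: 1,000 reads at one position whose reference base is "A".
pileup = "A" * 950 + "G" * 40 + "T" * 10
print(low_frequency_candidates(pileup, reference="A"))  # {'G': 0.04}
```

In this toy case "G" is reported at 4%, while "T" at 1% falls below the assumed floor and is treated as noise.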

To validate Variabel, they compared data collected over time from single positive patients as well as sequences from between-patient datasets, produced by ONT and another sequencing technique, Illumina. Over time, a single patient can harbor up to a billion copies of a virus.

By comparing the results before and after applying Variabel to the data, they found that the program was able to correct the vast majority of sequencing errors.

“Variabel opens the door to portable, affordable, and rapid characterization of within-host variation, which could ultimately aid in the discovery of future mutations specific to variants of concern,” said Treangen, whose lab, with Rice’s Ken Kennedy Institute, hosted a symposium on March 11 to discuss scientific advances spurred by the pandemic.

The paper’s co-authors are Rice undergraduate Joshua Kearney and software engineer Bryce Kille, along with Baylor College of Medicine postdoctoral fellow Medhat Mahmoud and Fritz Sedlazeck, an associate professor at the Human Genome Sequencing Center. Treangen is an assistant professor of computer science.

More information:
Yunxi Liu et al, Rescuing low-frequency variants within intra-host viral populations directly from Oxford Nanopore sequencing data, Nature Communications (2022). DOI: 10.1038/s41467-022-28852-1

Provided by Rice University

Citation: Computer Scientists Develop Program to Find ‘Low-Frequency’ Variants in Sequence Data (March 14, 2022), retrieved March 14, 2022 from variants-sequence.html
