Source: Science. ‘Google for DNA’ indexes 10% of world’s known genetic sequences

‘Google for DNA’ indexes 10% of world’s known genetic sequences

Achievement demonstrates feasibility of making all of life’s code easily searchable, researchers say



A version of this story appeared in Science, Vol 384, Issue 6700.

A tool that functions like a Google for DNA has demonstrated its promise for making all of the world’s biological sequence data cheaply and easily searchable, according to the Swiss team that developed it. In a proof-of-principle study, the researchers say they successfully indexed 10% of the world’s known DNA, RNA, and protein sequences, and the same method could be used to do the rest.

The advance, posted last month on bioRxiv, used a computational tool the group recently developed called MetaGraph to organize and compress publicly available sequence data into a searchable format—much as internet search engines do for web pages and their content. The resulting indexes, available for download and via a web portal, allow users to scan sequences comprising trillions of base pairs and billions of amino acids.

The research “represents a massive achievement and a landmark in our ongoing pursuit of the grand challenge of indexing all publicly available sequencing data,” says Rob Patro, a computational biologist at the University of Maryland who wasn’t involved in the pilot effort. Such a resource could aid myriad areas of research, from identifying novel viruses to revealing disease-associated RNA sequences. Although MetaGraph isn’t the only project aiming for this goal, the team has created some of the largest indexes so far and calculates that its tool will be relatively inexpensive to use.


The need is pressing, Patro and others note. Repositories storing DNA, RNA, and protein sequence data are expanding exponentially. The Sequence Read Archive (SRA), a genetic database run by the National Institutes of Health’s National Center for Biotechnology Information (NCBI) and collaborators, already contains more than 50 thousand trillion base pairs (50 petabases) from organisms including humans and other animals, plants, and bacteria.

Current bioinformatics tools can’t scan this much data all at once, especially for sequences that haven’t yet been assembled into genomes. Researchers have to narrow down the sequence collections before they can search them. Several groups hope to solve this problem by compressing sequences from larger databases into a more organized data structure, or index, designed for easy searching in downloadable files or online portals.

In 2020, bioinformatician André Kahles and computer scientist Gunnar Rätsch, both at ETH Zürich, and their colleagues presented an early version of MetaGraph. The team used its tool, in which mathematical structures known as de Bruijn graphs represent overlaps between sequences, to index more than 1 million records from the SRA, totaling about 3 petabases. They have already employed MetaGraph in projects including identifying the microbial makeup of different cities.
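To give a rough sense of how a de Bruijn graph captures overlaps between sequences, the Python sketch below breaks each sequence into short fixed-length words (k-mers), records which sample each word came from, and treats two words as connected when they overlap by all but one letter. It is a toy illustration only, not MetaGraph's actual implementation, which relies on heavily compressed data structures; the function names, sample labels, and sequences are invented for this example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample labels that contain it."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def successors(index, kmer):
    """de Bruijn edges: indexed k-mers that overlap kmer by k-1 letters."""
    suffix = kmer[1:]
    return sorted(suffix + base for base in "ACGT" if suffix + base in index)


def query(index, seq, k):
    """Return, per sample, the fraction of the query's k-mers found in it."""
    query_kmers = list(kmers(seq, k))
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}


# Hypothetical toy data: two short "samples" indexed with k = 4.
K = 4
samples = {"sample_A": "ACGTACGGA", "sample_B": "TTACGGATC"}
index = build_index(samples, K)

print(successors(index, "TACG"))   # k-mers reachable from TACG in the graph
print(query(index, "TACGGA", K))   # which samples contain the query's k-mers
```

Because every sample shares one pool of k-mers, a word that occurs in many samples is stored only once, which hints at why this kind of structure lends itself to compression. A real index must also handle details this sketch ignores, such as reverse-complement strands and sequencing errors.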


Now, the team has an improved version of MetaGraph, and has harnessed it to index 5 petabases from the SRA and other databases, comprising sequences from microbes, fungi, plants, humans, and the human gut microbiome. Some indexes in the new paper reduce tens of terabases of data into about 10 gigabytes, small enough to work with on a personal computer. Although building the initial indexes is expensive (hundreds of thousands of dollars for all the SRA, the researchers say), users can query the data sets much more cheaply than with existing techniques.

The work is “hugely exciting,” says Lesley Hoyles, a bioinformatician and microbiologist at Nottingham Trent University. With data repositories ballooning in size, “anything that can reduce the compute storage and energy costs … is a massive plus for researchers worldwide.” Such approaches could lessen barriers to genomic research for scientists in low- and middle-income countries, she adds. “Work could easily be done on cheap laptops.”

Other groups are also making progress. Last year, the Pasteur Institute won €2 million from the European Research Council to launch its IndexThePlanet project to catalog all data in the SRA. And researchers at NCBI are working on their own indexing tool, called Pebblescout. “It’s a very, very active field at the moment,” says Zamin Iqbal, a computational biologist at the University of Bath who worked on AllTheBacteria, a project that assembled bacterial sequence data to make them more easily searchable.

Patro suggests that because of MetaGraph’s index sizes, it could be slower than other tools on some particularly large tasks, such as looking up millions of sequences from a sample simultaneously. It’s also not yet clear how best to update the indexes with new sequence data, he adds. There’s also the challenge of funding the project, as well as all the computational costs that accompany it. Indeed, whether the tool ends up being widely adopted will partly depend on “addressing the social and administrative questions of how such a substantial resource should be hosted, updated, and maintained,” Patro says, adding that it seems “infeasible (and unfair) to expect an individual research group” to take on this enormous task.

Kahles and Rätsch agree, saying they hope the work will inspire other groups, and larger organizations such as NCBI or the SRA, to pick up the project and help index the remaining 90% of sequence data for use by researchers. “We show them here: ‘It’s possible—please do it,’” Rätsch says.


doi: 10.1126/science.zam1hsh
