We are currently witnessing a deep knowledge revolution driven by exponentially expanding sequence databases, made possible by the continuously accelerating throughput of sequencing technologies. This trend is exemplified by the Earth BioGenome Project, presented at the World Economic Forum in Davos in 2018, which aims to “use genomics to help discover the remaining 80 to 90 percent of species that are currently hidden from science”.
Sequencing data are accumulating faster than Moore’s law, bringing fundamental new biological insights, hypotheses and understanding, with impacts on medicine, agronomy and ecology. The main objective so far has been to assemble new genomes in order to compare specific organisms to representative reference species, highlighting genomic variations that reveal genetic properties correlated with ecological, agronomical or clinical markers. Today, the International Nucleotide Sequence Database Collaboration (INSDC) Sequence Read Archive (SRA) stores over 10,000 petabases (Pb) of nucleotides in the form of short sequences (<1,000 bp), which represent fragments from generally unknown genomic locations (randomly sampled “reads” from shotgun sequencing projects). However, the overwhelming majority of these sequences have only been analysed within the context of a single project, each addressing only a small fraction of the total resource.
It is therefore of primary importance to preserve this diversity of data for future meta-analyses and to develop technologies to interrogate data across project boundaries. Access to entire data collections, as opposed to a single read set or a limited number of them, would give researchers unparalleled opportunities to make novel discoveries. Unfortunately, raw sequences stored in genomic data banks such as the SRA are not indexed and therefore cannot be queried efficiently, apart from direct accession lookups. These data sets are often never revisited because of the huge overhead involved in manipulating such voluminous data. Today, it would be unthinkable to browse the Internet without powerful search engines; yet this is precisely the current situation for raw read archives, where precious data sleep undisturbed in rarely opened drawers.
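To give a sense of scale, the following back-of-envelope sketch illustrates why brute-force scanning is not an option and why an index is indispensable. The 10,000 petabase figure comes from the paragraph above; the 500 MB/s sequential read throughput and the one-byte-per-base encoding are assumed, illustrative values.

```python
# Back-of-envelope estimate: time needed to scan the whole SRA for one query.
# The 10,000 petabase total comes from the text; throughput and encoding are
# assumptions chosen only for illustration.

TOTAL_BASES = 10_000 * 10**15      # ~10,000 petabases stored in the SRA
BYTES_PER_BASE = 1                 # optimistic: one uncompressed byte per nucleotide
THROUGHPUT_BPS = 500 * 10**6       # assumed 500 MB/s sustained sequential read

seconds = TOTAL_BASES * BYTES_PER_BASE / THROUGHPUT_BPS
years = seconds / (3600 * 24 * 365)
print(f"Single-stream scan: ~{years:,.0f} years")   # on the order of several hundred years
```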
The central objective of the SeqDigger project is to provide an ultra-fast and user-friendly search engine that compares a query sequence, typically a read or a gene (or a small set of such sequences), against the exhaustive set of all available data from one or several large-scale metagenomic sequencing projects, such as the New York City metagenome, the Human Microbiome Projects (HMP and MetaHIT), the Tara Oceans project, or the Airborne Environment project. This would be the first comprehensive tool of its kind, and it would strongly benefit the scientific community, from environmental genomics to biomedicine.
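As a purely illustrative sketch of the kind of query such an engine would answer, the snippet below indexes each read set by its k-mer content and reports the data sets sharing most of a query’s k-mers. The plain Python sets, the toy dataset names and the 80% threshold are assumptions made for this example; nothing here describes SeqDigger’s actual design, and a real implementation at archive scale would rely on compact probabilistic membership structures rather than exact sets.

```python
# Illustrative sketch only: plain sets stand in for the compact k-mer indexes
# a real engine would need at archive scale. Dataset names and API are hypothetical.

K = 21  # typical k-mer length for read-level indexing

def kmers(seq: str, k: int = K):
    """Yield all overlapping k-mers of a DNA sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(datasets: dict[str, list[str]]) -> dict[str, set[str]]:
    """Index each read set by its k-mer content (one set per dataset)."""
    return {name: {km for read in reads for km in kmers(read)}
            for name, reads in datasets.items()}

def query(index: dict[str, set[str]], sequence: str, min_fraction: float = 0.8):
    """Report the datasets sharing at least `min_fraction` of the query's k-mers."""
    qkmers = set(kmers(sequence))
    hits = {}
    for name, kmer_set in index.items():
        shared = len(qkmers & kmer_set) / len(qkmers)
        if shared >= min_fraction:
            hits[name] = shared
    return hits

# Toy usage: which (hypothetical) read sets contain this query sequence?
index = build_index({
    "tara_oceans_sample": ["ACGTACGTTAGGCTAGCTAGGATCGATCGTACG"],
    "hmp_gut_sample":     ["TTGACCGTAGGCATCGGATCCTAGGCTTAACGG"],
})
print(query(index, "ACGTACGTTAGGCTAGCTAGGATCGATCG"))
```

Swapping each exact set for a space-efficient approximate membership structure is the standard way such k-mer indexes are made to fit data collections of this size.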