The genetic repertoire of the deep sea: from sequence to structure and function

The deep sea, the largest and perhaps most hostile environment on Earth, remains underexplored, especially with regard to its genetic repertoire. Yet previous work has revealed significant habitat-specific deep-sea biodiversity. Here, we present an integrated deep-sea genetic dataset comprising 502 million nonredundant genes from 2,138 samples and 2.4 million predicted structures, revealing unprecedented microbial genetic diversity. Global sequence analysis combined with biophysical and biochemical measurements allowed us to link specific protein structures with genetic variants required for life in the deep sea and to advance biotechnology. Furthermore, estimating substitution rates revealed that genes involved in replication, recombination and repair appear to be critical for microbial life in the deep sea. Among them was a structurally unique helicase that enabled ultra-rapid nanopore sequencing (390±11 bp/s). Thus, our work not only deciphers the ecological drivers and evolutionary forces underpinning deep-sea genetic diversity but also bridges genetic knowledge with biotechnology.

Readme

SeqFinder

In this module, users can search gene or protein sequences using the Diamond software. After uploading query sequences, select the "deepsea" database to perform the search. Advanced query options allow users to filter results by sequence identity, E-value, and bit-score, and to set the search sensitivity. Once a job is submitted, users can monitor its status (e.g., Waiting, Running, Success, Error) and apply filters to manage previous tasks. Please note that all results are automatically deleted after one week.

To download sequence information, click the ID in the target column on the results page. Alternatively, select multiple entries and click "Export table" to export an alignment file. The exported file contains download links, which can be passed to the wget command for batch retrieval of sequence data, for example: wget -O xxx.fa "https://xxxx" (note: each link must be enclosed in double quotes).
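For many entries, the same retrieval can be scripted instead of typing one wget command per link. The following is a minimal sketch, not an official client: it assumes the exported table is a text file in which each download link is an http(s) URL separated from other fields by whitespace or quotes. The function names, the output file naming scheme (entry_0001.fa, ...), and the export file name in the usage note are hypothetical.

```python
import re
import urllib.request
from pathlib import Path

# Assumption: links in the exported table are plain http(s) URLs that end at
# whitespace or a quote character. Adjust the pattern if the export differs.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(table_text):
    """Return every http(s) link in the table text, in order, without duplicates."""
    links = []
    for url in URL_PATTERN.findall(table_text):
        if url not in links:
            links.append(url)
    return links


def download_all(links, out_dir, suffix=".fa"):
    """Download each link into out_dir as entry_0001<suffix>, entry_0002<suffix>, ...

    The naming scheme is illustrative only; the real target IDs are not
    parsed here because the layout of the export is not assumed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(links, start=1):
        target = out / f"entry_{index:04d}{suffix}"
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

A typical call would be download_all(extract_links(Path("export.tsv").read_text()), "sequences"), where export.tsv stands in for whatever name the exported file was saved under. Passing suffix=".pdb" gives the equivalent for structure links.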


StructFinder

In this module, protein structure analysis is powered by Foldseek. The workflow involves: 1) uploading a query structure, 2) selecting one of the "deepsea" database tiers (High, Good, or Low) for the search, and 3) choosing a search mode. Two modes are available: 3Di+AA Gotoh-Smith-Waterman, for fast local alignment, and TMalign, for more comprehensive global alignment at the cost of longer run time. Task monitoring and result retention work the same way as in SeqFinder: job status can be tracked after submission, and all results are automatically deleted after one week.

On the results page, click "view result" to explore structural details. To download structure information, click the ID in the target column. Alternatively, select multiple entries and click "Export table" to export an alignment file. The exported file contains download links, which can be passed to the wget command for batch retrieval of structure data, for example: wget -O xxx.pdb "https://xxxx" (note: each link must be enclosed in double quotes).


Tips for Searching Remotely Related Sequences

To find remotely related sequences, first run a structure search in StructFinder, then convert the protein structures it identifies to FASTA format with the pdb2fasta tool, and finally use the resulting sequences as queries in SeqFinder.
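If the pdb2fasta tool is not at hand, the conversion step can be approximated in a few lines. The sketch below is an illustrative stand-in, not the pdb2fasta tool itself: it reads the fixed-column ATOM records of a PDB file, keeps one residue per (chain, residue number, insertion code), and writes one FASTA record per chain. It assumes standard amino acids only; HETATM records are ignored, any unrecognized residue name is written as X, and only the first model is read. The function name pdb_to_fasta and the header format are hypothetical.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_to_fasta(pdb_text, name="query"):
    """Build FASTA text (one record per chain) from the ATOM records of a PDB file."""
    chains = {}   # chain ID -> one-letter residues in file order
    seen = set()  # (chain ID, residue number + insertion code) already recorded
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # keep only the first model of multi-model files
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        chain_id = line[21]                    # PDB column 22
        residue_key = (chain_id, line[22:27])  # PDB columns 23-27
        if residue_key in seen:
            continue  # another atom of a residue that was already counted
        seen.add(residue_key)
        residue_name = line[17:20].strip()     # PDB columns 18-20
        chains.setdefault(chain_id, []).append(THREE_TO_ONE.get(residue_name, "X"))
    records = []
    for chain_id, residues in chains.items():
        label = chain_id.strip() or "A"  # blank chain IDs are labelled A
        records.append(f">{name}_{label}\n{''.join(residues)}\n")
    return "".join(records)
```

The returned text can be saved to a .fa file and uploaded to SeqFinder as the query. Because the sequence is taken from the residues actually present in the coordinates, regions missing from the structure are absent from the FASTA output as well.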

How to cite

Guo, Yang, et al. The genetic repertoire of deep-sea microbiome: From sequence to structure and function. Cell Host & Microbe (2026) DOI: 10.1016/j.chom.2026.05.009