nature biotechnology

Brief Communication https://doi.org/10.1038/s41587-023-01773-0

Fast and accurate protein structure search with Foldseek

Michel van Kempen1,6, Stephanie S. Kim2,6, Charlotte Tumescheit2, Milot Mirdita1,2, Jeongjae Lee2, Cameron L. M. Gilchrist2, Johannes Söding1,3 & Martin Steinegger2,4,5

Received: 17 February 2022
Accepted: 30 March 2023
Published online: 8 May 2023

As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.

The recent developments in in silico protein structure prediction at near-experimental quality1,2 are advancing structural biology and bioinformatics. The European Bioinformatics Institute already holds over 214 million structures predicted by AlphaFold2 (ref. 3), and the ESM Atlas contains over 617 million metagenomic structures predicted by ESMFold4. The scale of these databases poses challenges to state-of-the-art analysis methods.

The most widely used approach to protein annotation and analysis is based on sequence similarity search5–8. The goal is to find homologous sequences from which properties of the query sequence can be inferred, such as molecular and cellular functions and structure. Despite the success of sequence-based homology inference, many proteins cannot be annotated because detecting distant evolutionary relationships from sequences alone remains challenging9.

Detecting similarity between protein structures by three-dimensional (3D) superposition offers higher sensitivity for identifying homologous proteins10. The availability of high-quality structures for any protein of interest allows us to use structure comparison to improve homology inference and structural, functional and evolutionary analyses. However, despite decades of effort to improve speed and sensitivity of structural aligners, current tools are much too slow to cope with today’s scale of structure databases.

Searching with a single query structure through a database with 100 million protein structures would take the popular TM-align11 tool a month on one CPU core, and an all-versus-all comparison would take 10 millennia on a 1,000-core cluster. Sequence searching is four to five orders of magnitude faster: an all-versus-all comparison of 100 million sequences would take MMseqs2 (ref. 6) only around a week on the same cluster.

Structural alignment tools (reviewed in ref. 12) are slower for two reasons. First, whereas sequence search tools employ fast and sensitive prefilter algorithms to gain orders of magnitude in speed, no similar prefilters exist for structure alignment. Second, structural similarity scores are non-local: changing the alignment in one part affects the similarity in all other parts. Most structural aligners, such as the popular TM-align, Dali and CE11,13,14, solve the alignment optimization problem by iterative or stochastic optimization.

To increase speed, a crucial idea is to describe the amino acid backbone of proteins as sequences over a structural alphabet and compare structures using sequence alignments15. Structural alphabets thus reduce structure comparisons to much faster sequence alignments. Many ways to discretize the local amino acid backbone have been proposed16. Most, such as CLE, 3D-BLAST and Protein Blocks, discretize the conformations of short stretches of usually 3–5 Cα atoms17–19.

For Foldseek, we developed a type of structural alphabet that does not describe the backbone but, rather, tertiary interactions. The 20 states of the 3D interaction (3Di) alphabet describe for each residue i the geometric conformation with its spatially closest residue j. 3Di has three key advantages over traditional backbone structural alphabets. (1) Weaker dependency between consecutive letters and (2) more evenly distributed state frequencies, both enhancing information density and reducing false positives (FPs) (Supplementary Table 1). (3) The highest information density is encoded in conserved protein

1Quantitative and Computational Biology Group, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany. 2School of Biological Sciences, Seoul National University, Seoul, South Korea. 3Campus Institute Data Science (CIDAS), Göttingen, Germany. 4Artificial Intelligence Institute, Seoul National University, Seoul, South Korea. 5Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea. 6These authors contributed equally: Michel van Kempen, Stephanie S. Kim. e-mail: soeding@mpinat.mpg.de; martin.steinegger@snu.ac.kr

Nature Biotechnology | Volume 42 | February 2024 | 243–246 243



[Figure 1. Panel a, query and target search pipeline: (1) discretize structure to sequence and prefilter (k-mer double match on diagonal, ungapped alignment); (2) structural alignment (gapped alignment). Panel b, 3Di alphabet: (1) find neighboring residues using virtual center; (2) extract features; (3) search 3Di state library; (4) conversion to 3Di sequence (discretization) or prediction of features (training).]

Fig. 1 | Foldseek workflow. a, Foldseek searches a set of query structures through a set of target structures. (1) Query and target structures are discretized into 3Di sequences (see b). To detect candidate structures, we apply the fast and sensitive k-mer and ungapped alignment prefilter of MMseqs2 to the 3Di sequences, (2) followed by vectorized Smith–Waterman local alignment combining 3Di and amino acid substitution scores. Alternatively, a global alignment is computed with a 1.7-times accelerated TM-align version (Supplementary Fig. 12). b, Learning the 3Di alphabet. (1) 3Di states describe tertiary interaction between a residue i and its nearest neighbor j. Nearest neighbors have the closest virtual center distance (yellow). Virtual center positions (Supplementary Fig. 1) were optimized for maximum search sensitivity. (2) To describe the interaction geometry of residues i and j, we extract seven angles, the Euclidean Cα distance and two sequence distance features from the six Cα coordinates of the two backbone fragments (blue and red). (3) These 10 features are used to define 20 3Di states by training a VQ-VAE28 modified to learn states that are maximally evolutionarily conserved. For structure searches, the encoder predicts the best-matching 3Di state for each residue.
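
The nearest-neighbor step of panel b(1) can be sketched as follows. This is a minimal illustration and not Foldseek's implementation: it assumes that one virtual center per residue has already been computed (the caption only states that their positions were optimized), and the function name `nearest_neighbors` and the toy coordinates are hypothetical.

```python
import math

def nearest_neighbors(virtual_centers):
    """For each residue i, return the index j != i with the closest virtual center.

    virtual_centers: list of (x, y, z) tuples, one per residue (assumed precomputed).
    """
    n = len(virtual_centers)
    if n < 2:
        raise ValueError("need at least two residues")
    neighbors = []
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue  # a residue is never its own neighbor
            d = math.dist(virtual_centers[i], virtual_centers[j])
            if d < best_d:
                best_j, best_d = j, d
        neighbors.append(best_j)
    return neighbors

# Toy example with four made-up virtual centers (coordinates in Å).
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
print(nearest_neighbors(centers))  # [1, 0, 3, 2]
```

The brute-force double loop is quadratic in the number of residues; it is chosen here for readability only.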

cores and the lowest in non-conserved coil/loop regions, whereas the opposite is true for backbone structural alphabets.

Foldseek (https://foldseek.com/) (Fig. 1a) (1) discretizes the query structures into sequences over the 3Di alphabet and then uses a pre-trained 3Di substitution matrix (Supplementary Table 2) to search through the 3Di sequences of the target structures using the double-diagonal k-mer-based prefilter and gapless alignment prefilter modules from MMseqs2, our open-source sequence search software6. (2) High-scoring hits are aligned locally using 3Di (default) or globally with TM-align (Foldseek-TM). The local alignment stage combines 3Di and amino acid substitution scores. The construction of the 3Di alphabet is summarized in Fig. 1b and Supplementary Figs. 1–3.

To reduce high-scoring FPs and provide reliable E values, we subtracted the reversed query alignment score from the original score and applied a compositional bias correction within a local 40-residue sequence window (see the ‘Pairwise local structural alignments’ subsection). E values are calculated using an extreme-value score distribution, with parameters predicted by a neural network based on 3Di sequence composition and query length (see the ‘E values’ subsection). Ranking of hits is determined by alignment bit score multiplied by the geometric mean of alignment TM-score and local distance difference test (LDDT). Foldseek also reports the probability for each match to be homologous, based on a fit of true and false matches on SCOPe.

We measured the sensitivity and speed of Foldseek, six protein structure alignment tools, an alignment-free structure search tool (Geometricus20) and a sequence search tool (MMseqs2 (ref. 6)) on the SCOPe dataset of manually classified single-domain structures21. Clustering SCOPe 2.01 at 40% sequence identity yielded 11,211 non-redundant protein sequences (SCOPe40). We performed an all-versus-all search and compared the tools’ performance for finding members of the same SCOPe family, superfamily and fold (true-positive (TP) matches) by measuring for each query the fraction of TPs out of all possible correct matches until the first FP, a match to a different fold (see the ‘SCOPe benchmark’ subsection).

We first measured the sensitivity to detect relationships at family and superfamily level by the area under the curve (AUC) of the cumulative receiver operating characteristic (ROC) curve up to the first FP (Fig. 2a and Supplementary Fig. 4). Foldseek’s sensitivity is below Dali and TM-align, higher than the structural aligner CE and much above the structural alphabet-based search tools 3D-BLAST and CLE-SW (Fig. 2a). In a precision-recall analysis, Foldseek-TM and Foldseek have the highest and third-highest area under the precision-recall curve on each of the three levels (Fig. 2b and Supplementary Fig. 4). Notably, Foldseek-TM improves over TM-align because its prefilter suppresses high-scoring FPs. Both sort hits by the average query and target length normalized TM-scores for best performance in the SCOPe benchmark.

Foldseek’s performance is similar across all six secondary structure classes in SCOPe (Supplementary Fig. 5). On this small SCOPe40 benchmark set, Foldseek is more than 4,000 times faster than TM-align and Dali and over 21,000 times faster than CE (Fig. 2c). On the much larger AlphaFoldDB (version 1), where Foldseek approaches its full speed, it is around 184,600 and 23,000 times faster than Dali and TM-align, respectively (see below).

We devised a reference-free benchmark to assess search sensitivity and alignment quality of structural aligners (Fig. 2d) on a realistic set of full-length, multi-domain proteins. We clustered the AlphaFoldDB (version 1) to 34,270 structures using BLAST and SPICi22. We randomly selected 100 query structures from this set and aligned them against the remaining structures. TP matches are those with an LDDT score23 of at least 0.6 and FPs below 0.25, ignoring matches in between. We set the LDDT thresholds according to the median inter-fold and intra-fold superfamily and family LDDT scores of SCOPe40 alignments (Supplementary Fig. 6). For other thresholds, see Supplementary Fig. 7. A domain-based sensitivity assessment would require a reference-based prediction of domains. To avoid it, we evaluated the sensitivity per residue. Figure 2d shows the distribution of the fraction of query residues that were part of alignments with at least x TP targets with better scores than the first FP match. Again, Foldseek has similar sensitivity to Dali, CE and TM-align and much higher sensitivity than CLE-SW and MMseqs2.

We analyzed the quality of alignments produced by the top five matches per query. We computed the alignment sensitivity as the number of TP residues divided by the query length and the precision as the number of TP residues divided by the alignment length. TP residues are those with residue-specific LDDT score above 0.6; FP residues are below 0.25; and residues with other scores are ignored. Figure 2e shows the average sensitivity versus precision of the 100 × 5 structure alignments.
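
The per-query sensitivity measure used in the SCOPe benchmark, the fraction of TPs ranked before the first FP, can be sketched as below. This is a minimal illustration: the label strings, the function name and the toy hit list are assumptions made for the example and are not taken from Foldseek or from the benchmark scripts.

```python
def sensitivity_up_to_first_fp(ranked_labels, total_true):
    """Fraction of all possible TPs that are ranked before the first FP.

    ranked_labels: hit labels sorted from best to worst score; each label is
                   "TP", "FP" or "IGNORE" (hypothetical encoding).
    total_true:    number of correct matches that exist for this query.
    """
    if total_true <= 0:
        return 0.0
    found = 0
    for label in ranked_labels:
        if label == "FP":
            break          # stop counting at the first false positive
        if label == "TP":
            found += 1     # ignored hits neither count nor stop the scan
    return found / total_true

# Toy ranking: three TPs precede the first FP, five TPs exist in total.
hits = ["TP", "TP", "IGNORE", "TP", "FP", "TP"]
print(sensitivity_up_to_first_fp(hits, total_true=5))  # 0.6
```

Averaging this value over all queries, or plotting its cumulative distribution, gives summaries of the kind shown in Fig. 2a,c.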


[Figure 2. Panels a–f. a, sensitivity up to the first FP versus fraction of queries (superfamily, AUROC1). b, precision versus recall (superfamily, weighted ROC). c, time (s) versus average sensitivity up to the first FP for fold, superfamily and family. d, query coverage versus TP hits up to the first FP (multi-domain), with runtimes and speed-up factors relative to Foldseek. e, sensitivity versus precision, with a second panel labeled HOMSTRAD. f, Dali alignment F1 score versus Foldseek alignment F1 score. Tools shown: Foldseek-TM, TM-align, TM-align-fast, Foldseek, Dali, CE, CLE-SW, 3D-BLAST, MMseqs2 and Geometricus.]

Fig. 2 | Foldseek reaches similar sensitivities to structural aligners at thousands of times their speed. a, Cumulative distributions of sensitivity for homology detection on the SCOPe40 database of single-domain structures. TPs are matches within the same superfamily; FPs are matches between different folds. Sensitivity is the area under the ROC (AUROC) curve up to the first FP (see Supplementary Fig. 4 for family and fold). b, Precision-recall curve of SCOPe40 superfamilies (see Supplementary Fig. 4 for family and fold). c, Average sensitivity up to the first FP for family, superfamily and fold versus total runtime on an AMD EPYC 7702P 64-core CPU for the all-versus-all searches of 11,211 structures of SCOPe40. d, Search sensitivity on multi-domain, full-length AlphaFold2 protein models. One hundred queries, randomly selected from AlphaFoldDB (version 1), were searched against this database. Per-residue query coverage (y axis) is the fraction of residues covered by at least x (x axis) TP matches ranked before the first FP match. e, Alignment quality for alignments of AlphaFoldDB (version 1) protein models (top panel), averaged over the top five matches of each of the 100 queries. Sensitivity = TP residues in alignment / query length; precision = TP residues / alignment length. Reference-based alignment quality benchmark on HOMSTRAD alignments. f, Alignment quality comparison between Foldseek and Dali for each HOMSTRAD family. The F1 score is the harmonic mean between sensitivity and precision.
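
The alignment quality measures named in the caption (sensitivity, precision and their harmonic mean F1) reduce to simple counting. The sketch below is illustrative only: it assumes that per-residue LDDT scores of the aligned query residues are already available and that the alignment length equals the number of aligned residues; the function names and the toy scores are hypothetical.

```python
def alignment_quality(residue_lddt, query_length, tp_min=0.6):
    """Return (sensitivity, precision) for one alignment.

    residue_lddt: per-residue LDDT of each aligned query residue.
    query_length: total number of residues in the query.
    A residue is a TP when its LDDT is above tp_min (0.6 in the text).
    Residues below 0.25 are FPs and the rest are ignored; neither group
    adds to the TP count, so only the TP threshold is needed here.
    """
    alignment_length = len(residue_lddt)
    if alignment_length == 0 or query_length <= 0:
        return 0.0, 0.0
    tp = sum(1 for score in residue_lddt if score > tp_min)
    return tp / query_length, tp / alignment_length

def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (0.0 if both are zero)."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)

# Toy alignment: 5 aligned residues of a 10-residue query, 3 of them TPs.
sens, prec = alignment_quality([0.9, 0.8, 0.7, 0.5, 0.2], query_length=10)
print(sens, prec, f1_score(sens, prec))
```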

Foldseek alignments are more accurate and sensitive than MMseqs2, CLE-SW and TM-align, similarly accurate to Dali and 13% less precise but 15% more sensitive than CE. In the reference-based HOMSTRAD alignment quality benchmark24, Foldseek performs slightly below CE, Dali and TM-align (Fig. 2e). Figure 2f shows the comparison between Foldseek and Dali in alignment quality for all HOMSTRAD families (see Supplementary Fig. 8 for example alignments).

To find potentially problematic high-scoring Foldseek FPs, we searched the set of unfragmented models in AlphaFoldDB (version 1) with average predicted LDDT1 ≥ 80 against itself. We inspected the 1,675 (of 133,813) high-scoring FPs (score per aligned column ≥ 1.0, TM-score < 0.5), revealing queries with multiple structured segments but with incorrect relative orientations (Supplementary Table 3 and Supplementary Fig. 9). The folded segments were correctly aligned by Foldseek. This illustrates that 3D aligners such as TM-align may overlook homologous structures that are not globally superposable, whereas Foldseek (as well as the two-dimensional (2D) aligner Dali) is independent of relative domain orientations and excels at detecting homologous multi-domain structures12.

We developed a webserver (https://search.foldseek.com) for multi-database searches, including AlphaFoldDB (version 4: Proteomes and Swiss-Prot), AlphaFoldDB (version 4) and CATH25 clustered at 50% sequence identity, ESM Atlas-HQ and Protein Data Bank (PDB)26.

We compared the Foldseek webserver, TM-align and Dali using SARS-CoV-2 RdRp (PDB: 6M71, chain A (ref. 27); 942 residues) in AlphaFoldDB (version 1). Search times were 10 d for Dali, 33 h for TM-align and 6 s for Foldseek, making it 180,000 and 23,000 times faster than Dali and TM-align, respectively. All top 10 hits were known RdRp homologs (Supplementary Table 4).

The availability of high-quality structures for nearly every folded protein is transformative for biology and bioinformatics. Sequence-based analyses will soon be largely superseded by structure-based analyses. The main limitation in our view, the four orders of magnitude slower speed of structure comparisons, is removed by Foldseek.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41587-023-01773-0.

References
1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
3. Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
4. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
5. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
6. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).


7. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
8. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
9. Mahlich, Y., Steinegger, M., Rost, B. & Bromberg, Y. HFSP: high speed homology-driven function annotation of proteins. Bioinformatics 34, i304–i312 (2018).
10. Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence – a study of structural response in protein cores. Proteins 77, 499–508 (2009).
11. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
12. Hasegawa, H. & Holm, L. Advances and pitfalls of protein structural alignment. Curr. Opin. Struct. Biol. 19, 341–348 (2009).
13. Holm, L. Using Dali for protein structure comparison. Methods Mol. Biol. 2112, 29–42 (2020).
14. Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
15. Guyon, F., Camproux, A.-C., Hochez, J. & Tuffery, P. SA-Search: a web tool for protein structure mining based on a structural alphabet. Nucleic Acids Res. 32, W545–W548 (2004).
16. Ma, J. & Wang, S. Algorithms, applications, and challenges of protein structure alignment. Adv. Protein Chem. Struct. Biol. 94, 121–175 (2014).
17. Wang, S. & Zheng, W.-M. CLePAPS: fast pair alignment of protein structures based on conformational letters. J. Bioinform. Comput. Biol. 6, 347–366 (2008).
18. Yang, J.-M. & Tung, C.-H. Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659 (2006).
19. de Brevern, A. G., Etchebest, C. & Hazout, S. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 41, 271–287 (2000).
20. Durairaj, J., Akdel, M., de Ridder, D. & van Dijk, A. D. J. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 36, i718–i725 (2020).
21. Chandonia, J.-M., Fox, N. K. & Brenner, S. E. SCOPe: classification of large macromolecular structures in the structural classification of proteins – extended database. Nucleic Acids Res. 47, D475–D481 (2019).
22. Jiang, P. & Singh, M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26, 1105–1111 (2010).
23. Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
24. Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469–2471 (1998).
25. Bordin, N. et al. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun. Biol. 6, 160 (2023).
26. Burley, S. K. et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 49, D437–D451 (2021).
27. Gao, Y. et al. Structure of the RNA-dependent RNA polymerase from COVID-19 virus. Science 368, 779–782 (2020).
28. Van den Oord, A., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation learning. Proc. of the 31st Conference on Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf (NIPS, 2017).

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023


Methods

Overview
Foldseek enables fast and sensitive comparison of large structure sets. It encodes structures as sequences over the 20-state 3Di alphabet and, thereby, reduces structural alignments to 3Di sequence alignments. The 3Di alphabet developed for Foldseek describes tertiary residue–residue interactions instead of backbone conformations and proved critical for reaching high sensitivities. Foldseek’s prefilter finds two similar, spaced 3Di k-mer matches in the same diagonal of the dynamic programming matrix. By not restricting itself to exact matches, the prefilter achieves high sensitivity while reducing the number of sequences for which full alignments are computed by several orders of magnitude. Further speed-ups are achieved by multi-threading and using single instruction, multiple data (SIMD) vector units. Owing to the SIMDe library (https://github.com/simd-everywhere/simde), Foldseek runs on a wide range of CPU architectures (x86_64, arm64 and ppc64le) and operating systems (Linux and macOS). The core modules of Foldseek, which build on the MMseqs2 framework6, are described in the following paragraphs.

Create database
The createdb module converts a set of PDB (ref. 29), macromolecular crystallographic information file (mmCIF) formatted files or Foldcomp compressed structure (FCZ (ref. 30)) files into an internal Foldseek database format using the Gemmi package (https://gemmi.readthedocs.io/en/latest/) or the Foldcomp library. The format is compatible with the MMseqs2 database format, which is optimized for parallel access. We store each chain as a separate entry in the database. The module follows the MMseqs2 createdb module logic. However, in addition to the amino acid sequence, it computes the 3Di sequence from the 3D coordinates of the backbone and Cβ atoms (see the ‘Descriptors for 3Di structural alphabet’ and ‘Optimize nearest-neighbor selection’ subsections). Backbone atom and Cβ coordinates are needed only for the nearest-neighbor selection. For Cα-only structures, Foldseek reconstructs backbone atom coordinates using PULCHRA31. Missing Cβ coordinates (for example, in glycines) are defined such that the four groups attached to the Cα are arranged at the vertices of a regular tetrahedron. The 3Di and amino acid sequences and the Cα coordinates are stored in the Foldseek database. To save disk space, we optionally compress the Cα coordinates losslessly, beginning with three uncompressed 4-byte floating-point Cα coordinates and storing all subsequent coordinates as 2-byte signed integer differences32. If any difference is too large to be represented with a 2-byte signed integer, we fall back to 4-byte floats for all Cα coordinates.

Prefilter
The prefilter module detects double matches of similar, spaced words (k-mers) that occur on the same diagonal. The k-mer size is dynamically set to k = 6 or k = 7 depending on the size of the target database. Similar k-mers are those with a 3Di substitution matrix score above a certain threshold, whereas MMseqs2 uses an amino acid substitution matrix to compute the similarity (see the ‘3Di substitution score matrix’ subsection). The gapless double-match criterion suppresses hits to non-homologous structures effectively, as they are less likely to have consecutive k-mer matches on the same diagonal by chance. To avoid FP matches due to regions with biased 3Di sequence composition, a compositional bias correction is applied in a way analogous to MMseqs2 (ref. 33). For each hit, we perform an ungapped alignment over the diagonals with double, consecutive, similar k-mer matches and sort those by the maximum ungapped diagonal score. Alignments with a score of at least 15 bits are passed on to the next stage. We implemented an optional taxonomy filter within the prefiltering step to help users search through taxonomic subsets of the target database. After the gapless double-diagonal matching stage and before the ungapped alignment stage, we reject all potential target hits that do not lie within a taxonomic clade specified by the user.

Pairwise local structural alignments
After the prefilter has removed the vast majority of non-homologous sequences, the structurealign module computes pairwise alignments for the remaining sequences using an SIMD-accelerated Smith–Waterman algorithm34,35. We extended this implementation to support amino acid and 3Di scoring, compositional bias correction and 256-bit-wide vectorization. The score linearly combines amino acid and 3Di substitution scores with weights 1.4 and 2.1, respectively. We optimized these two weights and the ratio of gap extend to gap open penalty on ~1% of alignments (all-versus-all on 10% of randomly selected SCOPe40 domains). A compositional bias correction is applied to the amino acid and 3Di scores. To further suppress high-scoring FP matches, for each match we align the reversed query sequence against the target and subtract the reverse bit score from the forward bit score.

Structural bit score
We rank hits by a ‘structural bit’ score, that is, the product of the bit score produced by the Smith–Waterman algorithm and the geometric mean of average alignment LDDT and the alignment TM-score.

Fast alignment LDDT computation
To improve the LDDT score computation speed, we store the 3D coordinates of the query in a grid using spatial hashing. Each grid cell spans 15 Å, which is the default radius considered for the LDDT computation. For each aligned query residue i, we compute the distances to all Cα atoms within a 15 Å radius by searching all neighboring grid cells of the query residue’s grid cell. For each residue j, we compute the distance between the Cα atoms of i and j and the distance of the corresponding aligned target residues. Query and target distances for the aligned pairs are subtracted, and the differences d are transformed into LDDT scores s = 0.25 × ((d < 0.5) + (d < 1.0) + (d < 2.0) + (d < 4.0)). For each i, we obtain the means of the scores for all Cα atoms j within the 15 Å radius of i. The LDDT score is the mean of these means over all query residues i.

E values
To estimate E values for each match, we trained a neural network to predict the mean μ and scale parameter λ of the extreme value distribution for each query. The module computemulambda takes a query and database structures as input and aligns the query against a randomly shuffled version of the database sequences. For each query sequence, the module produces N random alignments and fits to their scores an extreme value (Gumbel) distribution. The maximum likelihood fitting is done using the Gumbel fitting function taken from HMMER3 (hmmcalibrate)36. To train the neural network, it is critical to use query and target proteins that include problematic regions, such as structurally biased, disordered or badly modeled regions that occur ubiquitously in full-length proteins or modeled structures. We, therefore, trained the network on 100,000 structures sampled from the AlphaFoldDB (version 1). We trained a neural network to predict μ and λ from the amino acid composition of the query and its length (so a scrambled version of the query sequence would produce the same μ and λ). The network has 22 input nodes, two fully connected layers with 32 nodes each (ReLU activation) and two linear output nodes. The optimizer Adam with learning rate 0.001 was used for training. When testing the resulting E values on searches with scrambled sequences, the log of the mean number of FPs per query turned out to have an accurately linear dependence on the log of the reported E values, albeit with a slope of 0.32 instead of 1. We, therefore, correct the E values from the neural network by taking them to the power of 0.32. We compared how well the mean number of FPs at a given E value agreed with the E values reported by Foldseek, MMseqs2 and 3D-BLAST (Supplementary Fig. 10; see Supplementary Fig. 11 for AlphaFoldDB). We considered a hit as FP if it was in a different fold and had a TM-score lower than 0.3. Furthermore, we ignored all cross-fold hits within the four-bladed to eight-bladed β-propeller superfamilies (SCOPe b.66–b.70) and within
the Rossmann-like folds (c.2–c.5, c.27, c.28, c.30 and c.31) because of the extensive cross-fold homologies within these groups (ref. 37).

Probability of TP match

Foldseek computes for each match a simple estimate for the probability that the match is a TP match given its structural bit score. Here, hits within the same superfamily are TP; hits to another fold are FP; and hits to the same family or to another superfamily are ignored. We estimate the structural bit score distributions of TP and FP hits (p(score | TP) and p(score | FP)), which allow us to calculate the probability of a TP as

\[ p(\mathrm{TP} \mid \mathrm{score}) = \frac{p(\mathrm{score} \mid \mathrm{TP})\, p(\mathrm{TP})}{p(\mathrm{score} \mid \mathrm{TP})\, p(\mathrm{TP}) + p(\mathrm{score} \mid \mathrm{FP})\, p(\mathrm{FP})} \]

Both score distributions were fitted on SCOPe40 with a mixture model consisting of two gamma distributions (resulting in five parameters for each function). For the fitting, the function gammamixEM from the R package mixtools (ref. 38) was used. We excluded cross-fold hits between certain folds as in the E value estimation. For example, Foldseek finds around the same number of FPs and TPs with a score of 51 in SCOPe40. The probability for a hit with score 51 is, therefore, 50%.

Pairwise global structural alignments using TM-align

We also offer the option to use TM-align for pairwise structure alignment instead of the 3Di-based alignment. We implemented TM-align based on the Cα atom coordinates and made adjustments to improve the (1) speed and (2) memory usage. (1) TM-align performs multiple floating-point-based Needleman–Wunsch (NW) alignment steps while applying different scoring functions (for example, score secondary structure, Euclidean distance of superposed structures or fragments). TM-align's NW code did not take advantage of SIMD instructions; therefore, we replaced it by parasail's (ref. 39) SIMD-based NW implementation and extended it to support the different scoring functions. We also replaced the TM-score computation using fast_protein_cluster's SIMD-based implementation (ref. 40). Our NW implementation does not compute exactly the same alignment because we apply affine gap costs, whereas TM-align does not (Supplementary Fig. 12). (2) TM-align requires 17 bytes × query length × target length of memory, and we reduce the constant overhead from 17 bytes to 4 bytes. If Foldseek is used in TM-align mode (parameter --alignment-type 1), TM-align is used for the alignment stage after the prefilter step, where we replace the reported E value column with TM-scores normalized by the query length. The results are ordered in descending order by average TM-score by default.

Descriptors for 3Di structural alphabet

The 3Di alphabet describes the tertiary contacts between residues and their nearest neighbors in 3D space. For each residue i, the conformation of the local backbone around i, together with the local backbone around its nearest neighbor j, is approximated by 20 discrete states (Supplementary Fig. 3). We chose the alphabet size A = 20 as a tradeoff between encoding as much information as possible (large A; Supplementary Fig. 13) and limiting the number of similar 3Di k-mers that we need to generate in the k-mer-based prefilter, which scales with A^k. The discrete single-letter states are formed from neighborhood descriptors containing 10 features encoding the conformation of backbones around residues i and j represented by the Cα atoms (Cα,i−1, Cα,i, Cα,i+1) and (Cα,j−1, Cα,j, Cα,j+1). The descriptors use the five unit vectors along the following directions:

u_1: Cα,i−1 → Cα,i
u_2: Cα,i → Cα,i+1
u_3: Cα,j−1 → Cα,j
u_4: Cα,j → Cα,j+1
u_5: Cα,i → Cα,j

We define the angle between u_k and u_l as ϕ_kl, so cos ϕ_kl = u_k^T u_l. The seven features cos ϕ_12, cos ϕ_34, cos ϕ_15, cos ϕ_35, cos ϕ_14, cos ϕ_23, cos ϕ_13 and the distance |Cα,i − Cα,j| describe the conformation between the backbone fragments. In addition, we encode the sequence distance with the two features sign(i − j) min(|i − j|, 4) and sign(i − j) log(|i − j| + 1).

Learning the 3Di states using a vector quantized variational autoencoder

The 10-dimensional descriptors were discretized into an alphabet of 20 states using a vector quantized variational autoencoder (VQ-VAE) (ref. 28). In contrast to standard clustering approaches such as k-means, VQ-VAE is a nonlinear approach that can optimize decision surfaces for each of its states. In contrast to the standard VQ-VAE, we trained the VQ-VAE not as a simple generative model but, rather, to learn states that are maximally conserved in evolution. To that end, we trained it with pairs of descriptors x_n, y_n ∈ ℝ^10 from structurally aligned residues, to predict the distribution of y_n from x_n.

The VQ-VAE consists of an encoder and decoder network with the discrete latent 3Di state as a bottleneck in between. The encoder network embeds the 10-dimensional descriptor x_n into a 2D continuous latent space, where the embedding is then discretized by the nearest centroid, each centroid representing a 3Di state. Given the centroid, the decoder predicts the probability distribution of the descriptor y_n of the aligned residue. After training, only encoder and centroids are used to discretize descriptors. Encoder and decoder networks are both fully connected with two hidden layers of dimension 10, a batch normalization after each hidden layer and ReLU as activation functions. The encoder, centroids and decoder have 242, 40 and 352 parameters, respectively. The output layer of the decoder consists of 20 units predicting μ and σ² of the descriptors x of the aligned residue, such that the decoder predicts 𝒩(x | μ, Iσ²) (with diagonal covariance).

We trained the VQ-VAE on the loss function defined in Equation (3) in ref. 28 (with commitment loss = 0.25) using the deep learning framework PyTorch (version 1.9.0), the Adam optimizer, with a batch size of 512, and a learning rate of 10^−3 over four epochs. Using Kerasify (https://github.com/moof2k/kerasify), we integrated the encoder network into Foldseek. The domains from SCOPe40 were split 80%/20% by fold into training and validation sets. For the training, we aligned the structures with TM-align, removed all alignments with a TM-score below 0.6 and removed all aligned residue pairs with a distance between their Cα atoms of more than 5 Å. We trained the VQ-VAE with 100 different initial parameters and chose the model that was performing best in the benchmark on the validation dataset (the highest sum of ratios between 3Di AUC and TM-align AUC for family, superfamily and fold level).

3Di substitution score matrix

We trained a BLOSUM-like substitution matrix for 3Di sequences from pairs of structurally aligned residues used for the ‘VQ-VAE training’. First, we determined the 3Di states of all residues. Next, the substitution frequencies among 3Di states were calculated by counting how often two 3Di states were structurally aligned. (Note that the substitution frequencies from state A to state B and the opposite direction are equal.) Finally, the score

\[ S(x, y) = 2 \log_2 \frac{p(x, y)}{p(x)\, p(y)} \]

for substituting state x through state y is the log-ratio between the substitution frequency p(x, y) and the probability that the two states occur independently, scaled by the factor 2.

3Di alphabet cross-validation

We trained the 3Di alphabet (the VQ-VAE weights) and the substitution matrix by four-fold cross-validation on SCOPe40. We split the SCOPe40 dataset into four parts, such that all domains of each fold ended up in the same part of the four parts. 3Di alphabets were trained on three parts and tested on the remaining part, selecting each of the four parts in turn as a test set. The 80:20 split between training and validation sets to select the best alphabet out of the 100 VQ-VAE runs happens within the 3/4 of the cross-validation training data. Supplementary Fig. 14 shows the mean sensitivity (black) and the standard deviation (gray
area) in comparison to the final 3Di alphabet, for which we trained the 3Di alphabet on the entire SCOPe40 (red). No overfitting was observed, despite training 492 parameters (282 neural network and 210 substitution matrix entries). In Fig. 2, we, therefore, show the benchmark results for the final 3Di alphabet, trained on the entire SCOPe40.

Nearest-neighbor selection

To select nearest-neighbor residues that maximize the performance of the resulting 3Di alphabet in finding and aligning homologous structures, we introduced the virtual center V of a residue. The virtual center position is defined by the angle θ (V-Cα-Cβ), the dihedral angle τ (V-Cα-Cβ-N) and the length l (|V − Cα|) (Supplementary Fig. 1). For each residue i, we selected the residue j with the smallest distance between their virtual centers. The virtual center was optimized on the training and validation structure sets used for the VQ-VAE training by creating alphabets for positions with θ ∈ [0, 2π], τ ∈ [−π, π] in 45° steps and l ∈ {1.53 Å × k : k ∈ {1, 1.5, 2, 2.5, 3}} (1.53 Å is the distance between Cα and Cβ). The virtual center defined by θ = 270°, τ = 0° and l = 2 performed best in the SCOPe benchmark.

This virtual center preferably selects long-range, tertiary interactions and only falls back to selecting interactions to i + 1 or i − 1 when no other residues are nearby. In that case, the interaction captures only the backbone conformation.

SCOPe benchmark

We downloaded the SCOPe40 structures (available at https://wwwuser.gwdg.de/~compbiol/foldseek/scop40pdb.tar.gz).

The SCOPe benchmark set consists of single domains with an average length of 174 residues. In our benchmark, we compare the domains all-versus-all. Per domain, we measured the fraction of detected TPs up to the first FP. For family-level, superfamily-level and fold-level recognition, TPs were defined as hits in the same family; in the same superfamily but not the same family; and in the same fold but not the same superfamily, respectively. Hits from different folds are FPs.

Evaluation SCOPe benchmark

After sorting the alignment result of each query (described in the ‘Tools and options for benchmark comparison’ subsection), we calculated the sensitivity as the fraction of TPs in the sorted list up to the first FP, all excluding self-hits. For comparison, we took the mean sensitivity over all queries for family-level, superfamily-level and fold-level classifications. We evaluated only SCOPe members with at least one other family, superfamily and fold member. We measure the sensitivity up to the 1st FP (ROC1) instead of, for example, up to the 5th FP, because ROC1 better reflects the requirements for low false discovery rates in automatic searches.

Additionally, we plotted precision-recall curves for each tool (Fig. 2b and Supplementary Fig. 4). After sorting the alignment results by the structural similarity scores (as described in the ‘Tools and options for benchmark comparison’ subsection) and excluding self-matches, we generated a weighted precision-recall curve for family-level, superfamily-level and fold-level classifications (precision = TP / (TP + FP) and recall = TP / (TP + FN)). All counts (TP, FP and FN) were weighted by the reciprocal of their family, superfamily or fold size. In this way, folds, superfamilies and families contribute linearly with their size instead of quadratically (ref. 36).

Runtime evaluations on SCOPe and AlphaFoldDB

We measured the speed of structural aligners on a server with an AMD EPYC 7702P 64-core CPU and 1,024 GB RAM. On SCOPe40, we measured or estimated the runtime for an all-versus-all comparison. To avoid excessive runtimes for TM-align, Dali and CE, we estimated the runtime by randomly selecting 10% of the 11,211 SCOPe domains as queries. We measured runtimes on AlphaFoldDB for searches with the same 100 randomly selected queries used for the sensitivity and alignment quality benchmarks (Fig. 2d,e). Tools with multi-threading support (MMseqs2 and Foldseek) were executed with 64 threads; tools without were parallelized by breaking the query set into 64 equally sized chunks and executing them in parallel.

Reference-free multi-domain benchmarks

We devised two reference-free benchmarks that do not rely on any reference structural alignments. We clustered the AlphaFoldDB (version 1; ref. 3) using SPICi (ref. 22). For this, we first aligned all protein sequences all against all with BLAST (2.5.0+; ref. 5) using an E value threshold of <10^−3. SPICi produced 34,270 clusters from the search result. For each cluster, we picked the longest protein as representative. We randomly selected 100 representatives as queries and searched the set of remaining structures. The top five alignments of all queries are listed at https://wwwuser.gwdg.de/~compbiol/foldseek/multi_domain_top5_alignments/.

For the evaluation, we needed to adjust the LDDT score function taken from AlphaFold2 (ref. 1). LDDT calculates for each residue i in the query the fraction of residues in the 15 Å neighborhood that have a distance within 0.5, 1, 2 or 4 Å of the distance between the corresponding residues in the target (ref. 23). The denominator of the fraction is the number of 15 Å neighbors of i that are aligned to some residue in the target. This does not properly penalize non-compact models in which each residue has few neighbors within 15 Å. We, therefore, use as denominator the total number of neighboring residues within 15 Å of i.

For the alignment quality benchmark (Fig. 2e), we classified each aligned residue pair as TP or FP depending on its residue-wise LDDT score, that is, the fraction of distances to its 15 Å neighbors that are within 0.5, 1, 2 and 4 Å of the distance to the corresponding residues in the query, averaged over the four distance thresholds. TP residues are those with a residue-wise LDDT score of at least 0.6 and FPs below 0.25, ignoring matches in between. For the search sensitivity benchmark (Fig. 2d), TP residue–residue matches are those with an LDDT score of the query-target alignment of at least 0.6 and FPs below 0.25, ignoring matches in between. (The LDDT score of the query-target alignment is the average of the residue-wise LDDT score over all aligned residue pairs.) The choice of thresholds is illustrated in Supplementary Fig. 6. The benchmark for other thresholds is shown in Supplementary Fig. 7.

All-versus-all search of AlphaFoldDB with Foldseek

We downloaded the AlphaFoldDB (version 1; ref. 3) containing 365,198 protein models and searched it all-versus-all using Foldseek -s 9.5 --max-seqs 2000. For our second-best hit analysis, we consider only models with (1) an average Cα predicted LDDT (pLDDT) greater than or equal to 80 and (2) models of non-fragmented domains. We also computed the structural similarity for each pair using TM-align (default options).

Tools and options for benchmark comparison

Owing to dataset overlap, we excluded methods from the benchmark that are likely to be overfitted on SCOPe. This applies to methods that trained many thousands of parameters (for example, deep neural networks) with strong data leakage among training, validation and test sets. For example, several tools allowed up to 40% sequence identity between sets. The following command lines were used in the SCOPe as well as the multi-domain benchmark:

Foldseek

We used Foldseek commit aeb5e during this analysis. Foldseek was run with the following parameters: --threads 64 -s 9.5 -e 10 --max-seqs 2000.

Foldseek-TM

For the Foldseek-TM benchmark, we first run a regular 3Di/AA-based Foldseek search using the following parameters: --threads 64 -s 9.5 -e 10 --max-seqs 4000 --alignment-mode 1. All hits
passing are then aligned by Foldseek's tmalign --tmalign-fast 1 --tmscore-threshold 0.0 -a. We used Foldseek commit aeb5e during this analysis. We expose Foldseek-TM in our command-line interface as a search mode that combines regular Foldseek 3Di/AA-based workflow with our TM-align implementation within the tmalign module.

MMseqs2

We used the default MMseqs2 (release 13-45111) search algorithm to obtain the sequence-based alignment result. MMseqs2 sorts the results by E value and score. We searched with: --threads 64 -s 7.5 -e 10000 --max-seqs 2000.

CLE-Smith–Waterman

We used PDB Tool version 4.80 (https://github.com/realbigws/PDB_Tool) to convert the benchmark structure set to CLE sequences. After the conversion, we used SSW (ref. 35) (commit ad452e) to align CLE sequences all-versus-all. We sorted the results by alignment score. The following parameters were used to run SSW: (1) protein alignment mode (-p); (2) gap open penalty of 100 (-o 100); (3) gap extend penalty of 10 (-e 10); (4) CLE's optimized substitution matrix (-a cle.shen.mat); and (5) returning alignment (-c). The gap open and extend values were inferred from DeepAlign (ref. 41). The results are sorted by score in descending order.

ssw_test -p -o 100 -e 10 -a cle.shen.mat -c

3D-BLAST

We used 3D-BLAST (beta102) with BLAST+ (2.2.26) and SSW (ref. 34) (version ad452e). We first converted the PDB structures to a 3D-BLAST database using 3d-blast -sq_write and 3d-blast -sq_append. We searched the structural sequences against the database using blastp with the following parameters: (1) 3D-BLAST's optimized substitution matrix (-M 3DBLAST); (2) number of hits and alignments shown of 12,000 (-v 12000 -b 12000); (3) E value threshold of 1,000 (-e 1000); (4) disabling query sequence filter (-F F); (5) gap open of 8 (-G 8); and (6) gap extend of 2 (-E 2). 3D-BLAST's results are sorted by E value in ascending order:

blastall -p blastp -M 3DBLAST -v 12000 -b 12000 -e 1000 -F F -G 8 -E 2

For Smith–Waterman, we used (1) gap open of 8; (2) gap extend of 2; (3) returning alignments (-c); (4) 3D-BLAST's optimized substitution matrix (-a 3DBLAST); and (5) protein alignment mode (-p): ssw_test -o 8 -e 2 -c -a 3DBLAST -p. We noticed that the 3D-BLAST matrix with Smith–Waterman resulted in a similar performance to CLE: 0.717, 0.230 and 0.011 for family classification, superfamily classification and fold classification, respectively. We excluded 3D-BLAST's measurement from the multi-domain benchmark because it occasionally produced high scores (>10^7) for single residue alignments.

Dali

We installed the standalone DaliLite.v5. For the SCOPe40 benchmark set, input files were formatted in DAT files with Dali's import.pl. The conversion to DAT format produced 11,137 valid structures out of the 11,211 initial structures for the SCOPe benchmark and 34,252 structures out of 34,270 SPICi clusters. After formatting the input files, we calculated the protein alignment with Dali's structural alignment algorithm. The results were sorted by Dali's z-score in descending order:

import.pl --pdbfile query.pdb --pdbid PDBid --dat DAT
dali.pl --cd1 queryDATid --db targetDB.list --TITLE systematic --dat1 DAT --dat2 DAT --outfmt "summary" --clean

CE

We used BioJava's (ref. 42) (version 5.4.0) implementation of the combinatorial extension (CE) alignment algorithm. We modified one of the modules of BioJava under shape configuration to calculate the CE value. Our modified CEalign.jar file requires a list of query files, path to the target PDB files and an output path as input parameters. This Java module runs an all-versus-all CE calculation with unlimited gap size (maxGapSize -1) to improve alignment results (ref. 14). The results were sorted by z-score in descending order. For the multi-domain benchmark, we excluded one query that ran for more than 16 days. The Jar file of our implementation of CE calculation is provided (see ‘Code availability’).

java -jar CEalign.jar querylist.txt TargetPDBDirectory OutputDirectory

Geometricus

We included Geometricus (ref. 20) in the SCOPe benchmark as a representative of alignment-free tools, which are fast but can find only globally similar structures. Geometricus discretizes fixed-length backbone fragments (shape-mers) using their 3D moment invariants and represents structures as a fixed-length count vector over the shape-mers. To calculate the shape-mer-based structural similarity of the benchmark set, we used Caretta-shape's Python implementation (1e3adb0) of multiple structure alignment (https://github.com/TurtleTools/caretta/caretta/multiple_alignment.py), which computes the Bray–Curtis similarity between the Geometricus shape-mer vectors. Our modified version extracts structural information from the input files and generates all-versus-all pairwise structural similarity score as an output. We ran Geometricus on a single core because it would require substantial engineering efforts to implement parallelization on multiple cores. With an efficient multi-core implementation, Geometricus might be as fast as MMseqs2 on 64 cores. The Python code of our implementation of Geometricus is provided:

python runGeometricus_caretta.py -i querylist.txt -o OutputDirectory
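
As an illustration of the similarity measure named in the Geometricus subsection, the following is a minimal sketch of a Bray–Curtis similarity between two shape-mer count vectors. It is not the Caretta or Geometricus code; the function name `bray_curtis_similarity`, the dictionary representation of count vectors and the integer shape-mer identifiers are assumptions made only for this example, and the value returned for two empty vectors is a convention chosen here.

```python
def bray_curtis_similarity(counts_a, counts_b):
    """Bray-Curtis similarity of two non-negative count vectors.

    Each vector is a dict mapping a (hypothetical) shape-mer identifier
    to how often that shape-mer occurs in one structure.
    Similarity = 2 * sum_k min(a_k, b_k) / (sum_k a_k + sum_k b_k).
    """
    keys = set(counts_a) | set(counts_b)
    # Counts shared by both structures, summed over all shape-mers.
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0  # assumption of this sketch: empty vs. empty scores 0
    return 2.0 * shared / total


# Toy count vectors over made-up shape-mer ids 0..3.
structure_a = {0: 3, 1: 1}
structure_b = {0: 1, 1: 1, 3: 2}
similarity = bray_curtis_similarity(structure_a, structure_b)
```

The similarity is 1 for identical count vectors and 0 for vectors with no shape-mer in common, which is consistent with the statement above that alignment-free comparison of this kind only detects globally similar structures.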

TM-align

We downloaded and compiled the TMalign.cpp source code (version 2019/08/22) from the Zhang group website. We ran the benchmark using default parameters and -fast for the fast version. TM-align reports two TM-scores: (1) normalized by the length of 1st chain (query) or (2) normalized by the length of the 2nd chain (target). We used the average of TM-scores normalized by the 1st chain (query) and 2nd chain (target) in all our analyses. We evaluated TM-align's performance by sorting the results based on both the query TM-score and the minimum, maximum and average TM-score for both the query and target. Our results showed that the average TM-score performed the best in our single-domain benchmark.

Default: TMalign query.pdb target.pdb
Fast: TMalign query.pdb target.pdb -fast

HOMSTRAD alignment benchmark

The HOMSTRAD database contains expert-curated homologous structural alignments for 1,032 protein families (ref. 24). We downloaded the latest HOMSTRAD version (https://mizuguchilab.org/homstrad/data/homstrad_with_PDB_2022_Aug_1.tar.gz) and picked the pairwise alignments between the first and last members of each family, which resulted in structures of a median length of 182 residues. We used the same parameters as in the SCOPe and multi-domain benchmark. We forced Foldseek, MMseqs2 and CLE-Smith–Waterman to return an alignment by switching off the prefilter and E value threshold. With the HOMSTRAD alignments as reference, we measured for each pairwise alignment the sensitivity (fraction of residue pairs of the HOMSTRAD alignment that were correctly aligned) and the precision (fraction of correctly aligned residue pairs in the predicted alignment). Dali, CE
and CLE-Smith–Waterman failed to produce an alignment for 35, 1 and 1 out of 1,032 pairs, respectively, which were rated with a sensitivity of 0. The mean sensitivity and precision are shown in Fig. 2e, and all individual alignments are listed in homstrad_alignments.txt at https://wwwuser.gwdg.de/~compbiol/foldseek/.

Limitations of benchmarks

The SCOPe benchmark to measure search sensitivity uses only single-domain proteins as queries and targets (Fig. 2a–c). It, therefore, cannot assess the ability of tools to find local similarities, for example, finding homologous domains shared between two multi-domain proteins. The alignment benchmark based on HOMSTRAD (Fig. 2e) has the same limitation, as the vast majority of proteins in HOMSTRAD have a single domain (median length = 182 residues). A drawback of our reference-free multi-domain benchmark is the need to choose thresholds for TPs and FPs (Supplementary Fig. 6).

Pre-built and ready-to-download databases

Foldseek includes the databases module to aid users with the download and setup of structural databases. Currently, we include the four variants of the AlphaFoldDB (version 4): UniProt (214 million structures), UniProt50, a clustered database to 50% sequence identity and 90% bi-directional coverage using MMseqs2 (parameters -c 0.9 --min-seq-id 0.5 --cluster-reassign; 54 million structures), Proteome (564,000 structures) and Swiss-Prot (542,000 structures). Additionally, we regularly build and offer a 100% sequence identity clustered PDB. The update pipeline is available in the util/update_webserver_pdb folder in the main Foldseek repository. These databases are hosted on Cloudflare R2 for fast downloading. We additionally link to and provide an automatic setup procedure for the ESM Atlas High-Quality Clu30 (ref. 4) database.

Webserver

The Foldseek webserver is based on the MMseqs2 webserver (ref. 43). To allow for searches in seconds, we implemented MMseqs2's pre-computed database indexing capabilities in Foldseek. Using these, the search databases can be fully cached in system memory by the operating system and instantly accessed by each Foldseek process, thus avoiding expensive accesses to slow disk drives. A similar mechanism was used to store and read the associated taxonomic information. The AlphaFoldDB/UniProt50 (version 4), AlphaFoldDB/Proteome (version 4), AlphaFoldDB/Swiss-Prot (version 4), CATH50, ESM Atlas High-Quality Clu30 and PDB100 require 191 GB, 3.8 GB, 3.4 GB, 1.4 GB, 110 GB and 2.0 GB RAM, respectively. The databases are kept in memory using vmtouch (https://github.com/hoytech/vmtouch). Databases are only required to remain resident in RAM if Foldseek is used as a webserver. During batch searches, Foldseek adapts its memory use to the available RAM of the machine. We implemented a structural visualization using the NGL viewer (ref. 44) to aid the investigation of pairwise hits. Because we only store Cα traces of the database proteins, we use PULCHRA (ref. 30) to complete the backbone of these sequences, and also of the query if necessary, to enable a ribbon visualization (ref. 45) of the proteins. For a high-quality superposition, we use TM-align (ref. 11) to superpose the structures based on the Foldseek alignment. Both PULCHRA and TM-align are executed within the users’ browser using WebAssembly. They are available as pulchra-wasm and tmalign-wasm on the npm package repository as free open-source software.

Structure prediction in the webserver

We use the ESM Atlas API to predict structures of user-supplied sequences that are, at most, 400 residues long. This enables sequence-to-structure searches in the webserver.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Benchmark data are available at https://wwwuser.gwdg.de/~compbiol/foldseek.

Code availability

Foldseek is GPLv3-licensed free open-source software. The source code and binaries for Foldseek can be downloaded at https://github.com/steineggerlab/foldseek. The webserver code is available at https://github.com/soedinglab/mmseqs2-app. The analysis scripts are available at https://github.com/steineggerlab/foldseek-analysis.

References

29. Burley, S. K. et al. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2019).
30. Kim, H., Mirdita, M. & Steinegger, M. Foldcomp: a library and format for compressing and indexing large protein structure sets. Bioinformatics 39, btad153 (2023).
31. Rotkiewicz, P. & Skolnick, J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J. Comput. Chem. 29, 1460–1465 (2008).
32. Valasatava, Y. et al. Towards an efficient compression of 3D coordinates of macromolecular structures. PLoS ONE 12, e0174846 (2017).
33. Hauser, M., Steinegger, M. & Söding, J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics 32, 1323–1330 (2016).
34. Farrar, M. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 23, 156–161 (2007).
35. Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW library: an SIMD Smith–Waterman C/C++ library for use in genomic applications. PLoS ONE 8, e82138 (2013).
36. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
37. Söding, J. & Remmert, M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr. Opin. Struct. Biol. 21, 404–411 (2011).
38. Benaglia, T., Chauveau, D., Hunter, D. R. & Young, D. mixtools: an R package for analyzing finite mixture models. J. Stat. Softw. 32, 1–29 (2009).
39. Daily, J. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics 17, 81 (2016).
40. Hung, L.-H. & Samudrala, R. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data. Bioinformatics 30, 1774–1776 (2014).
41. Jiménez-Moreno, A., Strelák, D., Filipovic, J., Carazo, J. M. & Sorzano, C. O. S. DeepAlign, a 3D alignment method based on regionalized deep learning for Cryo-EM. J. Struct. Biol. 213, 107712 (2021).
42. Lafita, A. et al. BioJava 5: a community driven open-source bioinformatics library. PLoS Comput. Biol. 15, e1006791 (2019).
43. Mirdita, M., Steinegger, M. & Söding, J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics 35, 2856–2858 (2019).
44. Rose, A. S. et al. NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 34, 3755–3758 (2018).
45. Richardson, J. S. Early ribbon drawings of proteins. Nat. Struct. Biol. 7, 624–625 (2000).

Acknowledgements

We thank N. Bordin, I. Sillitoe and C. Orengo for reporting issues and providing valuable feedback; Y. Zhang, P. Rotkiewicz and M. Wojdyr for making TM-align, PULCHRA and the Gemmi library freely accessible; and D.-Y. Kim for creating the Foldseek logo.
M.S. acknowledges support from the National Research Foundation of Korea (NRF) (grants 2019R1A6A1A10073437, 2020M3A9G7103933, 2021R1C1C102065 and 2021M3A9I4021220), the Samsung DS Research Fund and the Creative-Pioneering Researchers Program through Seoul National University. S.K. acknowledges support by NRF grant 2019R1A6A1A10073437. J.S. acknowledges support by the German Ministry for Education and Research (horizontal4meta). We used the compute cluster at the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG).

Author contributions

M.K., S.K., J.S. and M.S. designed the research. M.K., S.K., C.T., M.M. and M.S. developed code and performed analyses. M.K. and J.S. developed the 3Di alphabet. J.L. implemented the fast LDDT code. M.M. and C.L.M.G. developed the webserver. M.K., S.K., C.T., M.M., J.S. and M.S. wrote the manuscript.

Funding

Open access funding provided by Max Planck Society.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41587-023-01773-0.

Correspondence and requests for materials should be addressed to Johannes Söding or Martin Steinegger.

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.
