ESM2 for candidate sequence filtering 🤖
After generating de novo protein sequences with a tool like LigandMPNN, you need to rank them to select promising candidates for experimental validation. One powerful approach is to use protein language models like Meta's ESM2. These models rely on a BERT-like architecture and a Masked Language Modeling (MLM) objective to learn rich representations of protein sequences. Note that this Space pairs well with the companion RFdiffusion3, LigandMPNN and RosettaFold3 Spaces for a full de novo design pipeline!
ESM2 is used here for two main purposes:
- Generating embeddings: ESM2's hidden layers create high-dimensional representations of protein sequences that capture structural and functional information. These embeddings can be used as input features for downstream machine learning models to predict function or properties, or even for folding. They can also be combined with dimensionality reduction techniques like t-SNE to visualize the sequence landscape, identify clusters, or compare candidates against known proteins.
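As a rough illustration of how such embeddings can be extracted, here is a minimal sketch using the Hugging Face `transformers` library and the public `facebook/esm2_t6_8M_UR50D` checkpoint (the smallest ESM2 model, hidden size 320). This is an assumption about tooling, not necessarily how this Space is implemented internally:

```python
# Sketch: mean-pooled per-sequence embedding from ESM2 (8M checkpoint).
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # smallest ESM2; hidden size 320
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(seq: str) -> torch.Tensor:
    """Return a single fixed-size embedding vector for a protein sequence."""
    enc = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (1, L+2, hidden_size)
    # Mean-pool over residue positions, skipping the CLS/EOS special tokens.
    return hidden[0, 1:-1].mean(dim=0)

vec = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Mean pooling is one common choice; per-residue embeddings (`hidden[0, 1:-1]`) can be kept instead when the downstream task needs positional features.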
- Calculating pseudo-perplexity (PPL) scores: the lower this score is for a given input sequence, the more "natural" or "plausible" the sequence is under the model's learned distribution. PPL is often used as a filtering criterion in de novo design, since low-PPL sequences are more likely to express properly in the lab and fold into stable structures. It also provides an evaluation metric orthogonal to structure-based methods like RosettaFold.
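The standard pseudo-perplexity computation masks each residue in turn, scores it with the MLM head, and exponentiates the average negative log-likelihood. A minimal sketch, again assuming the `facebook/esm2_t6_8M_UR50D` checkpoint via `transformers` (an illustrative choice, not necessarily this Space's internals):

```python
# Sketch: exact pseudo-perplexity via one forward pass per masked residue.
# Assumes `transformers` and `torch` are installed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def pseudo_perplexity(seq: str) -> float:
    enc = tokenizer(seq, return_tensors="pt")
    ids = enc["input_ids"]  # (1, L+2) including CLS/EOS tokens
    nll, count = 0.0, 0
    with torch.no_grad():
        for i in range(1, ids.shape[1] - 1):  # skip CLS/EOS positions
            masked = ids.clone()
            masked[0, i] = tokenizer.mask_token_id
            logits = model(input_ids=masked).logits
            logp = torch.log_softmax(logits[0, i], dim=-1)
            nll -= logp[ids[0, i]].item()  # NLL of the true residue
            count += 1
    return float(torch.exp(torch.tensor(nll / count)))
```

Lower values indicate sequences the model finds more plausible; scores are only comparable across sequences scored with the same model.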
How to use this Space:
- Choose the ESM2 model: models mainly differ by the number of parameters (8M, 35M, 650M). Larger models produce better PPL scores and richer embeddings but have longer runtimes.
- Upload one or more FASTA files containing your candidate sequences.
- Choose the batch size: it controls how many sequences are processed together. Larger batch sizes can speed up processing but require more GPU memory.
- Choose between generating embeddings or calculating pseudo-perplexity scores.
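The upload-and-batch steps above can be sketched in plain Python. The FASTA parser and batching helper below are illustrative (the Space's actual input handling may differ):

```python
# Sketch: parse a FASTA file and group sequences into fixed-size batches.
from pathlib import Path

def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may span multiple lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def batched(records, batch_size):
    """Yield successive batches of at most `batch_size` records."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

Each batch of sequences can then be tokenized together with padding and run through the model in a single forward pass, which is where larger batch sizes buy speed at the cost of GPU memory.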
Note that calculating PPL scores is much more computationally intensive than generating embeddings: it scales cubically with sequence length $L$, because PPL requires $L$ forward passes through the model (one per masked token) and each forward pass itself costs $O(L^2)$ due to attention. For long sequences or large numbers of sequences, we recommend using the approximate PPL calculation, which masks 10% of tokens at a time and therefore needs only a constant number of passes (about 10), so it scales quadratically with sequence length. This provides a good tradeoff between accuracy and runtime.
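The approximate variant can be sketched by masking a strided ~10% subset of positions per pass, so every residue is scored exactly once in about 10 passes. As above, the `facebook/esm2_t6_8M_UR50D` checkpoint and `transformers` API are assumptions for illustration:

```python
# Sketch: approximate pseudo-perplexity, masking ~10% of residues per pass.
# Assumes `transformers` and `torch` are installed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def approx_pseudo_perplexity(seq: str, mask_fraction: float = 0.10) -> float:
    enc = tokenizer(seq, return_tensors="pt")
    ids = enc["input_ids"]
    positions = list(range(1, ids.shape[1] - 1))  # residue positions only
    n_passes = max(1, round(1 / mask_fraction))   # ~10 passes for 10% masking
    nll, count = 0.0, 0
    with torch.no_grad():
        for offset in range(n_passes):
            subset = positions[offset::n_passes]  # strided ~10% of positions
            if not subset:
                continue
            masked = ids.clone()
            for p in subset:
                masked[0, p] = tokenizer.mask_token_id
            logits = model(input_ids=masked).logits
            for p in subset:
                logp = torch.log_softmax(logits[0, p], dim=-1)
                nll -= logp[ids[0, p]].item()
                count += 1
    return float(torch.exp(torch.tensor(nll / count)))
```

Because several positions are hidden simultaneously, the score is a slightly biased estimate of the exact PPL, but the ranking of candidate sequences is typically well preserved.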
Citation: Zeming Lin et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model," Science 379, 1123–1130 (2023). DOI: 10.1126/science.ade2574