Revolutionizing Protein Design

Latest News

In the fast-evolving world of biotechnology and medicine, the ability to design proteins with novel functions is pivotal. It promises breakthroughs across a wide array of applications, including new therapeutics, vaccines, and industrial enzymes. Traditional methods for protein design, such as the widely used Rosetta, approach this challenge through a physics-based lens, treating it as an optimization problem: they search for amino acid sequences that minimize an energy function so that the designed protein folds and functions as intended. In practice, however, these methods are hindered by their computational cost and by the limited accuracy of their underlying energy models.
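
To make the optimization framing concrete, the sketch below shows one classic search strategy: simulated annealing over sequence space. Everything here is illustrative; the `energy` function is a toy stand-in for a real physics-based score such as Rosetta's, which would evaluate folding stability in atomic detail.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def energy(sequence: str) -> float:
    """Toy stand-in for a physics-based energy function. It rewards
    alternation between hydrophobic and polar residues, purely for
    illustration; a real energy model scores folding stability."""
    hydrophobic = set("AVILMFWC")
    return -sum(
        (sequence[i] in hydrophobic) != (sequence[i + 1] in hydrophobic)
        for i in range(len(sequence) - 1)
    )

def design_by_annealing(length=30, steps=5000, temp=1.0, cooling=0.999):
    """Monte Carlo simulated annealing over sequence space: propose
    single-residue mutations, always accept downhill moves, and accept
    uphill moves with Boltzmann probability that shrinks as temp cools."""
    seq = [random.choice(AMINO_ACIDS) for _ in range(length)]
    cur_e = energy("".join(seq))
    best, best_e = list(seq), cur_e
    for _ in range(steps):
        i = random.randrange(length)
        old = seq[i]
        seq[i] = random.choice(AMINO_ACIDS)
        new_e = energy("".join(seq))
        if new_e <= cur_e or random.random() < math.exp((cur_e - new_e) / temp):
            cur_e = new_e
            if cur_e < best_e:
                best, best_e = list(seq), cur_e
        else:
            seq[i] = old  # reject the mutation and restore the residue
        temp *= cooling
    return "".join(best), best_e

print(design_by_annealing())
```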

Recently, the landscape of protein design has been transformed by the advent of deep learning approaches. These methods, which learn mappings from structure to sequence, have started to outperform traditional physics-based techniques. Generative models like ProteinMPNN have shown remarkable success in various tasks, including designing protein binders, assemblies, diversified enzymes, and conformational switches. Despite their achievements, these models face a significant bottleneck: the limited quantity and diversity of experimentally determined protein structures available for training.
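
At a high level, these models decode a sequence one residue at a time, conditioned on the fixed backbone geometry and on the residues chosen so far. The sketch below illustrates that autoregressive scheme with a hypothetical, random stand-in for the learned model; a real inverse-folding network such as ProteinMPNN computes the per-position logits from backbone geometry with a graph neural network.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def softmax(logits, temperature=0.1):
    """Temperature-scaled softmax; a low temperature concentrates
    sampling on the model's top-ranked amino acids."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dummy_logits(backbone_coords, decoded_so_far, position):
    """Hypothetical stand-in: a trained inverse-folding model derives
    these logits from the backbone structure; here they are random,
    purely so the decoding loop runs end to end."""
    return [random.gauss(0.0, 1.0) for _ in AMINO_ACIDS]

def decode_sequence(backbone_coords, logits_fn=dummy_logits, temperature=0.1):
    """Autoregressive structure-to-sequence decoding: sample each residue
    conditioned on the fixed backbone and on residues already decoded."""
    decoded = []
    for position in range(len(backbone_coords)):
        probs = softmax(logits_fn(backbone_coords, decoded, position), temperature)
        decoded.append(random.choices(AMINO_ACIDS, weights=probs)[0])
    return "".join(decoded)

# A fake 25-residue backbone (one C-alpha coordinate per residue).
backbone = [(i * 3.8, 0.0, 0.0) for i in range(25)]
print(decode_sequence(backbone))
```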

To mitigate this limitation, researchers have supplemented training data with structures predicted by tools such as AlphaFold2. This augmentation has improved the performance of protein design models, but it is still bounded by the number of structures that can be predicted accurately, and relying on predicted rather than experimentally determined structures introduces constraints of its own.

An alternative and promising approach involves protein language models. These models, trained on vast datasets of protein sequences, utilize self-supervised learning objectives to capture intricate sequence-function relationships. By predicting masked residues or the next residue in a sequence, these models can infer properties related to both structure and function. However, guiding these models to generate specific desired sequences remains a challenge. This is typically achieved through fine-tuning on curated datasets specific to certain protein families or functions. While successful, this method is limited to known protein families and doesn’t easily incorporate precise atomistic information into the design process.
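
The masked-residue objective these models train on is straightforward to illustrate. The sketch below, assuming PyTorch, builds a toy masked language model over the 20 amino acids; the model, sequences, and sizes are all illustrative, while production protein language models apply the same loss at vastly larger scale.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)  # extra token id reserved for [MASK]

def masked_lm_batch(sequences, mask_rate=0.15):
    """BERT-style masking: hide a fraction of residues and ask the model
    to recover them. Labels keep the original residue identities at
    masked positions and are ignored (-100) everywhere else."""
    ids = torch.tensor([[VOCAB[aa] for aa in s] for s in sequences])
    labels = ids.clone()
    mask = torch.rand(ids.shape) < mask_rate
    ids[mask] = MASK_ID
    labels[~mask] = -100  # CrossEntropyLoss's default ignore_index
    return ids, labels

# Toy "language model": embeddings plus one transformer encoder layer.
model = nn.Sequential(
    nn.Embedding(len(AMINO_ACIDS) + 1, 64),
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    nn.Linear(64, len(AMINO_ACIDS)),
)

ids, labels = masked_lm_batch(["MKTAYIAKQR", "GSHMASMTGG"])
logits = model(ids)  # shape: (batch, length, 20)
loss = nn.functional.cross_entropy(logits.reshape(-1, 20), labels.reshape(-1))
print(loss.item())
```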

In a groundbreaking development, researchers have introduced ProseLM (protein structure-encoded language model), a novel method that enhances protein language models by explicitly incorporating structural and functional information. ProseLM integrates details about the desired protein backbone and its context, such as nearby proteins, nucleic acids, ligands, and ions. This method leverages parameter-efficient adaptations, allowing the model to benefit from the scaling trends of underlying language models and enabling high rates of native sequence recovery.
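
While the pre-print should be consulted for the exact architecture, the general pattern of parameter-efficient adaptation is easy to sketch: the pretrained language model's weights stay frozen, and small trainable adapter modules inject per-residue structural context into its hidden states. The PyTorch sketch below shows that pattern; the layer sizes, the encoder-block stand-in, and the `StructureAdapter` module are all hypothetical, not ProseLM's actual design.

```python
import torch
import torch.nn as nn

class StructureAdapter(nn.Module):
    """Bottleneck adapter: a small trainable module inserted alongside a
    frozen language-model block. It mixes a per-residue structure encoding
    into the hidden states; only these few parameters are trained, which
    is what 'parameter-efficient adaptation' refers to."""
    def __init__(self, d_model=512, d_struct=64, d_bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model + d_struct, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, hidden, struct_enc):
        # Residual update keeps the frozen model's behavior as the baseline.
        mixed = torch.cat([hidden, struct_enc], dim=-1)
        return hidden + self.up(torch.relu(self.down(mixed)))

# Freeze a stand-in "pretrained" block; train only the adapter.
frozen_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad = False
adapter = StructureAdapter()

hidden = torch.randn(2, 100, 512)     # (batch, residues, d_model)
struct_enc = torch.randn(2, 100, 64)  # per-residue backbone/context features
out = adapter(frozen_block(hidden), struct_enc)
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(out.shape, trainable)
```

Because the underlying language model is untouched, the adapters inherit whatever improvements come from scaling it up, which is consistent with the scaling behavior the researchers report.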

Access the pre-print article here.

Events & Webinars