Exploring the Promise and Pitfalls of DEL + ML for Drug Discovery

Can machine learning (ML) and DNA-encoded libraries (DEL) work together to uncover new drug-like compounds more efficiently? A recent study presents a nuanced look at the potential and limitations of combining DEL screening with ML to identify novel small molecule binders to protein targets.

DELs are massive collections of small molecules, each tagged with DNA to track what binds to a target protein. They’re great for generating data, which makes them a natural match for ML. This study compared three DELs (one with 1 billion compounds and two smaller ones with roughly 10 million each) and trained five ML models on them to find compounds that bind two protein targets, CK1α and CK1δ. The surprising takeaway? Bigger isn’t always better.

The most diverse library (HitGen’s 1B-compound set) consistently helped ML models perform better, not just on the data they were trained on but also when predicting new binders. One of the smaller libraries (DOS-DEL) was also diverse and performed reasonably well, while the less diverse 10M-compound library lagged behind.
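For readers wondering what "diverse" can mean in practice, one common proxy is the average pairwise Tanimoto distance between molecular fingerprints. The sketch below is a minimal illustration of that idea, not the metric used in the study: the fingerprints are made-up sets of "on" bit indices, and the function names and toy libraries are hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_distance(fingerprints) -> float:
    """Average of (1 - Tanimoto) over all unordered pairs; higher means more diverse."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # fewer than two compounds: no pairs to compare
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy fingerprints (illustrative only, not real compounds).
narrow_library = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 4, 5}),
]
broad_library = [
    frozenset({1, 2, 3, 4}),
    frozenset({5, 6, 7, 8}),
    frozenset({1, 9, 10, 11}),
]

narrow_score = mean_pairwise_distance(narrow_library)
broad_score = mean_pairwise_distance(broad_library)
print(f"narrow: {narrow_score:.3f}, broad: {broad_score:.3f}")
```

All-pairs comparison scales quadratically, so for libraries with millions or billions of members a score like this would normally be estimated on a random sample rather than computed exhaustively.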

Among the five ML models tested, deep learning approaches (MLP and ChemProp) outperformed older methods like random forests and SVMs. ChemProp was especially good at staying close to known safe chemical spaces, while MLP explored more diverse regions.

Still, hit confirmation rates were modest. About 10% of predicted compounds turned out to be real binders, most of them weak. But two nanomolar hits were discovered, showing that the pipeline can deliver high-quality leads.
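To make the arithmetic behind a confirmation rate concrete, here is a small sketch. The counts are placeholders chosen to match the roughly 10% figure above; they are not the study's actual numbers, and the function name is hypothetical.

```python
def confirmation_rate(n_tested: int, n_confirmed: int) -> float:
    """Fraction of ML-predicted compounds that were confirmed as binders."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    if not 0 <= n_confirmed <= n_tested:
        raise ValueError("n_confirmed must be between 0 and n_tested")
    return n_confirmed / n_tested


# Placeholder counts for illustration only (NOT taken from the study).
example_tested = 100
example_confirmed = 10

rate = confirmation_rate(example_tested, example_confirmed)
print(f"confirmation rate: {rate:.0%}")
```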

This study makes a strong case that chemical diversity matters more than library size when training ML for drug discovery. It also suggests that deep learning is a better bet than older ML approaches, at least for these two targets.

The team made their best-performing models and training data freely available on GitHub, so others can build on their work.
