Chemical language model for identifying functionally similar molecules across structural variants

A chemical language model identifies molecules that are functionally similar to a query molecule by encoding molecules as SMILES strings and applying different canonicalization algorithms to the query and the database. The method finds structurally distinct but functionally analogous molecules, aiding drug discovery by focusing on functional rather than structural similarity.

Background

Chemical similarity searches are a fundamental tool in drug discovery, allowing researchers to identify potential drug-like molecules by comparing their structures to known compounds. Traditionally, these searches have relied on 2D and 3D structure-based comparisons, which focus on identifying molecules with high structural similarity. However, this approach can be limiting, as structurally distinct molecules can exhibit similar functionalities.

The advent of machine learning and chemical language models has opened new avenues for exploring molecular similarity beyond structural features. These models, trained on vast datasets of chemical structures, can capture complex biochemical properties and interactions, potentially enabling the identification of functionally analogous molecules that traditional methods might overlook.

Despite the promise of chemical language models, current approaches face significant challenges. These models often depend heavily on the canonicalization of SMILES strings, a linear representation of molecules, to standardize input data. This reliance on a single canonicalization method can lead to inconsistencies and limit the model’s ability to identify functionally similar but structurally diverse molecules.

Additionally, many machine learning-based searches still prioritize structural similarity, which can result in the discovery of molecules that are merely structural derivatives of the query. This focus on structural similarity can hinder the identification of novel structural classes with desired functionalities. Furthermore, the computational complexity and resource requirements of these models can pose practical limitations, particularly when dealing with large chemical databases. As a result, there is a need for innovative approaches that can effectively leverage chemical language models to explore functional similarities in chemical space.

Technology description

The technology employs a chemical language model to identify molecules with functional similarities to a query molecule, even if they are structurally different. This is accomplished by encoding molecules using SMILES strings and leveraging a chemical language model, such as ChemBERTa, to create feature vectors.
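The ranking step implied above can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes the search compares feature vectors by cosine similarity, and `toy_embed` is a hypothetical stand-in for a chemical language model encoder such as ChemBERTa (it merely hashes character bigrams and carries no chemical meaning). The names `cosine` and `rank_by_similarity` are likewise illustrative.

```python
import math
from typing import Callable, List, Sequence, Tuple


def toy_embed(smiles: str, dim: int = 16) -> List[float]:
    """Hypothetical placeholder for a language-model encoder.

    Counts character bigrams into a fixed-size vector. A real system would
    return the feature vector produced by the chemical language model.
    """
    vec = [0.0] * dim
    for i in range(len(smiles) - 1):
        bigram = smiles[i:i + 2]
        # Deterministic bucket index (avoids Python's randomized str hash).
        vec[sum(ord(ch) for ch in bigram) % dim] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: str,
    database: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k database SMILES most similar to the query."""
    query_vec = embed(query)
    scored = [(smiles, cosine(query_vec, embed(smiles))) for smiles in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


hits = rank_by_similarity("CCO", ["CCN", "c1ccccc1", "CCO"])
```

Because the encoder is passed in as the `embed` argument, the toy function could be swapped for a wrapper around an actual model without changing the ranking logic.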

The system operates in two modes: standard and discovery. In the standard mode, the same canonicalization algorithm is used for both the query and the database, while in discovery mode, different canonicalization algorithms are applied.
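The difference between the two modes can be sketched as a choice of canonicalizer pair. The code below is an assumption-laden illustration: `canon_a` and `canon_b` are hypothetical lookup-based stand-ins for two real canonicalization algorithms, built from hand-written alternative SMILES for the same molecules (ethanol and toluene), and assigning the first algorithm to the query and the second to the database in discovery mode is an arbitrary choice made here for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Canonicalizer = Callable[[str], str]

# Hand-written pairs of valid SMILES for the same molecule (ethanol, toluene).
# They stand in for the outputs of two different canonicalization algorithms.
_EQUIVALENT_FORMS = [
    ("CCO", "OCC"),
    ("Cc1ccccc1", "c1ccc(C)cc1"),
]


def canon_a(smiles: str) -> str:
    """Hypothetical canonicalizer A: maps known molecules to their first form."""
    for form_a, form_b in _EQUIVALENT_FORMS:
        if smiles in (form_a, form_b):
            return form_a
    return smiles  # unknown molecules pass through unchanged


def canon_b(smiles: str) -> str:
    """Hypothetical canonicalizer B: maps known molecules to their second form."""
    for form_a, form_b in _EQUIVALENT_FORMS:
        if smiles in (form_a, form_b):
            return form_b
    return smiles


def select_canonicalizers(mode: str) -> Tuple[Canonicalizer, Canonicalizer]:
    """Return (query canonicalizer, database canonicalizer) for a mode."""
    if mode == "standard":
        return canon_a, canon_a  # same algorithm on both sides
    if mode == "discovery":
        return canon_a, canon_b  # different algorithms on each side
    raise ValueError(f"unknown mode: {mode!r}")


def prepare_inputs(
    query: str, database: Sequence[str], mode: str
) -> Tuple[str, List[str]]:
    """Canonicalize the query and database strings according to the mode."""
    query_canon, db_canon = select_canonicalizers(mode)
    return query_canon(query), [db_canon(s) for s in database]


standard = prepare_inputs("OCC", ["CCO", "c1ccc(C)cc1"], "standard")
discovery = prepare_inputs("OCC", ["CCO", "c1ccc(C)cc1"], "discovery")
```

In standard mode the query and its database counterpart end up as identical strings, whereas in discovery mode the same molecule is written differently on each side. The prepared strings would then be passed to the language model encoder.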

The discovery mode is particularly adept at finding functionally analogous molecules that traditional structure-based searches might overlook. By altering the SMILES representation, it emphasizes functional similarities over structural ones. This approach has been tested on various molecules, including drug-like and dye-like compounds, and shows potential for identifying novel structural classes with desired functionalities, which could aid drug discovery and repurposing.

This technology is differentiated by its ability to bypass traditional reliance on structural similarity, focusing instead on functional similarities through the innovative use of SMILES canonicalization. By employing different canonicalization algorithms for the query and database, the discovery mode reduces the dependence on structural similarity, allowing the model to identify functionally similar molecules that may be structurally distinct. This represents a significant advancement over existing methods, which typically focus on structural homology.

The approach leverages the chemical language model’s ability to learn and represent complex biochemical properties, enabling the identification of molecules with similar functionalities even when they are not structurally similar. This capability is particularly valuable in drug discovery, where finding novel compounds with desired biological activities is crucial.

Benefits

  • Identifies functionally similar molecules that are structurally distinct
  • Aids in discovering novel structural classes of molecules
  • Utilizes chemical language models for enhanced molecular similarity searches
  • Reduces reliance on structural similarity, focusing on functional similarities
  • Potentially accelerates drug discovery and repurposing efforts
  • Enables identification of molecules missed by traditional structure-based searches
  • Provides a computationally efficient method for large-scale chemical searches
  • Supports discovery of new drug-like and dye-like compounds
  • Facilitates exploration of chemical space beyond traditional methods
  • Offers a novel approach to chemical similarity search using SMILES strings

Commercial applications

  • Drug discovery and repurposing
  • Functional molecule identification
  • Novel structural class discovery

Publication

https://linkinghub.elsevier.com/retrieve/pii/S2666389923002490?showall=true