Polymorphs are crystal forms in which molecules of the same chemical composition are packed in different arrangements. Researchers at GlaxoSmithKline (GSK) and the Cambridge Crystallographic Data Centre (CCDC) have published a research paper in which they combined their respective datasets to better train machine learning (ML) models to find stable polymorphs for use in new drug candidates.
The two datasets, from GSK and the CCDC's Cambridge Structural Database (CSD), are distinct. Over the past century, scientists around the world have contributed experimental crystal structures from their published research, and the CSD now holds more than 1.1 million structures. GSK’s structures were collected at different stages of the pharmaceutical process and are not limited to marketed products. The authors combined a subset of the CSD with a subset of GSK’s structures. Co-author Dr Jason Cole, senior research fellow on CCDC’s research and development team, explained in a press release originally published on the Cambridge Crystallographic Data Centre website why structures gathered at different stages of the drug discovery pipeline are so important.
“In early-stage drug discovery, a crystal structure can help to rationalize conformational effects, for example, or characterize the chemistry of a new chemical entity where other techniques have led to ambiguity,” Cole said. “Later in the process, when a new chemical entity is studied as a candidate molecule, crystal structures are critical as they inform form selection and can later aid in overcoming formulation and tabletting issues.”
This information can help researchers prioritize their efforts, saving time and potentially lives down the road.
“By understanding a range of crystal structures, scientists can also assess the risk of a given form being long-term unstable,” Cole said. “A full characterization of the structural landscape leads to confidence in taking a form forward.”
“You will only find co-crystals if you look for co-crystals,” Cole said, as an example. “Most companies prefer to formulate a free, or unbound, drug. One can assume that the types of structures in an industrial set reflect conscious decisions to search for forms of given types, whereas fewer bounds are placed on the researchers who contribute to the CSD.”
“Large amounts of data lead to more confident predictions,” Cole said. “Data that are most directly relevant to the problem lead to more accurate predictions. In the predictions that use CCDC software, we select a subset of the most relevant entries that is large enough to give confidence. The GSK set is bound to have highly relevant compounds to other compounds in their commercial portfolio. So the model-building software can use these.”
“Consider that CSD software typically picks around 2,000 structures from the 1.1 million in the CSD,” Cole said. “The industrial set is tiny by comparison, but you could pick, say, 40 or 50 highly relevant structures. You’d have insufficient data to build a good model with that alone, but the added compounds from the CSD supplement the data set. In essence, by including the GSK and CSD sets we get the best of both worlds: all the highly relevant industrial structures and a set of quite relevant CSD structures together to build a high-quality model.”
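The selection Cole describes, a handful of highly relevant industrial structures supplemented by a larger set of relevant public ones, can be illustrated with a toy sketch. This is not the CCDC software or the authors' method: the `Structure` class, the feature sets, the Jaccard relevance score, and the `min_relevance` threshold are all hypothetical stand-ins chosen only to show the idea of pooling two sources by relevance to a target compound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    identifier: str       # hypothetical ID, not a real database refcode
    source: str           # "industrial" or "CSD"
    features: frozenset   # hypothetical chemical descriptors


def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_training_set(target, industrial, public, n_public=2000, min_relevance=0.5):
    """Pool highly relevant industrial structures with the top-ranked public ones."""
    # Keep every industrial structure that clears the (assumed) relevance threshold.
    chosen_industrial = [
        s for s in industrial if jaccard(target, s.features) >= min_relevance
    ]
    # Rank public structures by relevance and keep the n_public best.
    ranked_public = sorted(
        public, key=lambda s: jaccard(target, s.features), reverse=True
    )
    return chosen_industrial + ranked_public[:n_public]


# Made-up example data for a target compound with three descriptors.
target = frozenset({"amide", "phenyl", "hydroxyl"})
industrial = [
    Structure("IND-1", "industrial", frozenset({"amide", "phenyl"})),
    Structure("IND-2", "industrial", frozenset({"nitro"})),
]
public = [
    Structure("PUB-1", "CSD", frozenset({"amide"})),
    Structure("PUB-2", "CSD", frozenset({"amide", "phenyl", "hydroxyl"})),
    Structure("PUB-3", "CSD", frozenset({"sulfonyl"})),
]

training = select_training_set(target, industrial, public, n_public=2)
print([s.identifier for s in training])  # ['IND-1', 'PUB-2', 'PUB-1']
```

In this sketch the industrial set contributes only the entries that closely match the target, while the public set fills out the numbers, mirroring the "best of both worlds" combination described in the quote.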
Source: indiaai.gov.in