Catalysis is a fundamental technology required to solve many of the most important challenges that society will face in this century. Examples include hydrogen production from water and sunlight, CO2 recycling into valuable products, and the clean production of chemicals. Catalysts accelerate these processes, reducing both the energy required and the waste generated. A very large number of chemical compounds could potentially catalyze chemical reactions of interest; however, only a few will be active, selective, and robust at the high levels required for industrial applications. The catLEGOS project will tackle the problem of finding optimal catalysts with a new approach that combines quantum mechanics with artificial intelligence methods.
catLEGOS will generate large and complex datasets by means of calculations based on the principles of quantum mechanics. These data will be used to build predictive models with machine learning methods. The methods will include deep neural networks, which are inspired by the biological structure of nervous tissue, and Gaussian processes, which exploit probability theory. The neural networks will enable the fast and accurate screening of potential catalysts, identifying the molecular fragments (LEGOS) that compose them. Combinations of these fragments into new catalysts will be further investigated with the Gaussian processes.
catLEGOS will also develop new mathematical representations for machine learning applied to catalysis, focusing on their physical and chemical meaning. These representations will provide a means of explaining the predictions of the models. The explanations will be used to construct rational design models for the development of new catalysts. To achieve these goals, catLEGOS will follow an interdisciplinary approach combining computational chemistry with elements of statistical theory and informatics.
In a recent study, we showed that Gaussian processes (GPs) can be trained with density functional theory (DFT) data to predict the energy barriers of fundamental reactions in homogeneous catalysis (Balcells et al., Chem. Sci., 2020, 11, 4584). The key advantage of these models is that they achieve high accuracy (MAE of ca. 1 kcal/mol) with small training datasets. The catLEGOS project will take this approach to the next level by developing a recommender system for catalysis based on deep neural networks (DNNRc). The DNNRc will enable catalyst discovery by defining the chemical subspaces explored by the GPs, which are otherwise arbitrary. The subspaces will be built with active metal and ligand fragments (molecular LEGOS) provided by the DNNRc. The catLEGOS project will also expand the tmQM dataset (Balcells et al., J. Chem. Inf. Model., 2020, 60, 6135), adding thermodynamic parameters for ~100k transition metal complexes and the mNBOg graph, a novel multilayer graph representation based on natural bond orbital analysis. Both deliverables will be used in the development of the DNNRc and GP models, which will be tested in the discovery of catalysts for the water oxidation and CO2 reduction reactions.
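The idea of training a GP on a small set of computed energy barriers can be illustrated with a minimal sketch in plain Python. It is a generic GP regression with a squared-exponential kernel, not the model of the cited study. The two-component descriptor vectors, the barrier values, the kernel choice, and all function names are assumptions made for illustration; the numbers are synthetic placeholders, not DFT results.

```python
import math

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return variance * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve_lower(lower, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        s = sum(lower[i][k] * y[k] for k in range(i))
        y.append((b[i] - s) / lower[i][i])
    return y

def solve_lower_transposed(lower, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / lower[i][i]
    return x

def gp_predict(train_x, train_y, query_x, noise=1e-6):
    """Return the GP posterior mean and standard deviation at query_x."""
    n = len(train_x)
    mean_y = sum(train_y) / n  # constant prior mean (an assumption)
    gram = [[rbf_kernel(train_x[i], train_x[j]) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    lower = cholesky(gram)
    centered = [y - mean_y for y in train_y]
    alpha = solve_lower_transposed(lower, solve_lower(lower, centered))
    k_star = [rbf_kernel(x, query_x) for x in train_x]
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(lower, k_star)
    variance = rbf_kernel(query_x, query_x) - sum(vi * vi for vi in v)
    return mean, math.sqrt(max(variance, 0.0))

# Synthetic placeholders: hypothetical descriptors and barriers in kcal/mol.
train_x = [[0.0, 0.0], [1.0, 0.5], [2.0, 0.2], [3.0, 1.0]]
train_y = [18.2, 21.5, 24.1, 27.8]

barrier, uncertainty = gp_predict(train_x, train_y, [1.5, 0.4])
print(f"predicted barrier: {barrier:.2f} +/- {uncertainty:.2f} kcal/mol")
```

The predictive standard deviation is the property that makes GPs attractive for small datasets: it is small near the training points and grows in unexplored regions of descriptor space, which indicates where additional calculations would be most informative.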