
BIA-Brukerstyrt innovasjonsarena

Iris.ai - the AI Chemist

Alternative title: Iris.ai - Kjemikeren av kunstig intelligens

Awarded: NOK 9.0 mill.

To achieve the overall AI Chemist project goal, Iris.ai has undertaken the following research projects:

1. Domain-specific word embeddings
Domain adaptation of embedding models is a proven technique for domains that have insufficient data to train an effective model from scratch. Chemistry is one such domain, where scientific jargon and overloaded terminology inhibit the performance of a general language model. In this project, we have researched two techniques:

1.a. Spherical embeddings
The recently proposed spherical embedding model (JoSE) jointly learns word and document embeddings on a multi-dimensional unit sphere during training, and performs well on document classification and word correlation tasks. However, we show that a non-convergence caused by global rotations during training prevents it from being used for domain adaptation. We have developed methods to counter the global rotation of the embedding space and propose strategies for updating words and documents during domain-specific training. In this work, we show that our strategies can reduce the performance cost of domain adaptation to a level similar to that of Word2Vec.

1.b. Latent semantic imputation (LSI)
Reliable word embeddings require large amounts of textual data. In highly specialized domains such as chemistry, the most significant entities can be very rare. Often, a great deal of information about these entities is available, not as text but as measured physical or chemical properties. Latent Semantic Imputation (LSI) is an approach that enhances a generic word embedding matrix with external information from textual or non-textual domain data. We have shown that LSI is applicable to scientific text, where problems with rare and novel words are particularly acute, and that it can also work with relational domain data, thus opening up a broader range of data sources.
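The imputation idea described under 1.b can be illustrated with a minimal sketch. It is not the project's implementation: the function name `impute_embeddings`, the inverse-distance neighbour weights and the toy data are all assumptions made for illustration, and the published LSI method builds and weights its graph in a more elaborate way. The sketch only shows the core mechanism: entities that lack a reliable text embedding receive one from their nearest neighbours in a domain-data space (for example measured properties), while known embeddings stay fixed.

```python
import numpy as np

def impute_embeddings(domain_features, embeddings, known_mask, k=2, n_iter=50):
    """Fill in missing embedding rows from neighbours in domain-feature space.

    Simplified illustration only: neighbour weights are inverse distances,
    which is an assumption and not the weighting used by published LSI.
    """
    n = domain_features.shape[0]
    # Pairwise Euclidean distances between entities in the domain space.
    diffs = domain_features[:, None, :] - domain_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # an entity is not its own neighbour

    # Row-stochastic weights over each entity's k nearest neighbours.
    weights = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dists[i])[:k]
        w = 1.0 / (dists[i, nbrs] + 1e-9)
        weights[i, nbrs] = w / w.sum()

    # Unknown rows start at zero; known rows are reset after every step.
    result = np.where(known_mask[:, None], embeddings, 0.0)
    for _ in range(n_iter):
        propagated = weights @ result
        result = np.where(known_mask[:, None], embeddings, propagated)
    return result

# Toy data (hypothetical): three entities described by one measured property.
features = np.array([[0.0], [1.0], [2.0]])
text_embeddings = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
known = np.array([True, False, True])  # the middle entity has no embedding

imputed = impute_embeddings(features, text_embeddings, known, k=2)
print(imputed[1])  # lies halfway between its two equally distant neighbours
```

In this toy case the unknown entity sits exactly between two known ones in property space, so its imputed vector is the average of their embeddings.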
These results show that LSI is a suitable methodology for controlled updates and improvements of scientific word embedding models based on domain-specific knowledge graphs.

2. Embeddings evaluation framework
Evaluation of generic word embeddings has received a lot of attention in the NLP community. Embedding evaluation tasks are generally categorized in the literature as either intrinsic or extrinsic. Intrinsic evaluations probe the geometrical structure of the generated embedding vector space and need nothing beyond the raw word vectors. Extrinsic evaluations use the word embeddings as an input layer to a task-specific machine learning (ML) model. This project aimed to develop a suite of transferable intrinsic and extrinsic tasks for domain-specific word-embedding evaluation that can be applied to chemistry. We combine the ideas of an extrinsic test suite, VecEval, and an intrinsic test suite, the LDT toolkit, to design an automated pipeline for evaluating embeddings on a range of intrinsic and extrinsic tasks. Our current progress includes implementations of semantic partitioning as an intrinsic task, and of named-entity recognition (NER) and document classification as extrinsic tasks.

3. Knowledge graph building
To identify novel applications for existing compounds from millions of research papers, it is crucial to build a knowledge graph that helps navigate these publications. To do so, we have collaborated (1) with the CORE research group at the Open University (OU) to determine the types of citations used in the literature; and (2) with the KnowLab at University College London (UCL) to understand how to enrich a human-annotated ontology with word embeddings.

3.a. Citation typing (OU)
With the CORE team, we conducted a survey of citation typing approaches and used it to motivate the focus areas in our methodology.
We selected the dataset on which our future experiments will be conducted, and we ran a shared task with over 20 participating international teams to establish the baseline against which we will measure our progress. This is a solid way to establish a baseline that is not trivial but demonstrably state of the art. Our work also highlights which machine learning models tend to be successful on these tasks, which will inform our future collaborative work.

3.b. Ontology enrichment (UCL)
In this project, we combined the in-house domain-specific word embeddings (DSWE) with a simple ontology created by domain experts. The ontology is expected to contain key entities, which define key concepts and relations in a specific domain. Through named entity recognition and disambiguation, and with the aid of the DSWE, we have shown how to enrich and expand the simple ontology by injecting contextual information about entities identified in text.
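The global-rotation issue mentioned under 1.a can be made concrete with a short sketch. The summary does not say which method the project uses to counter the rotation, so this example swaps in a standard technique, orthogonal Procrustes alignment, purely as an illustration. The function name `align_rotation` and the synthetic data are assumptions. The point is that an embedding space on the unit sphere can drift by a global orthogonal map (a rotation, possibly with a reflection) without changing any pairwise similarity, and that such drift can be estimated and undone using words shared between the two spaces.

```python
import numpy as np

def align_rotation(reference, drifted):
    """Map `drifted` back onto `reference` with the best orthogonal matrix.

    Solves min_R ||drifted @ R - reference||_F over orthogonal R
    (orthogonal Procrustes); R = U @ Vt for the SVD of drifted.T @ reference.
    """
    u, _, vt = np.linalg.svd(drifted.T @ reference)
    rotation = u @ vt
    return drifted @ rotation, rotation

# Synthetic stand-in for embeddings on the unit sphere (hypothetical data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 8))
reference /= np.linalg.norm(reference, axis=1, keepdims=True)

# Simulate a global drift: one random orthogonal map applied to every vector.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
drifted = reference @ q

aligned, rotation = align_rotation(reference, drifted)
print(np.abs(aligned - reference).max())  # residual after undoing the drift
```

Because the estimated map is orthogonal, the aligned vectors keep unit length, so the correction stays on the sphere.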

The research undertaken in the Iris.ai - the AI Chemist project has enabled us to advance our work towards the "AI Researcher". This has opened up a set of brand new market opportunities for us, from research institutes to corporate R&D and even publishing houses. The direct results of the AI/ML research have, for example, enabled our commercial collaboration with the not-for-profit Materiom, whose goal is to make publicly available a database of material data (ingredients, recipes and resulting properties) from over 50,000 research papers. Our table extraction, named entity recognition and other ML models have enabled us to extract, systematize and link this data and to populate the database automatically. The database will be used by researchers looking for non-petrochemical alternatives for their use cases. Materiom is one of many clients with whom we are already undertaking such projects, which have both commercial and environmental value. Thus, the expected effects of the research project have already been demonstrated, and the potential effects remain major, as described in the initial application: chemicals and materials are widely used in our daily life, in our homes, cars, electronics, food and medications; in fact, in about 95 % of all goods we consume or use. Offering tools that help the chemical and material science industry develop sustainable materials, better battery technology and food for everyone will be a vital impact of this project.

Iris.ai is building a set of innovative artificial intelligence tools for chemical research. Using the latest breakthroughs in text understanding, these tools will allow researchers to do automatically what today not only has to be done manually, but is often so tedious and time-consuming that it cannot be done at all. These tasks include identifying novel application areas for existing compounds by scanning millions of research papers and patents, finding both applications that are described directly and applications that can be inferred from several sources. The key R&D challenge is to develop an artificial intelligence algorithmic core engine for natural language understanding, mainly concerning similarity, compositionality, causality and ranking metrics. More specifically, the research challenge for this project is to research and develop domain-specific knowledge discovery with context-aware word embeddings as well as domain-specific entity embeddings. The engine should be able to build a unique representation of a given chemical element, link it to the existing written knowledge (patents, science articles, etc.) about that element or similar elements, and finally organize that knowledge into application areas and present it to the user. Additionally, we will extend the engine so that it can infer application areas that are not explicitly stated in the literature but follow from links between connected elements in the body of knowledge. We will verify these objectives in close collaboration with clients from the chemical industry, who will provide us with an ontology of interest and examples from their day-to-day work. We will also use publicly available open access repositories of chemistry-related textual information, and registries and databases of elements, molecules and compounds, to validate the embedding space.

Funding scheme:

BIA-Brukerstyrt innovasjonsarena