Assessing the health of ecosystems is essential for understanding the impact of human activities. The health of an ecosystem, such as an aquatic ecosystem, can be considered the sum of the impacts on the organisms that occupy it. One of the largest concerns is the release of chemicals (pollutants) and their adverse impact on these organisms. To assess the potential for adverse impact in real-life exposure scenarios, the concentration of a pollutant is often measured in different matrices (e.g. water, sediment, air or the organism itself) and compared to thresholds below which no effects are expected to occur. These safe thresholds are identified on the basis of controlled laboratory studies with animals, which raise substantial controversy about whether the benefit outweighs the ethical concerns and monetary cost. The international research community is therefore taking measures to reduce, refine and replace such tests. As a result, prediction methods such as QSAR models and "Read-Across" approaches are increasingly developed and used for defining safe thresholds. Quantitative structure-activity relationship (QSAR) methods predict the potential toxicity to different organisms, using the properties of an individual chemical to determine how, and to what degree, the chemical causes toxic impact to a given organism or organism group. Read-Across also uses existing experimental data, but has wider application potential because it draws on a larger number of chemicals and species to fill knowledge gaps.
This thesis investigates a hybrid of the two methods described above by introducing background knowledge into the modelling methods. This background knowledge takes the form of a knowledge graph, which is a collection of facts expressed in a way that is both machine and human readable. The knowledge graph contains facts about species, chemicals, and existing laboratory experiments, as well as large amounts of related metadata. Knowledge graphs are symbolic knowledge, which is not ideal for modelling methods such as machine learning, where numerical values are necessary. Therefore, we employ knowledge graph embedding methods. The task of these methods is to turn entities (e.g., a chemical) in the knowledge graph into numerical representations in the form of vectors. These methods take the structure of the knowledge graph into account and try to preserve it as well as possible in the numerical representation.
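To make the idea of a knowledge graph embedding concrete, the following is a minimal sketch of one well-known embedding method, TransE, which learns vectors such that head + relation ≈ tail for each fact. The choice of TransE, the toy facts, and all names (`TRIPLES`, `train_transe`, the relation labels) are assumptions made for illustration only; they are not taken from the thesis, which may use other embedding methods and a far larger graph.

```python
import numpy as np

# Hypothetical toy facts (head, relation, tail); not data from the thesis.
TRIPLES = [
    ("chemical_A", "similar_to", "chemical_B"),
    ("species_X", "subclass_of", "genus_G"),
    ("species_Y", "subclass_of", "genus_G"),
    ("chemical_A", "tested_on", "species_X"),
]


def train_transe(triples, dim=8, epochs=200, lr=0.05, margin=1.0, seed=0):
    """Learn entity and relation vectors with a simple TransE-style update."""
    rng = np.random.default_rng(seed)
    entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    relations = sorted({r for _, r, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {r: i for i, r in enumerate(relations)}
    E = rng.normal(scale=0.1, size=(len(entities), dim))
    R = rng.normal(scale=0.1, size=(len(relations), dim))

    for _ in range(epochs):
        for h, r, t in triples:
            hi, ri, ti = e_idx[h], r_idx[r], e_idx[t]
            # Negative sample: replace the tail with a random entity.
            # (Simplification: we do not check whether the corrupted
            # fact happens to be true in the graph.)
            ci = int(rng.integers(len(entities)))
            if ci == ti:
                continue
            pos = E[hi] + R[ri] - E[ti]  # should be close to zero
            neg = E[hi] + R[ri] - E[ci]  # should be far from zero
            # Margin ranking loss on squared L2 distances.
            if margin + pos @ pos - neg @ neg > 0:
                grad = 2.0 * (pos - neg)
                E[hi] -= lr * grad
                R[ri] -= lr * grad
                E[ti] += lr * 2.0 * pos
                E[ci] -= lr * 2.0 * neg
        # Keep entity vectors inside the unit ball, as in TransE.
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        E /= np.maximum(norms, 1.0)

    ent_vecs = {e: E[i] for e, i in e_idx.items()}
    rel_vecs = {r: R[i] for r, i in r_idx.items()}
    return ent_vecs, rel_vecs


entity_vectors, relation_vectors = train_transe(TRIPLES)
```

After training, `entity_vectors["chemical_A"]` is a plain numerical vector that a downstream model can consume, which is the property the paragraph above relies on.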
Now that the knowledge graph is represented numerically, we can learn relationships between the representations of species and chemicals, and the toxic effect the latter has on the former. We found that including the background knowledge in this modelling method increased the prediction accuracy over a method using chemical and taxonomic similarity alone. The modelling methods used are inherently difficult to explain; that is, the relationships learned from the data can be very complex, and it is impossible to derive them in a simple way. Therefore, we use the knowledge graph in a few ways to increase our understanding of how the model makes predictions. We look at how much data is available in the knowledge graph, and it turns out that the amount of data correlates with the error of each prediction. This is not surprising, but it is an interesting result. We also employ the knowledge graph to provide facts that are relevant for the prediction. These facts can be presented to a domain expert, who can draw conclusions about areas where knowledge is lacking and needs to be expanded.
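As an illustration of learning from the numerical representations, the sketch below concatenates a chemical vector and a species vector into one feature vector and fits a ridge regression to a toxicity value. This is only a stand-in: the thesis does not specify its model here, and the random vectors, the random targets, and the helper names (`pair_features`, `fit_ridge`, `predict`) are hypothetical placeholders rather than real embeddings or experimental results.

```python
import numpy as np


def pair_features(chem_vec, species_vec):
    """Represent a (chemical, species) pair by concatenating both vectors."""
    return np.concatenate([chem_vec, species_vec])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the bias term is regularized too."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(w, X):
    """Apply the fitted weights to new pair features."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


rng = np.random.default_rng(0)
# Placeholder embeddings: 5 chemicals and 3 species, 4 dimensions each.
chem = rng.normal(size=(5, 4))
spec = rng.normal(size=(3, 4))
pairs = [(c, s) for c in range(5) for s in range(3)]
X = np.stack([pair_features(chem[c], spec[s]) for c, s in pairs])
# Placeholder targets standing in for a measured effect value.
y = rng.normal(size=len(pairs))

w = fit_ridge(X, y)
pred = predict(w, X)
```

In practice the placeholder arrays would be replaced by embeddings learned from the knowledge graph and by measured effect values, and a more flexible model could take the place of the linear one.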
We have demonstrated, using machine learning and knowledge graphs, that it is possible to develop large-scale, generic models for toxicity. Nevertheless, large areas of this field remain unknown, and further research is needed to increase the robustness and explainability of the models.
The PhD work has led to the integration of two key scientific disciplines, data science (informatics) and (eco)toxicology, to promote the use of advanced modelling for hazard assessment. The effort has brought together scientists from Norway and the United Kingdom to perform high-level research, develop new areas of expertise and establish a collaborative platform for future initiatives. The PhD work has introduced novel approaches to the research arena, and it may facilitate advancement of the methods within Next Generation Risk Assessment (NGRA). Although the work is currently in its infancy, even within the research arena, future efforts are envisioned to have potential utility also for regulatory science and industry, helping to integrate disparate data sources and fill data gaps where such exist (e.g. read-across in regulatory decision making).