IKTPLUSS-IKT og digital innovasjon

Function-driven Data Learning in High Dimension

Awarded: NOK 7.2 mill.

High-dimensional data play an increasingly important role in most computer-based measurements and analyses of real-life problems. This development is due to several technological advances, such as improved physical measurement and storage capabilities, and multiphysics computer simulations of complex phenomena. Analyses of such data can reveal complex interactions between entities in a natural process, or in man-made environments such as Instagram or a Google search. At the same time, we still struggle to extract useful information and derive predictive models. Data-driven modelling is an emerging and challenging area in applied mathematics with enormous potential, especially if successfully combined with other branches of science, such as computer science, engineering, or biomedical computing.

Motivated by the increased demand for robust and predictive methods, we developed advanced mathematical methods for robust automatic learning of functions and data structures in high dimension from a minimal number of observed samples. By obtaining lower-dimensional representations of the data and their geometry, we are able to perform the approximation with significantly reduced complexity under realistic assumptions.

In the first part of the project, we studied single index models (SIMs) and their nonlinear generalisation. These are simple yet flexible semi-parametric models for machine learning, in which the response variable is modelled as a monotonic function of a linear combination of features. Drawing on a range of mathematical techniques, including the theory of inverse and ill-posed problems, learning and approximation theory, differential geometry, harmonic analysis, and sparse optimization, we developed a rigorous theoretical analysis and an efficient algorithm for learning the SIM and its nonlinear generalisation in the high-dimensional regime. Numerical results on real-life datasets show that our algorithm outperforms state-of-the-art models.

Since shallow and deep networks are essentially sums and compositions of functions that follow the single index model, we used our theoretical results on SIMs to obtain a better understanding of the identifiability and global minimization of neural networks. As part of the FunDaHD project, we provided novel insights into the mathematics of deep learning. We have presented the results at renowned international conferences in machine learning and published them in high-impact journals.
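
To make the single index model concrete, the following is a minimal illustrative sketch, not the algorithm developed in the project. It assumes standard Gaussian features, a tanh link function, and a sparse index vector, all of which are hypothetical choices made for this example. Under the Gaussian assumption, the average of the products of centred responses and feature vectors is proportional to the index vector, which gives a very simple baseline estimate of its direction.

```python
import numpy as np


def simulate_sim(n, d, rng):
    """Draw n samples from y = g(<a, x>) + noise with a monotone link g.

    The link g = tanh and the sparse index vector a are hypothetical
    choices for illustration only.
    """
    a = np.zeros(d)
    a[:3] = [1.0, -2.0, 2.0]            # only three active features
    a /= np.linalg.norm(a)              # unit-norm index vector
    X = rng.standard_normal((n, d))     # standard Gaussian features (assumption)
    y = np.tanh(X @ a) + 0.05 * rng.standard_normal(n)
    return X, y, a


def estimate_direction(X, y):
    """Baseline estimate of the index direction.

    For standard Gaussian features, the mean of (y_i - mean(y)) * x_i is
    proportional to the index vector, so normalising it gives the direction.
    """
    v = X.T @ (y - y.mean()) / len(y)
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)
X, y, a = simulate_sim(n=5000, d=20, rng=rng)
a_hat = estimate_direction(X, y)

# Absolute cosine between the estimated and the true direction (1.0 is perfect).
alignment = float(abs(a_hat @ a))
print(f"alignment with true index vector: {alignment:.3f}")
```

Once the direction is known, the remaining task is a one-dimensional regression of the response on the projected features, which is what makes this model class attractive in high dimension.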

We designed novel and generic model-based approaches for learning functions in high dimensions from a minimal number of observed samples. We considered different classes of models, ranging from simple generalized linear models to complex neural networks. The resulting approaches are:

1. statistically efficient
2. computationally efficient
3. theoretically sound
4. numerically viable

Our results are presented in 1 PhD thesis, 12 research articles, 4 refereed conference proceedings, and 3 book chapters, and in more than 30 scientific presentations. The FunDaHD results have the potential to impact several scientific and technological disciplines. They provide a solid foundation and tools for obtaining a better mathematical understanding of neural networks. Beyond fundamental mathematical investigations, our goal is to apply our estimators to various types of big data applications, such as cardiac modeling and the analysis of brain activity.

Technological advances in physical measurements, computer simulations, and storage capabilities have led to vast amounts of often highly complex data, and the volume continues to grow rapidly. Despite substantial R&D to develop tools for complex data analysis, in many cases our understanding of how to extract useful information and predictive models remains rather limited. Motivated by the increased demand for robust predictive methods, in this project we develop analysis techniques and numerical methods to explore new applications in tractable and robust automatic learning of functions and data structures in high dimension from a minimal number of observed samples.

The approach we propose is to obtain lower-dimensional representations of the data and their geometry, and so to perform the approximation with significantly reduced complexity, assuming the data are clustered around manifolds. Our key innovative assumption is that the underlying manifold not only has lower dimension, but also has tangent spaces spanned by relatively sparse principal directions. In addition, we consider the learning of the manifold as guided by the function acting on the data. Hence, our approach differs from established methods for manifold learning and establishes a novel connection between manifold and function learning in high dimension, leading to the development of more robust algorithms under less restrictive assumptions. Eventually, as the most ambitious part of the project, we will address learning from high-dimensional multi-manifold data.

To demonstrate the performance and robustness of the constructed algorithms, a variety of experiments and numerical tests will be performed throughout the project with real and synthetic data. We will also address several problems in computational biomedicine, such as cardiac modeling, and in bioinformatics, such as gene expression. This project will also contribute to strengthening the profile of Simula in big data analysis.
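
The idea of data clustered around a manifold whose tangent spaces are spanned by sparse directions can be illustrated with a small sketch. It uses plain local principal component analysis, a standard technique and not the method proposed in the project, on a hypothetical example: a noisy circle embedded in ten dimensions that only involves the first two coordinates. The function name `local_tangent` and all parameter values are assumptions made for this example.

```python
import numpy as np


def local_tangent(points, center, k, dim):
    """Estimate a tangent basis at `center` by local PCA.

    Takes the k nearest points to `center`, centres them, and returns the
    top `dim` right singular vectors as an orthonormal tangent basis.
    """
    dists = np.linalg.norm(points - center, axis=1)
    neighbours = points[np.argsort(dists)[:k]]
    centred = neighbours - neighbours.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:dim]


rng = np.random.default_rng(0)

# Hypothetical data: a unit circle in the first two of ten coordinates,
# perturbed by small Gaussian noise in every coordinate.
n, d = 2000, 10
theta = rng.uniform(0.0, 2.0 * np.pi, n)
points = np.zeros((n, d))
points[:, 0] = np.cos(theta)
points[:, 1] = np.sin(theta)
points += 0.01 * rng.standard_normal(points.shape)

# At the point (1, 0, ..., 0) the true tangent direction is the second
# coordinate axis, which is a sparse vector.
center = np.zeros(d)
center[0] = 1.0
tangent = local_tangent(points, center, k=50, dim=1)

# Absolute weight of the estimated tangent on the second coordinate.
tangent_alignment = float(abs(tangent[0, 1]))
print(f"weight on the true tangent coordinate: {tangent_alignment:.3f}")
```

In this toy setting the estimated tangent vector concentrates almost all of its weight on a single coordinate, which is the kind of sparsity in the principal directions that the assumption above refers to.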

Funding scheme:

IKTPLUSS-IKT og digital innovasjon