We consider data-driven artificial intelligence over long sequences. Many kinds of real-world data are intrinsically sequential, for example text, speech, music, time series, DNA sequences, and the unfolding of events. However, conventional data science methods can process only short sequences of up to a few thousand steps. In this project, we develop a scalable method that enables efficient and accurate inference for very long sequences of up to millions or even billions of steps. At the end of the project, we will deliver theoretical breakthroughs such as new models with guarantees, as well as practical outcomes such as computer software and visualization tools. Our research findings will be applied to two focus areas: 1) microbiology and infectious disease epidemiology and 2) remote sensing pattern recognition. Moreover, because long sequential data are commonly available in many areas, our method can be applied as a critical component in a wide range of tasks, including scientific research, medical and health services, natural language processing, financial data analysis, and market studies.
Between 01.10.2021 and 30.11.2022, we achieved the following:
- Both planned PhD students are progressing well in their studies and research.
- A research visit to DTU took place in May 2022. A conference trip to NeurIPS is planned for December 2022.
- The neural network architecture has been further enhanced, with three variants published in Level-2 conferences or journals.
- Our methods have achieved substantial improvements over previous machine learning approaches in long document classification, DNA-based taxonomy classification, genetic variant prediction, and gene expression analysis.
- Since the start of the project, 19 relevant papers have been accepted or published. Three papers have been submitted to ICLR 2023, CVPR 2023, and Bioinformatics, respectively.
In the past decade, machine learning (ML), especially deep learning, has brought us many successful data-driven AI applications. Many kinds of real-world data are intrinsically sequential, for example text, speech, music, time series, DNA sequences, and the unfolding of events. However, conventional deep learning methods can process only short sequences of up to a few thousand steps. Existing approaches often face challenges such as slow inference, vanishing (and exploding) gradients, and difficulties in capturing long-term dependencies. In this project, we develop a scalable machine learning method that enables efficient and accurate inference for very long sequences of up to millions or even billions of steps. At the end of the project, we will deliver a versatile ML framework based on deep neural networks, together with efficient optimization algorithms, computer software, and visualization tools. Our research findings will be applied to two focus areas: 1) microbiology and infectious disease epidemiology and 2) remote sensing pattern recognition. Moreover, because long sequential data are commonly available in many areas, our method can be applied as a critical component in a wide range of tasks, including scientific research, next-generation DNA sequence analysis, natural language processing, financial data analysis, and market studies.