NAERINGSPH-Nærings-phd

Model based testing to support dynamic generation of complex test data

Alternative title: Modellbasert, dynamisk generering av komplekse testdata

Awarded: NOK 1.6 mill.

Traditionally, software development teams in many industries have used copies of production databases, or masked, anonymized, or obfuscated versions of them, for testing. In recent years, however, regulatory frameworks such as the General Data Protection Regulation (GDPR) have prohibited these practices. In such a situation, there is often a need to generate synthetic but production-like test data, i.e., test data that is statistically representative of the production data and conforms to the domain's constraints, to support software testing activities. In this thesis, we address this need by presenting a novel approach for generating production-like test data using deep learning techniques and by studying the practical effectiveness of the proposed approach in industrial settings.

We lay the foundation of the research by conducting a case study with our industrial collaboration partner, the Norwegian National Population Registry (NPR), to identify the test data needs in cross-organization integration testing between NPR and other organizations in the Norwegian public and private sectors. We then propose a solution for generating production-like test data that meets the identified needs. In this solution, we frame the problem of generating production-like test data as a language modelling problem and use deep learning based language modelling techniques to build statistical models from production data; a statistical model is then integrated with downstream components to generate production-like test data. Furthermore, based on the identified test data needs, we propose an evaluation framework to quantitatively measure the quality of the generated data. This framework is also used to evaluate language model performance and the effectiveness of the solution as a whole.

To evaluate the industrial applicability of our solution, we applied it to the integration testing of our industrial collaboration partner, the NPR. Within the context of NPR, we experimented with three of the most successful deep learning algorithms for language modelling, namely Recurrent Neural Networks (RNNs), Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs), to train language models from the NPR production data, and we evaluated and compared their performance using the proposed evaluation framework. The results show that our solution can generate test data that is statistically representative of the production data and conforms to the business rules of the domain. Moreover, the RNN model generates data with high syntactic and semantic validity that is highly representative of the real NPR data, outperforming the language models trained with the other two algorithms.

To further enhance the solution, we propose an approach for designing a domain-specific language (DSL) to achieve higher expressiveness and information capacity. A DSL designed with this approach allows us to better leverage the capabilities of the deep learning technology and generate even richer and more production-like test data. Applying this approach to the NPR domain, we experimented with the new DSL and the RNN algorithm to train a new language model from production data and evaluated its performance with the proposed evaluation framework. The experimental results show that, despite its higher information capacity and enhanced expressiveness, the new language model maintains the high performance of the previous model and continues to generate high-quality data, at a reasonable and affordable increase in computational cost.

In conclusion, this thesis presents an innovative solution for production-like test data generation, developed through an active research approach and large-scale industrial collaboration. It advances the research community's understanding of the applicability of data-driven, machine learning based approaches as a new technique for generating rich, high-quality test data for reliable testing of large-scale and complex systems. Furthermore, our practical industrial evaluations confirm the solution's effectiveness, emphasizing its real-world applicability in complex settings.
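To make the language-modelling framing described above concrete, the following is a minimal sketch of the idea: serialize records as text, train a small recurrent language model on them, and sample new text to obtain synthetic records. Everything in the snippet (the record format, the character-level GRU architecture, the hyperparameters) is an illustrative assumption and not the implementation used in the thesis.

```python
# Minimal sketch of the language-modelling framing: serialize records as text,
# train a small character-level language model, then sample synthetic records.
# The record format, architecture, and hyperparameters below are illustrative
# assumptions, not the configuration used in the thesis.
import torch
import torch.nn as nn

# Toy "production" records; in practice these would be serialized real records.
records = ["PERSON|1985-03-12|OSLO", "PERSON|1990-11-02|BERGEN"]
text = "\n".join(records) + "\n"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.head(out), h

model = CharRNN(len(chars))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.tensor([stoi[c] for c in text]).unsqueeze(0)

# Next-character prediction: the model learns the statistical structure of the records.
for step in range(200):
    logits, _ = model(data[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(chars)), data[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Sampling from the trained model yields new, synthetic records character by character.
def sample(max_chars=60):
    with torch.no_grad():
        idx = torch.tensor([[stoi["P"]]])
        hidden, generated = None, "P"
        for _ in range(max_chars):
            logits, hidden = model(idx, hidden)
            probs = torch.softmax(logits[0, -1], dim=-1)
            idx = torch.multinomial(probs, 1).unsqueeze(0)
            generated += itos[idx.item()]
    return generated

print(sample())
```

Only the generative model would change between the RNN, VAE, and GAN variants compared in the thesis; the idea of learning a statistical model over serialized records and sampling from it stays the same.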
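The evaluation framework is described above only at a high level. The sketch below suggests what quantitative checks of this kind could look like, covering syntactic validity, semantic (business-rule) validity, and statistical representativeness. The record format, the specific rule, and the metric are assumptions chosen for illustration, not the framework actually applied to the NPR data.

```python
# Illustrative sketch of quantitative quality checks for generated test data.
# The record format, rules, and metrics are assumptions, not the thesis' framework.
import re
from collections import Counter

RECORD_PATTERN = re.compile(r"^PERSON\|\d{4}-\d{2}-\d{2}\|[A-ZÆØÅ]+$")

def syntactic_validity(generated):
    """Share of generated records that parse against the expected format."""
    return sum(bool(RECORD_PATTERN.match(r)) for r in generated) / len(generated)

def semantic_validity(generated):
    """Share of records satisfying a domain rule (here: birth year in a plausible range)."""
    def ok(record):
        try:
            year = int(record.split("|")[1][:4])
            return 1900 <= year <= 2024
        except (IndexError, ValueError):
            return False
    return sum(ok(r) for r in generated) / len(generated)

def representativeness(real, generated):
    """One minus a total-variation distance between city distributions in real vs. generated data."""
    def distribution(records):
        cities = [r.split("|")[-1] for r in records]
        return {c: n / len(cities) for c, n in Counter(cities).items()}
    p, q = distribution(real), distribution(generated)
    keys = set(p) | set(q)
    return 1 - 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

real = ["PERSON|1985-03-12|OSLO", "PERSON|1990-11-02|BERGEN"]
fake = ["PERSON|1987-05-30|OSLO", "PERSON|1889-13-99|???"]
print(syntactic_validity(fake), semantic_validity(fake), representativeness(real, fake))
```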

The project introduces an innovative method that harnesses the power of deep learning and artificial intelligence to create test data that mirrors real-world scenarios without compromising privacy or regulatory compliance. In close collaboration with the Norwegian National Population Registry (Folkeregisteret), we examined the intricate needs of cross-organizational integration testing. This research led us to propose a solution that revolutionizes the generation of production-like test data. We framed the challenge as a language modelling problem and used deep learning techniques to build statistical models from production data. These models were then integrated into our solution and generated test data that not only met the complex constraints of the domain but also underwent a thorough evaluation of quality and compliance.

To validate our approach, we put it to the test in the real world at the Population Registry, experimenting with advanced deep learning algorithms. Our results show that the solution is able to produce data that not only reflects statistical accuracy but also matches the intricate rules of the Population Registry's domain. We further proposed an enhanced method using a domain-specific language (DSL), which allows for even richer and more expressive data generation. Despite the increased complexity, our experiments showed that this new model maintained high performance, delivering top-quality data generation at a reasonable computational cost.

In conclusion, this thesis presents an innovative solution for generating realistic test data using deep learning, with practical applications that comply with privacy regulations. By pushing the boundaries of data-driven machine learning methods, we have opened new doors for reliable testing of large and complex systems across a range of industries. Our success in industrial settings confirms the practical feasibility and effectiveness of this solution.

Based on our current understanding of the state of practice from our consultancy on software testing in the public sector, we have seen a tremendous need for better test data in order to support more systematic and complete software testing. Currently, many companies use production data for testing purposes; however, this is problematic both because it is illegal (according to EU regulations) and because it does not support automated testing.

Challenges: Our hypothesis is that current tools and methods for model-based test data generation have largely been developed on the basis of small toy examples and thus cannot scale up to meet the practical needs of software testing in large-scale, complex software systems. This hypothesis will be verified on the basis of our systematic review of the current state of the art, combined with observational case studies of both the current state of practice and the actual needs identified in the public sector. Our main challenge is thus to propose practical solutions to the very hard problem of generating sufficiently complex test data that will meet the actual requirements of our case study partners. Our solution approach is iterative prototyping combined with cost-benefit analysis of applying the prototypes to testing challenges faced by our case study partners.

Use of findings: The aim of this project is to provide increased knowledge and prototype tools that can be further developed into industry-strength tools that will help industry and the public sector obtain sufficiently rich test data.

Funding scheme:

NAERINGSPH-Nærings-phd