STIPINST-Stipendiatstillinger i instituttsektoren

Massive amounts of data are being generated, especially with the rise of Internet of Things (IoT) technologies, creating new opportunities for value creation through Big Data analysis. Accordingly, Big Data analysis has been a driving factor in revolutionizing major sectors such as mobile services, finance, and scientific research. Big Data pipelines are composed of multiple orchestrated steps or activities that perform various data analytical tasks. They differ from business and scientific workflows in that they are dynamic, process heterogeneous data, and run their steps in parallel rather than as a sequential set of scientific operators. Although many organizations recognize the significance of Big Data analysis, they still face critical challenges when integrating data analytics into their processes. Firstly, multiple experts, ranging from technical to domain experts, need to be involved in specifying such complex pipelines. Secondly, because IoT, Edge and Cloud technologies are converging towards a computing continuum, pipeline steps need to be mapped dynamically to heterogeneous computing and storage resources to ensure scalability.
Providing a scalable, general-purpose solution for Big Data pipelines that a broad audience can use is an open research issue. One difficulty in devising such a solution is that bottlenecks can occur at the level of an individual pipeline step, for example when the throughput of one step is lower than that of the others. Scaling up the entire pipeline therefore does not resolve the bottleneck; scaling must instead be done at the level of individual steps. The problem is compounded by the fact that scaling has to be organized and orchestrated over heterogeneous computing resources. Furthermore, scaling up individual steps introduces race conditions between step instances that attempt to process the same piece of data simultaneously.
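
The step-level bottleneck and the race condition described above can be illustrated with a minimal sketch. It is not the project's implementation: the step names, the per-instance throughput figures, and the helper functions `replicas_per_step`, `pipeline_throughput` and `run_step_instances` are all hypothetical, and a thread-safe in-memory queue stands in for whatever coordination mechanism a real pipeline would use across heterogeneous resources.

```python
import math
import queue
import threading


def replicas_per_step(step_rates, target_rate):
    """Return how many instances each step needs to sustain target_rate.

    step_rates maps a step name to the records/second one instance handles.
    """
    return {name: math.ceil(target_rate / rate) for name, rate in step_rates.items()}


def pipeline_throughput(step_rates, replicas):
    """The slowest step (after scaling) bounds the whole pipeline."""
    return min(step_rates[name] * replicas[name] for name in step_rates)


def run_step_instances(items, instance_count):
    """Let several instances of one step drain a shared queue.

    queue.Queue hands each item to exactly one consumer, so no two
    instances process the same piece of data.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)

    processed = []           # (instance_id, item) pairs
    lock = threading.Lock()  # guards the shared result list

    def worker(instance_id):
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # nothing left to claim
            with lock:
                processed.append((instance_id, item))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(instance_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


# Hypothetical per-instance throughputs in records/second (assumed values).
rates = {"ingest": 500, "transform": 120, "store": 400}
plan = replicas_per_step(rates, target_rate=450)
```

With these assumed rates, `plan` assigns four instances to the slow `transform` step, two to `store`, and one to `ingest`, which raises the pipeline bound from 120 to 480 records/second without replicating every step equally. Calling `run_step_instances(range(100), 4)` returns each of the 100 items exactly once, because each instance claims its work from the shared queue instead of reading the input independently.
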
Another major challenge is making pipelines usable by multiple stakeholders, since most Big Data processing solutions rely on ad-hoc processing models that only trained professionals can use. Organizations typically operate on specific software stacks, and hiring experts in Big Data technology can incur costs that are unaffordable or impractical. Even if an organization has the necessary technical personnel, pipeline steps depend on domain-specific knowledge that is held by domain experts rather than by the data scientists who set up the pipelines. The PhD thesis aims to develop approaches and techniques that lower the technological barriers to adopting and implementing Big Data pipelines, thus making them accessible to a wider set of stakeholders regardless of the hardware infrastructure.

Funding scheme:

STIPINST-Stipendiatstillinger i instituttsektoren

NOK 359.2 mill. total funding in the programme period

87 projects have received funding in the programme period

1 source has financed the programme

Funding Sources

Kunnskapsdepartement (Norwegian Ministry of Education and Research)

Thematic Areas and Topics

Policy and administrative areas

Learning, school and education

Basic research

LTP3 Research communities and talents

Applied research

Portfolio: The research system

Sub-portfolio: Quality

Portfolio: Groundbreaking research

LTP3 High quality and accessibility

Stipendiatstilling 4 SINTEF (2021-2023)

Awarded: NOK 4.2 mill.

Publications from Cristin

Container-Based Data Pipelines on the Computing Continuum for Remote Patient Monitoring

The Data Value Quest: A Holistic Semantic Approach at Bosch

SIM-PIPE DryRunner: An approach for testing container-based big data pipelines and generating simulation data

Conceptualization and scalable execution of big data workflows using domain-specific languages and software containers

Locality-Aware Workflow Orchestration for Big Data

Big data workflows: Locality-aware orchestration using software containers

Big Data Pipelines on the Computing Continuum: Ecosystem and Use Cases Overview

Data Enrichment and Data Pipelines

Flexible Deployment of Big Data Pipelines on the Cloud/Edge/Fog Continuum

Big Data Pipelines in DataCloud

Big Data Pipelines on the Computing Continuum

Big Data Pipelines on the Computing Continuum

Flexible Deployment of Big Data Pipelines on the Computing Continuum

SIM-PIPE DryRunner: A tool for simulation/testing of container-based big data pipelines

Framework for Big Data Pipelines using Container Technology
