Massive amounts of data are being generated, especially with the rise of Internet of Things (IoT) technologies, creating new value-creation opportunities through Big Data analysis. Accordingly, Big Data analysis has become a driving factor in revolutionizing major sectors such as mobile services, finance, and scientific research. Big Data pipelines are composed of multiple orchestrated steps, or activities, that perform various data analytical tasks. They differ from business and scientific workflows in that they are dynamic, process heterogeneous data, and are executed in parallel rather than as a sequential set of scientific operators. Although many organizations recognize the significance of Big Data analysis, they still face critical challenges when incorporating data analytics into their processes. Firstly, multiple experts, ranging from technical to domain experts, need to be involved in specifying such complex pipelines. Secondly, given that IoT, Edge, and Cloud technologies are converging towards a computing continuum, pipeline steps need to be mapped dynamically to heterogeneous computing and storage resources to ensure scalability. Providing a scalable, general-purpose solution for Big Data pipelines that a broad audience can use remains an open research issue. Devising an applicable generalized solution is challenging because bottlenecks can occur at the level of an individual pipeline step, for example, when the throughput of one step is lower than that of the others. Scaling up the entire pipeline therefore does not address the scalability issues; scaling needs to be done at the individual step level. The problem is exacerbated by the fact that scalability needs to be organized and orchestrated over heterogeneous computing resources. Furthermore, scaling up individual steps introduces race conditions between step instances that attempt to process the same piece of data simultaneously.
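To make the per-step scaling problem concrete, the following Python sketch illustrates one common way to scale a single bottleneck step while avoiding the race condition described above: several instances of the step consume from a shared queue, and each item is claimed atomically by exactly one instance. The design and all names (`run_step`, `stop_step`, the squaring step) are illustrative assumptions, not part of the proposed solution.

```python
# Illustrative sketch: scale only one bottleneck step of a pipeline by
# running several worker instances against a shared input queue.
# queue.Queue.get() hands each item to exactly one worker, so scaled-up
# step instances never process the same piece of data twice.
import queue
import threading

def run_step(step_fn, in_q, out_q, n_workers):
    """Run `step_fn` as `n_workers` parallel step instances (hypothetical API)."""
    def worker():
        while True:
            item = in_q.get()          # atomic claim: no duplicate processing
            if item is None:           # poison pill: shut down this instance
                in_q.task_done()
                return
            out_q.put(step_fn(item))
            in_q.task_done()
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

def stop_step(in_q, threads):
    """Send one poison pill per instance, then wait for all to exit."""
    for _ in threads:
        in_q.put(None)
    for t in threads:
        t.join()

src, dst = queue.Queue(), queue.Queue()
# Scale only the slow step to 4 instances; the rest of the pipeline is untouched.
workers = run_step(lambda x: x * x, src, dst, n_workers=4)
for i in range(10):
    src.put(i)
src.join()                             # wait until every item has been processed
stop_step(src, workers)
results = sorted(dst.get() for _ in range(10))
```

Each of the ten inputs is processed exactly once, regardless of how many step instances run in parallel, which is the property a per-step scaling mechanism must guarantee.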
Another major challenge is achieving usability for multiple stakeholders, as most Big Data processing solutions focus on ad-hoc processing models that only trained professionals can use. However, organizations typically operate on specific software stacks, and hiring Big Data technology experts can introduce costs that are neither affordable nor practical. Even when an organization has the necessary technical personnel, data pipeline steps involve specific domain-dependent knowledge, which is possessed by the domain experts rather than by the data scientists who set up the data pipelines.
The PhD thesis aims to develop approaches and techniques that lower the technological barriers to entry for incorporating and implementing Big Data pipelines, thus making them accessible to a wider set of stakeholders regardless of the hardware infrastructure.
The PhD project researches and develops novel methods to support the lifecycle of Big Data pipeline processing, enabling their definition, model-based analysis and optimization, simulation, and deployment on top of decentralized heterogeneous infrastructures across the Cloud/Fog/Edge Continuum. To meet this aim, the thesis will create a new domain-specific language (DSL), methods, infrastructures, and software prototypes for managing Big Data pipelines, such that pipelines can be easily set up in a manner that is traceable, manageable, analyzable, and optimizable, and such that the design-time aspects of their deployment are separated from the run-time aspects.
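The separation of design-time definition from run-time deployment can be sketched as follows. This is not the thesis DSL, which is yet to be defined; it is a minimal Python-embedded illustration in which all names (`Step`, `Pipeline`, `deploy`, the resource classes) are hypothetical. The key point is that the pipeline model only declares what each step does and what class of resource it needs, while a separate run-time component decides where each step actually runs.

```python
# Hypothetical sketch: design-time pipeline model vs. run-time placement.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    requires: str          # resource class, e.g. "edge", "fog", "cloud"
    replicas: int = 1      # per-step scaling factor

@dataclass
class Pipeline:
    name: str
    steps: list = field(default_factory=list)

    def then(self, step):
        self.steps.append(step)
        return self

# Design time: declare WHAT the pipeline does and its resource requirements.
p = (Pipeline("sensor-analytics")
     .then(Step("ingest", requires="edge"))
     .then(Step("aggregate", requires="fog", replicas=4))
     .then(Step("train-model", requires="cloud")))

# Run time: a separate component decides WHERE each step is deployed,
# mapping resource classes to concrete resources of the continuum.
def deploy(pipeline, resources):
    return {s.name: resources[s.requires] for s in pipeline.steps}

placement = deploy(p, {"edge": "gateway-01", "fog": "rack-07", "cloud": "k8s-eu"})
```

Because the placement decision lives outside the pipeline model, the same definition can be redeployed on different infrastructures, and individual steps (here, `aggregate` with four replicas) can be scaled without touching the design.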