How are Data Science Pipelines Designed in Theory and Practice? ISU Research Explains.

March 22, 2022

Data science processes are becoming an integral component of many software systems. In data-driven software, these processes are organized into stages such as data acquisition, data preprocessing, modeling, training, evaluation, and prediction, with data flowing from one stage to the next. The stages, their subtasks, their connections, and their feedback loops together form a new kind of software architecture called a data science pipeline. To design and build software systems with data science stages effectively, we must understand the structure of these pipelines. Iowa State researchers Sumon Biswas, Mohammad Wardat, and Hridesh Rajan demonstrated the importance of standardization and of an analysis framework for data science pipelines. They took the first step toward understanding the architecture and patterns of data science pipelines in theory and in practice.
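To make the idea of stages, connections, and feedback loops concrete, here is a minimal Python sketch. It is only an illustration, not a structure taken from the paper: the stage names come from the list above, while the `Pipeline` class, its `feedback` mapping, and the `successors` method are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage names follow the list in the article; the ordering is an assumption.
STAGES = [
    "data acquisition",
    "data preprocessing",
    "modeling",
    "training",
    "evaluation",
    "prediction",
]


@dataclass
class Pipeline:
    """Hypothetical model: an ordered list of stages plus feedback edges."""

    stages: List[str]
    # Maps a stage to an earlier stage it can loop back to.
    feedback: Dict[str, str] = field(default_factory=dict)

    def successors(self, stage: str) -> List[str]:
        """Return the stages that data can flow to from `stage`."""
        index = self.stages.index(stage)
        result = []
        if index + 1 < len(self.stages):
            result.append(self.stages[index + 1])  # forward flow
        if stage in self.feedback:
            result.append(self.feedback[stage])  # feedback loop
        return result


# Example: evaluation may send the work back to modeling.
pipeline = Pipeline(STAGES, feedback={"evaluation": "modeling"})
```

In this sketch, `pipeline.successors("evaluation")` lists both the next stage and the feedback target, which is the kind of connection information an architectural view of a pipeline needs to capture.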

Biswas, Wardat, and Rajan conducted a three-pronged study, drawing observations from pipelines in the literature and popular press, in smaller data science tasks, and in large projects. They investigated how pipeline structure is represented, how pipelines are organized, and what characterizes them. What are the typical stages of a data science pipeline? How are they connected? Do pipelines in theory differ from those in practice? Today we do not fully understand these architectural characteristics. The study resulted in three representative pipeline structures. The work also informs the terminology and design criteria for pipelines. For example, a number of stages from theory are absent from the pipelines in small data science programs, which lack a clear separation of stages. Large data science projects, on the other hand, develop complex pipelines with feedback loops and sub-pipelines. Stage boundaries are stricter in large projects, which is necessary for scalability, maintenance, and testing of pipelines. The results will help pipeline architects, practitioners, and software engineering teams compare their own pipelines with existing and representative ones. For instance, a data scientist can identify early in the development lifecycle whether a pipeline is missing an important stage or feedback loop, which will save much time and effort.
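The comparison described above, checking a pipeline against a representative one to spot missing stages, could be sketched roughly as follows. This is an assumption-laden illustration: `REFERENCE_STAGES` simply reuses the stage names mentioned earlier in this article and is not one of the three representative pipelines from the study, and `missing_stages` is a hypothetical helper.

```python
from typing import List

# Assumed reference list, based only on the stages named in this article.
REFERENCE_STAGES = [
    "data acquisition",
    "data preprocessing",
    "modeling",
    "training",
    "evaluation",
    "prediction",
]


def missing_stages(pipeline_stages: List[str],
                   reference: List[str] = REFERENCE_STAGES) -> List[str]:
    """Return reference stages absent from the pipeline, in reference order."""
    present = set(pipeline_stages)
    return [stage for stage in reference if stage not in present]


# Example: a small program that skips several stages.
small_program = ["data acquisition", "modeling", "prediction"]
gaps = missing_stages(small_program)
```

Here `gaps` holds the stages the small program lacks relative to the assumed reference, which a team could review before the project grows.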

The paper’s abstract is as follows:

Increasingly larger numbers of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling, and so on are referred to as data science pipelines. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do the pipelines differ in the theoretical representations and that in the practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer this for the state-of-the-art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.

 

Biswas and Rajan will present the results of the paper, entitled “The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large”, in the research track of the 44th International Conference on Software Engineering (ICSE 2022), to be held in Pittsburgh, PA, USA from May 21-29, 2022. ICSE is the premier software engineering conference. Since 1975, ICSE has provided a forum where researchers, practitioners, and educators gather to present and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering.

The project has been supported in part by the D4 (Dependable Data-Driven Discovery) Institute, a National Science Foundation TRIPODS initiative at ISU. The D4 Institute has the broader goal of increasing dependability in data science pipelines by addressing critical properties of pipelines such as fairness, complexity, uncertainty, and more. As Biswas said, “To better understand a certain property in different stages and inform design decisions, it is necessary to study the architectural patterns of the data science pipelines in the first place.” The authors have also expressed their excitement about the acceptance of the paper at a top venue in the field. They are continuing the research to explore new avenues and to assure specific properties of data science pipelines. The preprint of the paper is available at the Laboratory of Software Design website: https://design.cs.iastate.edu/papers/ICSE-22a.

 


Sumon Biswas


Mohammad Wardat


Hridesh Rajan

