About the D4 Institute
Data-driven discoveries are permeating the critical fabric of society. Unreliable discoveries lead to decisions that can have far-reaching and catastrophic consequences for society, defense, and the individual. The dependability of the data-science lifecycles that produce these discoveries and decisions is therefore a critical issue, one that requires a new holistic view and formal foundations. This project, the Dependable Data Driven Discovery (D4) Institute at Iowa State University, will advance foundational research on ensuring that data-driven discoveries are of high quality. The activities of the D4 Institute will have a transformative impact on the dependability of data-science lifecycles.
First, the problem definition itself will have a significant impact by helping future innovations beyond academia. While the notion of dependability is well studied in the computer-systems literature, challenges in data science push the boundary of existing knowledge into the unknown. The institute's work will define D4 and increase data science's benefit to society by providing a transformative theory of D4. Second, the institute will facilitate the development of a shared vocabulary; both the process and the resulting vocabulary will encourage experts across the TRIPODS disciplines and domain experts to collaborate on common goals and challenges. Third, the institute will set research directions for D4 by providing funding for foundational research, which will in turn have its own set of impacts. Fourth, the institute will facilitate transdisciplinary training of a diverse cadre of data scientists through activities such as the Midwest Big Data Summer School and the D4 workshop.
The project will advance the theoretical foundations of data science by fostering foundational research with three aims: to understand the risks to the dependability of data-science lifecycles, to formalize a rigorous mathematical basis for measures of dependability of data-science lifecycles, and to identify mechanisms for creating dependable data-science lifecycles. The project defines a risk as a cause that can lead to failures in data-driven discovery, and the data-science lifecycle as the collection of processes that plan for, acquire, manage, analyze, and infer from data. For instance, an expensive inference procedure can deliver information too late for a human operator facing a deadline (complexity as a risk); if the data-science lifecycle provides a recommendation without an accompanying uncertainty measure, a human operator has no means to determine whether to trust the recommendation (uncertainty as a risk). In contrast to recent work that has focused on fairness, accountability, and trustworthiness issues for machine learning algorithms, this project will take a holistic perspective and consider the entire data-science lifecycle. In Phase I of the project, the investigators will focus on four measures: complexity, resource constraints, uncertainty, and data freshness. The study of each measure raises foundational challenges that will require expertise from multiple TRIPODS disciplines to address.