Professor of Computer Science at Lyon 1 University
Angela Bonifati is a Professor of Computer Science at Lyon 1 University and at the CNRS LIRIS research lab, where she leads the Database Group. She is also an Adjunct Professor at the University of Waterloo in Canada and a Senior Member of the French University Institute (IUF). Her current research interests span several aspects of data management, including graph databases, knowledge graphs, data integration and data science. She has co-authored several publications in top venues of the data management field, three of which received Best Paper Awards, as well as two books and an invited paper in ACM SIGMOD Record (2018). She is the recipient of the TCDE Impact Award 2023 and a co-recipient of an ACM Research Highlights Award 2023. She was the Program Chair of ACM SIGMOD 2022 and is currently an Associate Editor for the Proceedings of VLDB and for several other journals, including the VLDB Journal, IEEE TKDE and ACM TODS. She is the President of the EDBT Executive Board (2020-2025) and was an interim member of the ACM SIGMOD Executive Committee (2022-2023).
Abstract: Towards Quality-driven and AI-assisted Data Science
One of the key processes of data science pipelines is data preparation, which aims at cleaning and curating the data for the subsequent analytical and inference steps. Data preparation deals with the errors and conflicts introduced into the input datasets during data collection and acquisition. These errors, such as violations of business rules, typos, missing values, replicated entries and abnormal features, take different forms depending on the nature of the data, which ranges from structured data to graph-shaped data and time series. If these errors are left in the data, they can propagate to the results of data science processes and hamper their efficiency and trustworthiness.
The talk presents our latest results on enhancing the quality of querying and inference tasks in data science operating on different kinds of heterogeneous data. In particular, we focus on real-life healthcare applications and provide domain experts with AI-assisted data management techniques that can help them with their diagnoses and analyses. First, inconsistency-aware annotations can quantify the quality of structured data fed into analytical processes. These annotations are further exploited during query processing to enrich query results with inconsistency degrees. Second, feature-based similarities among time series corresponding to patients’ signals help to better identify groups of patients and to assess their risk for a particular disease. Third, violations of graph constraints can be addressed through human-guided feedback, which improves the accuracy of repair algorithms for graph-shaped data.