“Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes!” accepted at SCAM!

Saeed’s paper “Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes!” was accepted for publication at the International Working Conference on Source Code Analysis and Manipulation (SCAM) 2023! Super congrats, Saeed!

Abstract: “The popularity of computational notebooks is rapidly increasing because of their interactive code-output visualization and on-demand non-sequential code block execution. These notebook features have made notebooks especially popular with machine learning developers and data scientists. However, as prior work shows, notebooks generally contain low-quality code. In this paper, we investigate whether the low-quality code is inherent to the programming style in notebooks, or whether it is correlated with the use of machine learning techniques. We present a large-scale empirical analysis of 246,599 open-source notebooks to explore how machine learning code quality in Jupyter Notebooks differs from non-machine learning code, focusing in particular on code style issues across the Error, Convention, Warning, and Refactoring categories. We found that machine learning notebooks are of lower quality regarding PEP-8 code standards than non-machine learning notebooks, and that their code quality distributions differ significantly, with a small effect size. We identified several code style issues with large differences in occurrence between machine learning and non-machine learning notebooks; for example, package- and import-related issues are more prevalent in machine learning notebooks. Our study shows that code quality and code style issues differ significantly across machine learning and non-machine learning notebooks.”
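
For readers curious what checking a notebook against these categories can look like in practice, here is a minimal sketch that counts style issues per category for a single notebook. It assumes Pylint, whose message categories match the Error, Convention, Warning, and Refactoring categories named in the abstract; the file name example.ipynb is a placeholder, and this is an illustration of the general idea rather than the paper’s actual analysis pipeline.

```python
"""Count Pylint messages per category for one notebook's code cells.
A minimal sketch, assuming Pylint is installed; example.ipynb is a
placeholder, not a file from the study."""
import json
import os
import subprocess
import tempfile
from collections import Counter

def style_issue_counts(notebook_path: str) -> Counter:
    # Concatenate the source of all code cells into one Python script.
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)
    code = "\n".join(
        "".join(cell["source"])
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        script = tmp.name

    # Run Pylint and parse its JSON report; message IDs start with the
    # category letter: E(rror), C(onvention), W(arning), R(efactoring).
    result = subprocess.run(
        ["pylint", "--output-format=json", script],
        capture_output=True, text=True, check=False,
    )
    os.unlink(script)
    messages = json.loads(result.stdout or "[]")
    return Counter(msg["message-id"][0] for msg in messages)

print(style_issue_counts("example.ipynb"))  # e.g. Counter({'C': 12, 'W': 3})
```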

A preprint of the paper is available here.
