1.
Tajkia Rahman Toma; Cor-Paul Bezemer
An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications
3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN), pp. 1–11, 2024.
Tags: Data maintenance, SE4ML
@inproceedings{TomaCAIN2024,
title = {An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications},
author = {Tajkia Rahman Toma and Cor-Paul Bezemer},
year = {2024},
date = {2024-01-17},
urldate = {2024-01-17},
booktitle = {3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN)},
pages = {1--11},
abstract = {Datasets and models are two key artifacts in machine learning
(ML) applications. Although there exist tools to support dataset
and model developers in managing ML artifacts, little is known
about how these datasets and models are integrated into ML
applications. In this paper, we study how datasets and models in ML
applications are managed. In particular, we focus on how these
artifacts are stored and versioned alongside the applications. After
analyzing 93 repositories, we identified the most common storage
location to store datasets and models is the file system, which causes
availability issues. Notably, large data and model files, exceeding
approximately 60 MB, are stored exclusively in remote storage and
downloaded as needed. Most of the datasets and models lack proper
integration with the version control system, posing potential
traceability and reproducibility issues. Additionally, although datasets
and models are likely to evolve during the application development,
they are rarely updated in application repositories.},
keywords = {Data maintenance, SE4ML},
pubstate = {published},
tppubtype = {inproceedings}
}
Datasets and models are two key artifacts in machine learning
(ML) applications. Although there exist tools to support dataset
and model developers in managing ML artifacts, little is known
about how these datasets and models are integrated into ML
applications. In this paper, we study how datasets and models in ML
applications are managed. In particular, we focus on how these
artifacts are stored and versioned alongside the applications. After
analyzing 93 repositories, we identified the most common storage
location to store datasets and models is the file system, which causes
availability issues. Notably, large data and model files, exceeding
approximately 60 MB, are stored exclusively in remote storage and
downloaded as needed. Most of the datasets and models lack proper
integration with the version control system, posing potential
traceability and reproducibility issues. Additionally, although datasets
and models are likely to evolve during the application development,
they are rarely updated in application repositories.