Hao Li; Cor-Paul Bezemer; Ahmed E. Hassan
Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models Inproceedings
International Conference on Software Engineering - Software Engineering in Practice (ICSE - SEIP) Track, 2025.
Abstract | BibTeX | Tags: FM4SE, Foundation models, SE4AI, SE4FM, SE4ML
@inproceedings{Li_SEFM_blogs,
title = {Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models},
author = {Hao Li and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2025},
date = {2025-04-27},
booktitle = {International Conference on Software Engineering - Software Engineering in Practice (ICSE - SEIP) Track},
abstract = {Foundation models (FMs) such as large language
models (LLMs) have significantly impacted many fields, including
software engineering (SE). The interaction between SE and FMs
has led to the integration of FMs into SE practices (FM4SE)
and the application of SE methodologies to FMs (SE4FM). While
several literature surveys exist on academic contributions to these
trends, we are the first to provide a practitioner’s view. We
analyze 155 FM4SE and 997 SE4FM blog posts from leading
technology companies, leveraging an FM-powered surveying
approach to systematically label and summarize the discussed
activities and tasks. We observed that while code generation is the
most prominent FM4SE task, FMs are leveraged for many other
SE activities such as code understanding, summarization, and
API recommendation. The majority of blog posts on SE4FM are
about model deployment & operation, and system architecture
& orchestration. Although the emphasis is on cloud deployments,
there is a growing interest in compressing FMs and deploying
them on smaller devices such as edge or mobile devices. We
outline eight future research directions inspired by our gained
insights, aiming to bridge the gap between academic findings
and real-world applications. Our study not only enriches the
body of knowledge on practical applications of FM4SE and
SE4FM but also demonstrates the utility of FMs as a powerful
and efficient approach in conducting literature surveys within
technical and grey literature domains. Our dataset, results, code
and used prompts can be found in our online replication package
at https://zenodo.org/records/14563992.},
keywords = {FM4SE, Foundation models, SE4AI, SE4FM, SE4ML},
pubstate = {published},
tppubtype = {inproceedings}
}
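The FM-powered surveying approach in this paper labels blog posts with a jury of foundation models. As a rough sketch of the jury idea only (the model names, the prompt wording, and the query callback below are hypothetical placeholders, not the authors' pipeline, which is documented in the replication package), majority voting over several models could look like this:

from collections import Counter
from typing import Callable

# Hypothetical jury members; the actual models and prompts used in the paper
# are not reproduced here.
JURY = ["model-a", "model-b", "model-c"]

def label_blog_post(text: str, query: Callable[[str, str], str]) -> str:
    """Ask every jury model for a label and return the majority vote.

    `query(model, prompt)` is assumed to call some concrete FM API and return
    the model's raw text answer.
    """
    prompt = (
        "Classify this blog post as FM4SE (foundation models applied to software "
        "engineering) or SE4FM (software engineering applied to foundation models). "
        "Answer with exactly one label.\n\n" + text
    )
    votes = [query(model, prompt).strip() for model in JURY]
    label, _ = Counter(votes).most_common(1)[0]
    return label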
Tajkia Rahman Toma; Balreet Grewal; Cor-Paul Bezemer
Answering User Questions about Machine Learning Models through Standardized Model Cards Inproceedings
International Conference on Software Engineering (ICSE), 2025.
Abstract | BibTeX | Tags: Hugging Face, Q&A communities, Q&A websites, SE4AI, SE4FM, SE4ML
@inproceedings{Toma_UserQuestions,
title = {Answering User Questions about Machine Learning Models through Standardized Model Cards},
author = {Tajkia Rahman Toma and Balreet Grewal and Cor-Paul Bezemer},
year = {2025},
date = {2025-04-27},
booktitle = {International Conference on Software Engineering (ICSE)},
abstract = {Reusing pre-trained machine learning models is
becoming very popular due to model hubs such as Hugging Face
(HF). However, similar to when reusing software, many issues
may arise when reusing an ML model. In many cases, users
resort to asking questions on discussion forums such as the HF
community forum. In this paper, we study how we can reduce the
community’s workload in answering these questions and increase
the likelihood that questions receive a quick answer. We analyze
11,278 discussions from the HF model community that contain
user questions about ML models. We focus on the effort spent
handling questions, the high-level topics of discussions, and the
potential for standardizing responses in model cards based on
a model card template. Our findings indicate that there is not
much effort involved in responding to user questions, however,
40.1% of the questions remain open without any response. A
topic analysis shows that discussions are more centered around
technical details on model development and troubleshooting,
indicating that more input from model providers is required. We
show that 42.5% of the questions could have been answered if the
model provider followed a standard model card template for the
model card. Based on our analysis, we recommend that model
providers add more development-related details on the model’s
architecture, algorithm, data preprocessing and training code in
existing documentation (sub)sections and add new (sub)sections
to the template to address common questions about model usage
and hardware requirements.},
keywords = {Hugging Face, Q&A communities, Q&A websites, SE4AI, SE4FM, SE4ML},
pubstate = {published},
tppubtype = {inproceedings}
}
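The paper's recommendation maps naturally onto a fuller, standardized model card. As an illustrative sketch only (the section names below are inferred from the question topics named in the abstract; the template used in the study may differ), a model provider could generate a card covering development details, usage, and hardware requirements:

from pathlib import Path

# Illustrative model card skeleton; the section list is an assumption based on
# the question topics in the abstract, not the paper's actual template.
MODEL_CARD = """\
# Model Card: example-model

## Model Details
- Architecture: describe the network architecture here.
- Training algorithm: describe the training procedure and hyperparameters.

## Training Data and Preprocessing
Document the datasets used and how they were preprocessed.

## Training Code
Link to the training scripts or notebooks.

## How to Use
Show a minimal loading and inference snippet.

## Hardware Requirements
State the minimum GPU/CPU and memory needed for inference.
"""

Path("README.md").write_text(MODEL_CARD)  # model cards on Hugging Face live in README.md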
Hao Li; Cor-Paul Bezemer
Bridging the language gap: an empirical study of bindings for open source machine learning libraries across software package ecosystems Journal Article
Empirical Software Engineering, 30 (6), 2024.
Abstract | BibTeX | Tags: Library bindings, Machine learning, SE4AI, SE4ML
@article{li_MLbindings,
title = {Bridging the language gap: an empirical study of bindings for open source machine learning libraries across software package ecosystems},
author = {Hao Li and Cor-Paul Bezemer},
year = {2024},
date = {2024-10-18},
urldate = {2024-10-18},
journal = {Empirical Software Engineering},
volume = {30},
number = {6},
abstract = {Open source machine learning (ML) libraries enable developers to
integrate advanced ML functionality into their own applications. However,
popular ML libraries, such as TensorFlow, are not available natively in all
programming languages and software package ecosystems. Hence, developers
who wish to use an ML library which is not available in their programming
language or ecosystem of choice, may need to resort to using a so-called
binding library (or binding). Bindings provide support across programming
languages and package ecosystems for reusing a host library. For example,
the Keras.NET binding provides support for the Keras library in the NuGet
(.NET) ecosystem even though the Keras library was written in Python. In
this paper, we collect 2,436 cross-ecosystem bindings for 546 ML libraries
across 13 software package ecosystems by using an approach called BindFind,
which can automatically identify bindings and link them to their host
libraries. Furthermore, we conduct an in-depth study of 133 cross-ecosystem
bindings and their development for 40 popular open source ML libraries. Our
findings reveal that the majority of ML library bindings are maintained by
the community, with npm being the most popular ecosystem for these bindings.
Our study also indicates that most bindings cover only a limited range of
the host library’s releases, often experience considerable delays in
supporting new releases, and have widespread technical lag. Our findings
highlight key factors to consider for developers integrating bindings for
ML libraries and open avenues for researchers to further investigate
bindings in software package ecosystems.},
keywords = {Library bindings, Machine learning, SE4AI, SE4ML},
pubstate = {published},
tppubtype = {article}
}
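The release coverage and technical lag findings can be made concrete with a small sketch. Assuming, purely for illustration, that a binding mirrors the host library's version numbers (real bindings often do not) and using the public PyPI and npm registry endpoints, one could compare the versions published for a host library and a hypothetical binding package; this is not the BindFind approach from the paper:

import json
from urllib.request import urlopen

def pypi_versions(package):
    """Versions published for a package on PyPI."""
    with urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
        return set(json.load(resp)["releases"])

def npm_versions(package):
    """Versions published for a package on the npm registry."""
    with urlopen(f"https://registry.npmjs.org/{package}") as resp:
        return set(json.load(resp)["versions"])

host = pypi_versions("keras")                 # host library releases
binding = npm_versions("some-keras-binding")  # hypothetical binding package name
print(f"binding covers {len(host & binding)} of {len(host)} host releases")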
Hao Li; Gopi Krishnan Rajbahadur; Cor-Paul Bezemer
Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality Journal Article
ACM Transactions on Software Engineering and Methodology, 2024.
Abstract | BibTeX | Tags: Library bindings, Machine learning, SE4AI, SE4ML, Software quality
@article{Li_BindingsQuality,
title = {Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality},
author = {Hao Li and Gopi Krishnan Rajbahadur and Cor-Paul Bezemer},
year = {2024},
date = {2024-07-07},
journal = {ACM Transactions on Software Engineering and Methodology},
abstract = {Bindings for machine learning frameworks (such as TensorFlow and PyTorch) allow developers to integrate a framework’s functionality using a programming language different from the framework’s default language (usually Python). In this paper, we study the impact of using TensorFlow and PyTorch bindings in C#, Rust, Python and JavaScript on the software quality in terms of correctness (training and test accuracy) and time cost (training and inference time) when training and performing inference on five widely used deep learning models. Our experiments show that a model can be trained in one binding and used for inference in another binding for the same framework without losing accuracy. Our study is the first to show that using a non-default binding can help improve machine learning software quality from the time cost perspective compared to the default Python binding while still achieving the same level of correctness.},
keywords = {Library bindings, Machine learning, SE4AI, SE4ML, Software quality},
pubstate = {published},
tppubtype = {article}
}
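The cross-binding finding (train with one binding, run inference with another) rests on the frameworks' serialized model formats. A minimal sketch under that reading, and not the paper's experimental setup: train a toy model with PyTorch's default Python binding and export it as TorchScript, which LibTorch-based bindings in other languages can then load for inference.

import torch
import torch.nn as nn

# Toy model and data; the paper evaluates five widely used deep learning
# models, which are not reproduced here.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 4)
y = torch.randint(0, 2, (64,))

for _ in range(10):  # a few training steps with the default Python binding
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# Export a serialized TorchScript artifact; a non-default binding for the same
# framework (e.g., one built on LibTorch) can load "model.pt" for inference.
scripted = torch.jit.trace(model, torch.randn(1, 4))
scripted.save("model.pt")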
Tajkia Rahman Toma; Cor-Paul Bezemer
An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications Inproceedings
3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN), pp. 1–11, 2024.
Abstract | BibTeX | Tags: Data maintenance, SE4ML
@inproceedings{TomaCAIN2024,
title = {An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications},
author = {Tajkia Rahman Toma and Cor-Paul Bezemer},
year = {2024},
date = {2024-01-17},
urldate = {2024-01-17},
booktitle = {3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN)},
pages = {1--11},
abstract = {Datasets and models are two key artifacts in machine learning
(ML) applications. Although there exist tools to support dataset and model
developers in managing ML artifacts, little is known about how these
datasets and models are integrated into ML applications. In this paper, we
study how datasets and models in ML applications are managed. In
particular, we focus on how these artifacts are stored and versioned
alongside the applications. After analyzing 93 repositories, we identified
that the most common storage location to store datasets and models is the
file system, which causes availability issues. Notably, large data and
model files, exceeding approximately 60 MB, are stored exclusively in
remote storage and downloaded as needed. Most of the datasets and models
lack proper integration with the version control system, posing potential
traceability and reproducibility issues. Additionally, although datasets
and models are likely to evolve during the application development, they
are rarely updated in application repositories.},
keywords = {Data maintenance, SE4ML},
pubstate = {published},
tppubtype = {inproceedings}
}
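The "stored in remote storage and downloaded as needed" pattern the study observes amounts to lazily fetching large artifacts instead of committing them to the repository. A minimal sketch of that pattern, with a placeholder URL and path rather than anything taken from the studied repositories:

from pathlib import Path
from urllib.request import urlretrieve

MODEL_URL = "https://example.com/models/weights.bin"  # placeholder remote location
MODEL_PATH = Path("models/weights.bin")               # not tracked in version control

def ensure_model() -> Path:
    """Download the large model file on first use instead of storing it in the repo."""
    if not MODEL_PATH.exists():
        MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(MODEL_URL, str(MODEL_PATH))
    return MODEL_PATH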