“GlitchBench: Can large multimodal models detect video game glitches?” accepted at CVPR 2024!

Mohammad Reza’s paper “GlitchBench: Can large multimodal models detect video game glitches?” was accepted for publication at the CVPR 2024 conference! Super congrats Mohammad Reza and co-author Tianjun! This was a collaboration with Dr. Anh Nguyen from Auburn University.

Abstract: “Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/.”

A preprint of the paper is available here.
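
For a sense of what such an evaluation looks like in practice, here is a minimal sketch that sends a gameplay screenshot to a vision-capable LMM through an OpenAI-style API and asks it to describe anything unusual. The model name, prompt wording, and helper function are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch of probing an LMM with a glitched screenshot; the model,
# prompt, and function name are illustrative assumptions, not GlitchBench's
# exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_about_screenshot(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model exposed through this API
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is unusual or wrong in this video game screenshot?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# print(ask_about_screenshot("glitched_frame.jpg"))
```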

“Micro-FL: A Fault-Tolerant Scalable Microservice Based Platform for Federated Learning” accepted in Future Internet!

Mikael’s paper “Micro-FL: A Fault-Tolerant Scalable Microservice Based Platform for Federated Learning” was accepted for publication in the Future Internet journal! Super congrats Mikael!

Abstract: “As the number of machine learning applications increases, growing concerns about data privacy expose the limitations of traditional cloud-based machine learning methods that rely on centralized data collection and processing. Federated learning emerges as a promising alternative, offering a novel approach to training machine learning models that safeguards data privacy. Federated learning facilitates collaborative model training across various entities. In this approach, each user trains models locally and shares only the local model parameters with a central server, which then generates a global model based on these individual updates. This approach ensures data privacy since the training data itself is never directly shared with a central entity. However, existing federated machine learning frameworks are not without challenges. In terms of server design, these frameworks exhibit limited scalability with an increasing number of clients and are highly vulnerable to system faults, particularly as the central server becomes a single point of failure. This paper introduces Micro-FL, a federated learning framework that uses a microservices architecture to implement the federated learning system. It demonstrates that the framework is fault-tolerant and scalable, showing its ability to handle an increasing number of clients. A comprehensive performance evaluation confirms that Micro-FL proficiently handles component faults, enabling a smooth and uninterrupted operation.”

A preprint of the paper is available here.
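
As a rough illustration of the federated learning workflow the abstract describes (local training, sharing only parameters, server-side aggregation), here is a minimal NumPy sketch. It shows the aggregation principle only; Micro-FL's fault-tolerant, microservice-based implementation is what the paper actually contributes.

```python
# Minimal sketch of the federated averaging idea: clients train locally and
# share only parameters; the server averages them into a global model.
# Illustrative only; not Micro-FL's implementation.
import numpy as np

def local_update(weights, client_data, lr=0.1):
    # Hypothetical local training step (least-squares gradient as a stand-in).
    X, y = client_data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, clients):
    # Each client trains locally; only the resulting parameters are shared.
    local_weights = [local_update(global_weights.copy(), data) for data in clients]
    # The server builds the global model by averaging the client updates.
    return np.mean(local_weights, axis=0)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)]
w = np.zeros(3)
for _ in range(20):
    w = federated_round(w, clients)
print(w)
```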

“Searching bug instances in gameplay video repositories” accepted in IEEE Transactions on Games!

Mohammad Reza’s paper “Searching bug instances in gameplay video repositories” was accepted for publication in IEEE Transactions on Games! Super congrats Mohammad Reza!

Abstract: “Gameplay videos offer valuable insights into player interactions and game responses, particularly data about game bugs. Despite the abundance of gameplay videos online, extracting useful information remains a challenge. This paper introduces a method for searching and extracting relevant videos from extensive video repositories using English text queries. Our approach requires no external information, like video metadata; it solely depends on video content. Leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the GamePhysics dataset, comprising 26,954 videos from 1,873 games that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple and compound queries, indicating that our method is useful for detecting objects and events in gameplay videos. Moreover, we assess the effectiveness of our method by analyzing a carefully annotated dataset of 220 gameplay videos. The results of our study demonstrate the potential of our approach for applications such as the creation of a video search tool tailored to identifying video game bugs, which could greatly benefit Quality Assurance (QA) teams in finding and reproducing bugs. The code and data used in this paper can be found at https://zenodo.org/records/10211390.”

A preprint of the paper is available here.
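
To illustrate the zero-shot retrieval idea behind the approach, here is a minimal sketch that scores extracted video frames against an English text query with a pre-trained CLIP model from Hugging Face. The checkpoint, frame handling, and function name are illustrative assumptions; the paper's pipeline, including frame sampling and aggregation across videos, is more involved.

```python
# Minimal sketch of CLIP-based zero-shot ranking of gameplay video frames
# against a text query; illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_frames(query: str, frame_paths: list[str]) -> list[tuple[str, float]]:
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image gives the similarity of each frame to the text query.
    scores = out.logits_per_image.squeeze(1).tolist()
    return sorted(zip(frame_paths, scores), key=lambda x: x[1], reverse=True)

# Example: rank extracted frames for a compound query.
# ranked = rank_frames("a horse flying in the air", ["frame_001.jpg", "frame_002.jpg"])
```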

“An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications” accepted at CAIN 2024!

Tajkia’s paper “An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications” was accepted for publication at CAIN 2024! Super congrats Tajkia!

Abstract: “Datasets and models are two key artifacts in machine learning (ML) applications. Although there exist tools to support dataset and model developers in managing ML artifacts, little is known about how these datasets and models are integrated into ML applications. In this paper, we study how datasets and models in ML applications are managed. In particular, we focus on how these artifacts are stored and versioned alongside the applications. After analyzing 93 repositories, we identified that the most common location to store datasets and models is the file system, which causes availability issues. Notably, large data and model files, exceeding approximately 60 MB, are stored exclusively in remote storage and downloaded as needed. Most of the datasets and models lack proper integration with the version control system, posing potential traceability and reproducibility issues. Additionally, although datasets and models are likely to evolve during the application development, they are rarely updated in application repositories.”

A preprint of the paper is available here.

“Analyzing Developer Use of ChatGPT Generated Code in Open Source GitHub Projects” accepted at MSR Mining Challenge 2024!

Balreet’s paper “Analyzing Developer Use of ChatGPT Generated Code in Open Source GitHub Projects” was accepted for publication at the MSR Mining Challenge 2024! Super congrats Balreet and co-author Wentao! This paper was a collaboration with Dr. Sarah Nadi from New York University Abu Dhabi.

Abstract: “The rapid development of large language models such as ChatGPT has made them particularly useful to developers in generating code snippets for their projects. To understand how ChatGPT’s generated code is leveraged by developers, we conducted an empirical study of 3,044 ChatGPT-generated code snippets integrated within GitHub projects. A median of 54% of the generated lines of code is found in the project’s code and this code typically remains unchanged once added. The modifications of the 76 code snippets that changed in a subsequent commit consisted of minor functionality changes and code reorganizations that were made within a day. Our findings offer insights that help drive the development of AI-assisted programming tools. We highlight the importance of making changes in ChatGPT code before integrating it into a project.”

A preprint of the paper is available here.

“ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification” accepted at the NeurIPS 2023 Datasets and Benchmarks track!

Mohammad Reza’s paper “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification” was accepted for publication at the Datasets and Benchmarks track of NeurIPS 2023! Super congrats Mohammad Reza and co-author Giang! This paper was a collaboration with Sarra Habchi from our industry partner Ubisoft La Forge, and with Giang Nguyen and Anh Nguyen from Auburn University.

Abstract: “Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to zoom to the most discriminative region in the image and then extract features from there to predict image labels, discarding the rest of the image. Studying six popular networks ranging from AlexNet to CLIP, we find that proper framing of the input image can lead to the correct classification of 98.91% of ImageNet images. Furthermore, we uncover positional biases in various datasets, especially a strong center bias in two popular datasets: ImageNet-A and ObjectNet. Finally, leveraging our insights into the potential of zooming, we propose a test-time augmentation (TTA) technique that improves classification accuracy by forcing models to explicitly perform zoom-in operations before making predictions. Our method is more interpretable, accurate, and faster than MEMO, a state-of-the-art (SOTA) TTA method. We introduce ImageNet-Hard, a new benchmark that challenges SOTA classifiers including large vision-language models even when optimal zooming is allowed.”

A preprint of the paper is available here.
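
The zoom-based test-time augmentation described in the abstract can be illustrated with a small sketch: classify several increasingly zoomed-in crops of an image and aggregate the predictions. The classifier, crop strategy, and aggregation below are simplifying assumptions, not the paper's exact TTA procedure.

```python
# Minimal sketch of zoom-based test-time augmentation: predict on several
# center crops (zoom levels) and average the probabilities. Illustrative only.
import torch
from PIL import Image
from torchvision import models, transforms
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def zoom_tta_predict(image: Image.Image, crop_fracs=(1.0, 0.8, 0.6)) -> int:
    w, h = image.size
    probs = []
    for frac in crop_fracs:
        cw, ch = int(w * frac), int(h * frac)
        # Center crop as a simple stand-in for searching over zoom regions.
        crop = transforms.functional.center_crop(image, [ch, cw])
        x = preprocess(crop).unsqueeze(0)
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=1))
    # Aggregate by averaging probabilities across zoom levels.
    return torch.stack(probs).mean(dim=0).argmax(dim=1).item()

# predicted_class = zoom_tta_predict(Image.open("example.jpg").convert("RGB"))
```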

“Prioritizing Natural Language Test Cases Based on Highly-Used Game Features” accepted at ESEC/FSE 2023!

Markos’ paper “Prioritizing Natural Language Test Cases Based on Highly-Used Game Features” was accepted for publication at the industry track of the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2023! Super congrats Markos! This paper was a collaboration with Dale Paas from our industry partner Prodigy Education.

Abstract: “Software testing is still a manual activity in many industries, such as the gaming industry. But manually executing tests becomes impractical as the system grows and resources are restricted, mainly in a scenario with short release cycles. Test case prioritization is a commonly used technique to optimize the test execution. However, most prioritization approaches do not work for manual test cases as they require source code information or test execution history, which is often not available in a manual testing scenario. In this paper, we propose a prioritization approach for manual test cases written in natural language based on the tested application features (in particular, highly-used application features). Our approach consists of (1) identifying the tested features from natural language test cases (with zero-shot classification techniques) and (2) prioritizing test cases based on the features that they test. We leveraged the NSGA-II genetic algorithm for the multi-objective optimization of the test case ordering to maximize the coverage of highly-used features while minimizing the cumulative execution time. Our findings show that we can successfully identify the application features covered by test cases using an ensemble of pre-trained models with strong zero-shot capabilities (an F-score of 76.1%). Also, our prioritization approaches can find test case orderings that cover highly-used application features early in the test execution while keeping the time required to execute test cases short. QA engineers can use our approach to focus the test execution on test cases that cover features that are relevant to users.”

A preprint of the paper is available here.
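
Step (1) of the approach, identifying the application features a natural-language test case covers, can be sketched with an off-the-shelf zero-shot classifier. The model and the feature labels below are illustrative assumptions; the paper uses an ensemble of pre-trained models and then optimizes the test ordering with NSGA-II.

```python
# Minimal sketch of zero-shot feature identification for a natural-language
# test case; labels and model are hypothetical, not the paper's setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

features = ["login", "battle", "shop", "tutorial", "inventory"]  # hypothetical feature set
test_case = "Verify that the player can purchase an item and see it in their bag."

result = classifier(test_case, candidate_labels=features, multi_label=True)
covered = [lbl for lbl, score in zip(result["labels"], result["scores"]) if score > 0.5]
print(covered)  # features this test case is predicted to cover
```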

“Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes!” accepted at SCAM!

Saeed’s paper “Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes!” was accepted for publication at the International Working Conference on Source Code Analysis and Manipulation (SCAM) 2023! Super congrats Saeed!

Abstract: “The popularity of computational notebooks is rapidly increasing because of their interactive code-output visualization and on-demand non-sequential code block execution. These notebook features have made notebooks especially popular with machine learning developers and data scientists. However, as prior work shows, notebooks generally contain low quality code. In this paper, we investigate whether the low quality code is inherent to the programming style in notebooks, or whether it is correlated with the use of machine learning techniques. We present a large-scale empirical analysis of 246,599 open-source notebooks to explore how machine learning code quality in Jupyter Notebooks differs from non-machine learning code, thereby focusing on code style issues. We explored code style issues across the Error, Convention, Warning, and Refactoring categories. We found that machine learning notebooks are of lower quality regarding PEP-8 code standards than non-machine learning notebooks, and their code quality distributions significantly differ with a small effect size. We identified several code style issues with large differences in occurrences between machine learning and non-machine learning notebooks. For example, package and import-related issues are more prevalent in machine learning notebooks. Our study shows that code quality and code style issues differ significantly across machine learning and non-machine learning notebooks.”

A preprint of the paper is available here.
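
The kind of per-category code style analysis the abstract mentions can be approximated with Pylint, which labels each message as an error, warning, convention, or refactoring issue. The sketch below assumes notebook code has already been exported to a plain script (for example with jupyter nbconvert); the paper's actual pipeline and tooling may differ.

```python
# Minimal sketch of tallying Pylint message categories for exported notebook
# code; paths and options are illustrative.
import json
import subprocess
from collections import Counter

def pylint_categories(script_path: str) -> Counter:
    # Pylint exits non-zero when it finds issues, so don't check the return code.
    proc = subprocess.run(
        ["pylint", "--output-format=json", script_path],
        capture_output=True, text=True,
    )
    messages = json.loads(proc.stdout or "[]")
    return Counter(msg["type"] for msg in messages)  # e.g. convention, warning, error, refactor

# Convert a notebook to a plain script first, e.g.:
#   jupyter nbconvert --to script example_notebook.ipynb
print(pylint_categories("example_notebook.py"))
```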

“A Taxonomy of Testable HTML5 Canvas Issues” accepted in TSE!

Finlay’s paper “A Taxonomy of Testable HTML5 Canvas Issues” was accepted for publication in the IEEE Transactions on Software Engineering (TSE) journal! Super congrats Finlay and co-author Markos! This paper was a collaboration with Natalia Romanova, Chris Buzon and Dale Paas from our industry partner Prodigy Education.

Abstract: “The HTML5 canvas is widely used to display high quality graphics in web applications. However, the combination of web, GUI, and visual techniques that are required to build canvas applications, together with the lack of testing and debugging tools, makes developing such applications very challenging. To help direct future research on testing canvas applications, in this paper we present a taxonomy of testable canvas issues. First, we extracted 2,403 canvas related issue reports from 123 open source GitHub projects that use the HTML5 canvas. Second, we constructed our taxonomy by manually classifying a random sample of 332 issue reports. Our manual classification identified five broad categories of testable canvas issues, such as Visual and Performance issues. We found that Visual issues are the most frequent (35%), while Performance issues are relatively infrequent (5%). We also found that many testable canvas issues that present themselves visually on the canvas are actually caused by other components of the web application. Our taxonomy of testable canvas issues can be used to steer future research into canvas issues and testing.”

See our Publications for the full paper.

“Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers” accepted for publication in EMSE!

Arthur’s paper “Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers” was accepted for publication in the Empirical Software Engineering journal! Super congrats Arthur! This was a collaboration with Dr. Abram Hindle.

Abstract: “Game development is currently the largest industry in the entertainment segment and has a high demand for skilled game developers who can produce high-quality games. To satiate this demand, game developers need resources that can provide them with the knowledge they need to learn and improve their skills. Question and Answer (Q&A) websites are one such resource that provides a valuable source of knowledge about game development practices. However, the presence of duplicate questions on Q&A websites hinders their ability to effectively provide information for their users. While several researchers created and analyzed techniques for duplicate question detection on websites such as Stack Overflow, so far no studies have explored how well those techniques work on Q&A websites for game development. With that in mind, in this paper we analyze how we can use pre-trained and unsupervised techniques to detect duplicate questions on Q&A websites focused on game development, using data extracted from the Game Development Stack Exchange and Stack Overflow. We also explore how we can leverage a small set of labelled data to improve the performance of those techniques. The pre-trained technique based on MPNet achieved the highest results in identifying duplicate questions about game development, and we could achieve a better performance when combining multiple unsupervised techniques into a single supervised model. Furthermore, the supervised models could identify duplicate questions on websites different from those they were trained on with little to no decrease in performance. Our results lay the groundwork for building better duplicate question detection systems in Q&A websites for game developers and ultimately providing game developers with a more effective Q&A community.”

See our Publications for the full paper, or access the preprint directly.
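
To give a flavour of the pre-trained MPNet-based technique the abstract highlights, here is a minimal sketch that embeds questions with a sentence-transformers MPNet checkpoint and ranks candidate duplicates by cosine similarity. The checkpoint name, example questions, and simple ranking are illustrative assumptions; the paper also combines several unsupervised techniques into a supervised model.

```python
# Minimal sketch of MPNet-based duplicate question ranking; illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

new_question = "How do I stop my character from falling through the floor in Unity?"
existing = [
    "Unity player falls through terrain after adding a Rigidbody",
    "How to implement double jump in Godot?",
    "Character clips through colliders at high speed in Unity",
]

query_emb = model.encode(new_question, convert_to_tensor=True)
corpus_emb = model.encode(existing, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]

# Rank existing questions by similarity; high scores are duplicate candidates.
for question, score in sorted(zip(existing, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {question}")
```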