Mohammad Reza Taesiri; Tianjun Feng; Anh Nguyen; Cor-Paul Bezemer
GlitchBench: Can large multimodal models detect video game glitches? Inproceedings
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Abstract | BibTeX | Tags: Computer games, Foundation models, Game development, Gameplay videos, LLM
@inproceedings{TaesiriCVPR2024,
title = {GlitchBench: Can large multimodal models detect video game glitches?},
author = {Mohammad Reza Taesiri and Tianjun Feng and Anh Nguyen and Cor-Paul Bezemer},
year = {2024},
date = {2024-06-15},
urldate = {2024-03-15},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/},
keywords = {Computer games, Foundation models, Game development, Gameplay videos, LLM},
pubstate = {published},
tppubtype = {inproceedings}
}
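As a rough illustration of the kind of evaluation GlitchBench enables, the sketch below sends a gameplay screenshot to a vision-capable LLM through the OpenAI API and asks it to describe anything unusual. The model name, prompt, and file name are illustrative assumptions, not the benchmark's actual protocol.

```python
# Hypothetical sketch: asking a vision-capable LLM whether a screenshot shows a glitch.
# The prompt, model name, and file name are illustrative, not GlitchBench's protocol.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_glitch(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is unusual about this video game screenshot?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ask_about_glitch("screenshot.png"))
```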
Balreet Grewal; Wentao Lu; Sarah Nadi; Cor-Paul Bezemer
Analyzing Developer Use of ChatGPT Generated Code in Open Source GitHub Projects Inproceedings
International Conference on Mining Software Repositories (MSR), 2024.
Abstract | BibTeX | Tags: Code reuse, LLM, SE4AI
@inproceedings{GrewalMSR2024,
title = {Analyzing Developer Use of ChatGPT Generated Code in Open Source GitHub Projects},
author = {Balreet Grewal and Wentao Lu and Sarah Nadi and Cor-Paul Bezemer },
year = {2024},
date = {2024-04-14},
urldate = {2024-04-14},
booktitle = {International Conference on Mining Software Repositories (MSR)},
abstract = {The rapid development of large language models such as ChatGPT has made them particularly useful to developers in generating code snippets for their projects. To understand how ChatGPT’s generated code is leveraged by developers, we conducted an empirical study of 3,044 ChatGPT-generated code snippets integrated within GitHub projects. A median of 54% of the generated lines of code is found in the project’s code, and this code typically remains unchanged once added. The modifications of the 76 code snippets that changed in a subsequent commit consisted of minor functionality changes and code reorganizations that were made within a day. Our findings offer insights that help drive the development of AI-assisted programming tools. We highlight the importance of making changes in ChatGPT code before integrating it into a project.},
keywords = {Code reuse, LLM, SE4AI},
pubstate = {published},
tppubtype = {inproceedings}
}
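As a toy illustration of the line-level matching behind the 54% figure, the sketch below computes the fraction of a snippet's non-blank lines that also appear in a project file. This is an assumption about how such a measurement could be done, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's exact pipeline): measure how many lines of a
# ChatGPT-generated snippet also appear in a project file, ignoring blank lines.
def retained_fraction(snippet: str, project_file: str) -> float:
    snippet_lines = [ln.strip() for ln in snippet.splitlines() if ln.strip()]
    with open(project_file, encoding="utf-8") as f:
        project_lines = {ln.strip() for ln in f}
    if not snippet_lines:
        return 0.0
    kept = sum(1 for ln in snippet_lines if ln in project_lines)
    return kept / len(snippet_lines)

generated = "def add(a, b):\n    return a + b\n"
print(f"{retained_fraction(generated, 'utils.py'):.0%} of generated lines retained")
```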
Mikael Sabuhi; Petr Musilek; Cor-Paul Bezemer
Micro-FL: A Fault-Tolerant Scalable Microservice Based Platform for Federated Learning Journal Article
Future Internet, 16 (3), pp. 1-19, 2024.
Abstract | BibTeX | Tags: Federated learning, Machine learning, Microservices
@article{Sabuhi_MicroFL,
title = {Micro-FL: A Fault-Tolerant Scalable Microservice Based Platform for Federated Learning},
author = {Mikael Sabuhi and Petr Musilek and Cor-Paul Bezemer },
year = {2024},
date = {2024-02-19},
journal = {Future Internet},
volume = {16},
number = {3},
pages = {1-19},
abstract = {As the number of machine learning applications increases, growing concerns about data privacy expose the limitations of traditional cloud-based machine learning methods that rely on centralized data collection and processing. Federated learning emerges as a promising alternative, offering a novel approach to training machine learning models that safeguards data privacy. Federated learning facilitates collaborative model training across various entities. In this approach, each user trains models locally and shares only the local model parameters with a central server, which then generates a global model based on these individual updates. This approach ensures data privacy since the training data itself is never directly shared with a central entity. However, existing federated machine learning frameworks are not without challenges. In terms of server design, these frameworks exhibit limited scalability with an increasing number of clients and are highly vulnerable to system faults, particularly as the central server becomes a single point of failure. This paper introduces Micro-FL, a federated learning framework that uses a microservices architecture to implement the federated learning system. It demonstrates that the framework is fault-tolerant and scalable, showing its ability to handle an increasing number of clients. A comprehensive performance evaluation confirms that Micro-FL proficiently handles component faults, enabling a smooth and uninterrupted operation.},
keywords = {Federated learning, Machine learning, Microservices},
pubstate = {published},
tppubtype = {article}
}
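For context on the learning step Micro-FL orchestrates, here is a minimal federated-averaging (FedAvg) sketch in NumPy: the server combines local model parameters into a global model, weighted by each client's sample count. It illustrates only the aggregation; the paper's contribution is the fault-tolerant microservices platform around it.

```python
# Minimal FedAvg sketch: server-side weighted averaging of client model parameters.
import numpy as np

def federated_average(client_weights, client_sizes):
    """client_weights: one list of ndarrays (layers) per client."""
    total = sum(client_sizes)
    global_weights = []
    for layer in zip(*client_weights):  # iterate over layers across clients
        stacked = np.stack([w * (n / total) for w, n in zip(layer, client_sizes)])
        global_weights.append(stacked.sum(axis=0))
    return global_weights

# Example: two clients, each holding a single 2x2 weight matrix.
w1 = [np.ones((2, 2))]
w2 = [np.zeros((2, 2))]
print(federated_average([w1, w2], client_sizes=[3, 1]))  # 0.75 everywhere
```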
Tajkia Rahman Toma; Cor-Paul Bezemer
An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications Inproceedings
3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN), pp. 1–11, 2024.
Abstract | BibTeX | Tags: Data maintenance, SE4ML
@inproceedings{TomaCAIN2024,
title = {An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications},
author = {Tajkia Rahman Toma and Cor-Paul Bezemer},
year = {2024},
date = {2024-01-17},
urldate = {2024-01-17},
booktitle = {3rd IEEE/ACM International Conference on AI Engineering - Software Engineering for AI (CAIN)},
pages = {1--11},
abstract = {Datasets and models are two key artifacts in machine learning (ML) applications. Although there exist tools to support dataset and model developers in managing ML artifacts, little is known about how these datasets and models are integrated into ML applications. In this paper, we study how datasets and models in ML applications are managed. In particular, we focus on how these artifacts are stored and versioned alongside the applications. After analyzing 93 repositories, we identified that the most common location to store datasets and models is the file system, which causes availability issues. Notably, large data and model files, exceeding approximately 60 MB, are stored exclusively in remote storage and downloaded as needed. Most of the datasets and models lack proper integration with the version control system, posing potential traceability and reproducibility issues. Additionally, although datasets and models are likely to evolve during the application development, they are rarely updated in application repositories.},
keywords = {Data maintenance, SE4ML},
pubstate = {published},
tppubtype = {inproceedings}
}
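A minimal sketch of the kind of repository check the findings suggest: flag files over roughly 60 MB and report whether Git tracks them. The threshold and the use of `git ls-files` are illustrative choices, not the paper's methodology.

```python
# Heuristic sketch inspired by the paper's ~60 MB observation: list files in a
# repository that exceed the threshold and check whether Git tracks them.
import os
import subprocess

def large_artifacts(repo, threshold_mb=60):
    tracked = set(subprocess.run(
        ["git", "-C", repo, "ls-files"],
        capture_output=True, text=True, check=True).stdout.splitlines())
    for root, _dirs, files in os.walk(repo):
        if ".git" in root.split(os.sep):
            continue
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > threshold_mb * 1024 * 1024:
                rel = os.path.relpath(path, repo)
                yield rel, rel in tracked

for path, is_tracked in large_artifacts("."):
    print(f"{path}: {'tracked' if is_tracked else 'NOT tracked'} by Git")
```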
Mohammad Reza Taesiri; Finlay Macklon; Sarra Habchi; Cor-Paul Bezemer
Searching bug instances in gameplay video repositories Journal Article
IEEE Transactions on Games, 2024.
Abstract | BibTeX | Tags: Bug report, Computer games, Game development, Gameplay videos, Gaming
@article{TaesiriTG2024,
title = {Searching bug instances in gameplay video repositories},
author = {Mohammad Reza Taesiri and Finlay Macklon and Sarra Habchi and Cor-Paul Bezemer},
year = {2024},
date = {2024-01-17},
urldate = {2024-01-17},
journal = {IEEE Transactions on Games},
abstract = {Gameplay videos offer valuable insights into player interactions and game responses, particularly data about game bugs. Despite the abundance of gameplay videos online, extracting useful information remains a challenge. This paper introduces a method for searching and extracting relevant videos from extensive video repositories using English text queries. Our approach requires no external information, like video metadata; it solely depends on video content. Leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the GamePhysics dataset, comprising 26,954 videos from 1,873 games that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple and compound queries, indicating that our method is useful for detecting objects and events in gameplay videos. Moreover, we assess the effectiveness of our method by analyzing a carefully annotated dataset of 220 gameplay videos. The results of our study demonstrate the potential of our approach for applications such as the creation of a video search tool tailored to identifying video game bugs, which could greatly benefit Quality Assurance (QA) teams in finding and reproducing bugs. The code and data used in this paper can be found at https://zenodo.org/records/10211390},
keywords = {Bug report, Computer games, Game development, Gameplay videos, Gaming},
pubstate = {published},
tppubtype = {article}
}
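The sketch below shows the core retrieval idea under stated assumptions: sample frames from a video, embed them and the text query with an off-the-shelf CLIP checkpoint, and score the video by its best-matching frame. The sampling rate, model checkpoint, and max-pooling choice are illustrative; consult the paper and linked artifacts for the actual setup.

```python
# Sketch: score a gameplay video against an English query with CLIP's zero-shot
# image-text similarity. Sampling rate and checkpoint are illustrative choices.
import cv2  # pip install opencv-python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_video(video_path: str, query: str, every_n_frames: int = 30) -> float:
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:  # sample every Nth frame
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, 1)
    return logits.max().item()  # a video is as relevant as its best frame

print(score_video("gameplay.mp4", "a horse flying in the air"))
```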
Mohammad Reza Taesiri; Giang Nguyen; Sarra Habchi; Cor-Paul Bezemer; Anh Nguyen
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification Inproceedings
NeurIPS Dataset and Benchmark track, 2023.
BibTeX | Tags: Benchmark, Computer vision, Dataset, Image classification, Machine learning
@inproceedings{TaesiriNeurIPS2023,
title = {ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification},
author = {Mohammad Reza Taesiri and Giang Nguyen and Sarra Habchi and Cor-Paul Bezemer and Anh Nguyen},
year = {2023},
date = {2023-12-07},
urldate = {2023-12-07},
booktitle = {NeurIPS Dataset and Benchmark track},
keywords = {Benchmark, Computer vision, Dataset, Image classification, Machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Markos Viggiato; Dale Paas; Cor-Paul Bezemer
Prioritizing Natural Language Test Cases Based on Highly-Used Game Features Inproceedings
Proceedings of the 31st Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 1–12, 2023.
Abstract | BibTeX | Tags: Computer games, Game development, Natural language processing, Testing
@inproceedings{ViggiatoFSE2023,
title = {Prioritizing Natural Language Test Cases Based on Highly-Used Game Features},
author = {Markos Viggiato and Dale Paas and Cor-Paul Bezemer },
year = {2023},
date = {2023-12-01},
urldate = {2023-12-01},
booktitle = {Proceedings of the 31st Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)},
pages = {1--12},
abstract = {Software testing is still a manual activity in many industries, such as the gaming industry. But manually executing tests becomes impractical as the system grows and resources are restricted, mainly in a scenario with short release cycles. Test case prioritization is a commonly used technique to optimize the test execution. However, most prioritization approaches do not work for manual test cases as they require source code information or test execution history, which is often not available in a manual testing scenario. In this paper, we propose a prioritization approach for manual test cases written in natural language based on the tested application features (in particular, highly-used application features). Our approach consists of (1) identifying the tested features from natural language test cases (with zero-shot classification techniques) and (2) prioritizing test cases based on the features that they test. We leveraged the NSGA-II genetic algorithm for the multi-objective optimization of the test case ordering to maximize the coverage of highly-used features while minimizing the cumulative execution time. Our findings show that we can successfully identify the application features covered by test cases using an ensemble of pre-trained models with strong zero-shot capabilities (an F-score of 76.1%). Also, our prioritization approaches can find test case orderings that cover highly-used application features early in the test execution while keeping the time required to execute test cases short. QA engineers can use our approach to focus the test execution on test cases that cover features that are relevant to users.},
keywords = {Computer games, Game development, Natural language processing, Testing},
pubstate = {published},
tppubtype = {inproceedings}
}
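The paper searches for orderings with the NSGA-II genetic algorithm; as a much simpler stand-in for the same trade-off, the sketch below greedily picks the test case with the highest not-yet-covered usage weight per minute of execution time. All names and numbers are invented for illustration.

```python
# Greedy ordering sketch (a simplification of the paper's NSGA-II optimization):
# repeatedly pick the test case covering the most uncovered usage weight per minute.
def greedy_order(test_cases):
    """test_cases: list of (name, minutes, {feature: usage_weight})."""
    remaining = list(test_cases)
    covered, order = set(), []
    while remaining:
        def gain(tc):
            _name, minutes, feats = tc
            new_weight = sum(w for f, w in feats.items() if f not in covered)
            return new_weight / minutes
        best = max(remaining, key=gain)
        remaining.remove(best)
        covered.update(best[2])
        order.append(best[0])
    return order

tests = [
    ("login_test", 2, {"login": 0.9}),
    ("shop_test", 5, {"shop": 0.4, "inventory": 0.3}),
    ("tutorial_test", 1, {"tutorial": 0.1}),
]
print(greedy_order(tests))  # highest usage-per-minute first
```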
Md Saeed Siddik; Cor-Paul Bezemer
Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes! Inproceedings
23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 1–12, IEEE, 2023.
Abstract | BibTeX | Tags: Computational notebooks, Empirical software engineering, Mining software repositories
@inproceedings{SiddikSCAM2023,
title = {Do Code Quality and Style Issues Differ Across (Non-)Machine Learning Notebooks? Yes!},
author = {Md Saeed Siddik and Cor-Paul Bezemer},
year = {2023},
date = {2023-10-03},
urldate = {2023-10-03},
booktitle = {23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM)},
pages = {1--12},
publisher = {IEEE},
abstract = {The popularity of computational notebooks is rapidly increasing because of their interactive code-output visualization and on-demand non-sequential code block execution. These notebook features have made notebooks especially popular with machine learning developers and data scientists. However, as prior work shows, notebooks generally contain low quality code. In this paper, we investigate whether the low quality code is inherent to the programming style in notebooks, or whether it is correlated with the use of machine learning techniques. We present a large-scale empirical analysis of 246,599 open-source notebooks to explore how machine learning code quality in Jupyter Notebooks differs from non-machine learning code, thereby focusing on code style issues. We explored code style issues across the Error, Convention, Warning, and Refactoring categories. We found that machine learning notebooks are of lower quality regarding PEP-8 code standards than non-machine learning notebooks, and their code quality distributions significantly differ with a small effect size. We identified several code style issues with large differences in occurrences between machine learning and non-machine learning notebooks. For example, package and import-related issues are more prevalent in machine learning notebooks. Our study shows that code quality and code style issues differ significantly across machine learning and non-machine learning notebooks.},
keywords = {Computational notebooks, Empirical software engineering, Mining software repositories},
pubstate = {published},
tppubtype = {inproceedings}
}
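One plausible way to reproduce such a measurement (an assumption about tooling, not the paper's exact setup) is to convert each notebook to a script with nbconvert and count Pylint messages per category.

```python
# Sketch: convert a notebook to a script and group Pylint messages by category
# (convention/warning/error/refactor). Requires jupyter and pylint on PATH.
import json
import subprocess
from collections import Counter

def lint_notebook(notebook_path: str) -> Counter:
    subprocess.run(["jupyter", "nbconvert", "--to", "script", notebook_path],
                   check=True)
    script = notebook_path.replace(".ipynb", ".py")
    result = subprocess.run(["pylint", script, "--output-format=json"],
                            capture_output=True, text=True)  # nonzero exit is normal
    messages = json.loads(result.stdout or "[]")
    return Counter(m["type"] for m in messages)

print(lint_notebook("analysis.ipynb"))  # e.g., Counter({'convention': 12, ...})
```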
Mikael Sabuhi
Strategies For Building Performant Containerized Applications PhD Thesis
2023.
Abstract | BibTeX | Tags: Docker, Docker Hub, Microservices, Performance, Performance analysis, Performance engineering
@phdthesis{phd_mikael,
title = {Strategies For Building Performant Containerized Applications},
author = {Mikael Sabuhi},
year = {2023},
date = {2023-09-25},
urldate = {2023-09-25},
abstract = {The evolution of cloud computing in the last decade has offered unprecedented access to sizable, configurable computing resources with minimal management effort. Containerization of applications, particularly through Docker, has been pivotal in this progression. As modern software increasingly relies on various cloud services, designing performant cloud applications has emerged as a critical concern. Key attributes of such applications include reliability, scalability, efficiency, fault tolerance, and responsiveness. This thesis seeks to address the challenges intrinsic to creating performant cloud applications by developing strategies aimed at achieving these characteristics through: 1) the application of autoscaling techniques to enhance scalability, efficiency, and responsiveness; 2) the introduction of a methodology for assessing the impact of Docker image upgrades on containerized applications to prevent performance degradation; and 3) the utilization of microservices architecture to develop scalable, reliable, and fault-tolerant cloud applications. In our initial research, we propose a pioneering approach to optimize the performance and resource usage of containerized cloud applications using adaptive controllers grounded in control theory. Our methodology harnesses the capacity of neural networks to capture the intrinsic non-linearity of these applications, and adapts the parameters of a proportional-integral-derivative (PID) controller to accommodate environmental changes. The outcomes demonstrate significant enhancements in resource utilization and a reduction in service level agreement violations, surpassing the performance of other examined autoscaling techniques. In the subsequent study, we present a method to evaluate the performance implications of Docker image upgrades on cloud software systems and their correlation with application dependencies. Our case study of 90 official WordPress images underscores the need for comprehensive performance testing before upgrades, the importance of maintaining a performance repository for reporting test results, and the potential benefits of extending semantic versioning to encompass performance modifications. This investigation encourages an enlightened approach to Docker image management, promoting enhanced cloud application performance. Lastly, we introduce Micro-FL, a fault-tolerant federated learning framework crafted to enhance the reliability and scalability of cloud-based machine learning platforms. By incorporating a microservices-based architecture within Docker containers, Micro-FL overcomes challenges typically associated with federated learning, such as resource constraints, scalability, and system faults. Performance assessments demonstrate Micro-FL’s capability to efficiently manage faults and streamline federated learning processes, offering a more robust and scalable solution for federated learning. The research work presented in this thesis provides deep insights, actionable recommendations, and effective and thoroughly evaluated approaches for building performant cloud applications.
},
keywords = {Docker, Docker Hub, Microservices, Performance, Performance analysis, Performance engineering},
pubstate = {published},
tppubtype = {phdthesis}
}
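As background for the control-theoretic autoscaling described above, here is a minimal discrete PID controller sketch. The thesis additionally adapts the gains with neural networks; the fixed gains and setpoint below are illustrative only.

```python
# Minimal discrete PID controller sketch for container autoscaling.
# Positive output means the measured load is above target: add replicas.
class PIDAutoscaler:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint  # e.g., target CPU utilization (0-1)
        self.integral = 0.0
        self.prev_error = 0.0

    def replicas_delta(self, measured, dt=1.0):
        error = measured - self.setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

controller = PIDAutoscaler(kp=10, ki=1, kd=0.5, setpoint=0.6)
print(round(controller.replicas_delta(measured=0.85), 2))  # positive: scale up
```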
Finlay Macklon; Markos Viggiato; Natalia Romanova; Chris Buzon; Dale Paas; Cor-Paul Bezemer
A Taxonomy of Testable HTML5 Canvas Issues Journal Article
Transactions of Software Engineering (TSE), 49 (6), pp. 3647–3659, 2023.
Abstract | BibTeX | Tags: Testing, Web applications
@article{MacklonTSE2023,
title = {A Taxonomy of Testable HTML5 Canvas Issues},
author = {Finlay Macklon and Markos Viggiato and Natalia Romanova and Chris Buzon and Dale Paas and Cor-Paul Bezemer},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
journal = {Transactions of Software Engineering (TSE)},
volume = {49},
number = {6},
pages = {3647--3659},
abstract = {The HTML5 canvas is widely used to display high quality graphics in web applications. However, the combination of web, GUI, and visual techniques that are required to build canvas applications, together with the lack of testing and debugging tools, makes developing such applications very challenging. To help direct future research on testing canvas applications, in this paper we present a taxonomy of testable canvas issues. First, we extracted 2,403 canvas-related issue reports from 123 open source GitHub projects that use the HTML5 canvas. Second, we constructed our taxonomy by manually classifying a random sample of 332 issue reports. Our manual classification identified five broad categories of testable canvas issues, such as Visual and Performance issues. We found that Visual issues are the most frequent (35%), while Performance issues are relatively infrequent (5%). We also found that many testable canvas issues that present themselves visually on the canvas are actually caused by other components of the web application. Our taxonomy of testable canvas issues can be used to steer future research into canvas issues and testing.},
keywords = {Testing, Web applications},
pubstate = {published},
tppubtype = {article}
}
Markos Viggiato
Leveraging Natural Language Processing Techniques to Improve Manual Game Testing PhD Thesis
2023.
Abstract | BibTeX | Tags: Computer games, Game development, Natural language processing, Testing
@phdthesis{ViggiatoPhD,
title = {Leveraging Natural Language Processing Techniques to Improve Manual Game Testing},
author = {Markos Viggiato },
year = {2023},
date = {2023-01-17},
urldate = {2023-01-17},
abstract = {The gaming industry has experienced a sharp growth in recent years, surpassing other popular entertainment segments, such as the film industry. With the ever-increasing scale of the gaming industry and the fact that players are extremely difficult to satisfy, it has become extremely challenging to develop a successful game. In this context, the quality of games has become a critical issue. Game testing is a widely-performed activity to ensure that games meet the desired quality criteria. However, despite recent advancements in test automation, manual game testing is still prevalent in the gaming industry, with test cases often described in natural language only and consisting of one or more test steps that must be manually performed by the Quality Assurance (QA) engineer (i.e., the tester). This makes game testing challenging and costly. Issues such as redundancy (i.e., when different test cases have the same testing objective) and incompleteness (i.e., when test cases miss one or more steps) become a bigger concern in a manual game testing scenario. In addition, as games become bigger and the number of required test cases increases, it becomes impractical to execute all test cases in a scenario with short game release cycles, for example.
Prior work proposed several approaches to analyze and improve test cases with associated source code. However, there is little research on improving manual game testing. Having higher-quality test cases and optimizing test execution help to reduce wasted developer time and allow testers to use testing resources more effectively, which makes game testing more efficient and effective. In addition, even though players are extremely difficult to satisfy, their priorities are not considered during game testing. In this thesis, we investigate how to improve manual game testing from different perspectives.
In the first part of the thesis, we investigated how we can reduce redundancy in the test suite by identifying similar natural language test cases. We evaluated several unsupervised approaches using text embedding, text similarity, and clustering techniques and showed that we can successfully identify similar test cases with a high performance. We also investigated how we can improve test case descriptions to reduce the number of unclear, ambiguous, and incomplete test cases. We proposed and evaluated an automated framework that leverages statistical and neural language models and (1) provides recommendations to improve test case descriptions, (2) recommends potentially missing steps, and (3) suggests existing similar test cases.
In the second part of the thesis, we investigated how player priorities can be included in the game testing process. We first proposed an approach to prioritize test cases that cover the game features that players use the most, which helps to avoid bugs that could affect a very large number of players. Our approach (1) identifies the game features covered by test cases using an ensemble of zero-shot techniques with a high performance and (2) optimizes the test execution based on highly-used game features covered by test cases. Finally, we investigated how sentiment classifiers perform on game reviews and what issues affect those classifiers. High-performing classifiers can be used to obtain players' sentiments about games and guide testing based on the game features that players like or dislike. We show that, while traditional sentiment classifiers do not perform well, a modern classifier (the OPT-175B Large Language Model) presents a (far) better performance. The research work presented in this thesis provides deep insights, actionable recommendations, and effective and thoroughly evaluated approaches to support QA engineers and developers to improve manual game testing.},
keywords = {Computer games, Game development, Natural language processing, Testing},
pubstate = {published},
tppubtype = {phdthesis}
}
Arthur V. Kamienski; Abram Hindle; Cor-Paul Bezemer
Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers Journal Article
Empirical Software Engineering Journal (EMSE), 28 (17), 2022.
@article{arthur2022,
title = {Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers},
author = {Arthur V. Kamienski and Abram Hindle and Cor-Paul Bezemer},
year = {2022},
date = {2022-12-08},
journal = {Empirical Software Engineering Journal (EMSE)},
volume = {28},
number = {17},
abstract = {Game development is currently the largest industry in the entertainment segment and has a high demand for skilled game developers that can produce high-quality games. To satiate this demand, game developers need resources that can provide them with the knowledge they need to learn and improve their skills. Question and Answer (Q&A) websites are one of such resources that provide a valuable source of knowledge about game development practices. However, the presence of duplicate questions on Q&A websites hinders their ability to effectively provide information for their users. While several researchers created and analyzed techniques for duplicate question detection on websites such as Stack Overflow, so far no studies have explored how well those techniques work on Q&A websites for game development. With that in mind, in this paper we analyze how we can use pre-trained and unsupervised techniques to detect duplicate questions on Q&A websites focused on game development using data extracted from the Game Development Stack Exchange and Stack Overflow. We also explore how we can leverage a small set of labelled data to improve the performance of those techniques. The pre-trained technique based on MPNet achieved the highest results in identifying duplicate questions about game development, and we could achieve a better performance when combining multiple unsupervised techniques into a single supervised model. Furthermore, the supervised models could identify duplicate questions on websites different from those they were trained on with little to no decrease in performance. Our results lay the groundwork for building better duplicate question detection systems in Q&A websites for game developers and ultimately providing game developers with a more effective Q&A community.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
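A minimal sketch of the best-performing pre-trained technique, assuming the `all-mpnet-base-v2` checkpoint from the sentence-transformers library and an illustrative 0.8 similarity threshold (the paper's configuration and data differ).

```python
# Sketch: embed question titles with an MPNet sentence encoder and flag
# high-cosine-similarity pairs as duplicate candidates.
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-mpnet-base-v2")

questions = [
    "How do I detect collisions between 2D sprites?",
    "What is the best way to check if two 2D sprites overlap?",
    "How do I serialize a saved game to disk?",
]
embeddings = model.encode(questions, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if similarities[i][j] > 0.8:  # illustrative threshold
            print(f"Possible duplicates: {questions[i]!r} / {questions[j]!r}")
```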
Finlay Macklon; Mohammad Reza Taesiri; Markos Viggiato; Stefan Antoszko; Natalia Romanova; Dale Paas; Cor-Paul Bezemer
Automatically Detecting Visual Bugs in HTML5 <canvas> Games Inproceedings
37th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2022.
BibTeX | Tags: Computer games, Game development, Gaming, Regression testing, Testing, Web applications
@inproceedings{finlay_ase2022,
title = {Automatically Detecting Visual Bugs in HTML5 <canvas> Games},
author = {Finlay Macklon and Mohammad Reza Taesiri and Markos Viggiato and Stefan Antoszko and Natalia Romanova and Dale Paas and Cor-Paul Bezemer},
year = {2022},
booktitle = {37th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
keywords = {Computer games, Game development, Gaming, Regression testing, Testing, Web applications},
pubstate = {published},
tppubtype = {inproceedings}
}
Luisa Palechor
Characterizing (un)successful open source blockchain projects and their testing practices Masters Thesis
2022.
Abstract | BibTeX | Tags: blockchain, Smart contracts, Testing
@mastersthesis{luisa2022,
title = {Characterizing (un)successful open source blockchain projects and their testing practices},
author = {Luisa Palechor},
year = {2022},
date = {2022-09-26},
urldate = {2022-09-26},
abstract = {The most well-known blockchain applications are cryptocurrencies, e.g., Ether and Bitcoin, which together have a market cap of more than 560 billion US dollars. Besides cryptocurrency applications, programmable blockchain allows the development of different applications, e.g., peer-to-peer selling of renewable energy from smart grids, digital rights management, and supply chain tracking and operation. These applications can be developed and deployed on the blockchain through smart contracts, which are small programs that run on the blockchain under particular conditions. As bugs in blockchain applications (in particular, cryptocurrencies) can have a large financial impact, it is important to ensure that these applications are well-developed and well-tested. However, the software development and testing practices of blockchain projects are currently largely unstudied. In this thesis, we study data from GitHub and CoinMarketCap to understand the characteristics of successful and unsuccessful blockchain projects and reveal the testing practices in Solidity projects, with the aim of helping developers to identify projects from which they can learn, or should contribute to. In the first part of the thesis, we study data from CoinMarketCap and GitHub to gain knowledge about the characteristics of successful and unsuccessful blockchain projects. We build a random forest classifier with 320 labelled projects and metrics from 3 dimensions (activity, popularity, and complexity). We found that a large number of stars and a project’s age can help distinguish between successful and unsuccessful projects. Additionally, we found that code cloning practices tend to be common in unsuccessful projects written in Python, C++, Java and Solidity. In the second part of the thesis, we explore how quality is addressed in blockchain applications by studying how 139 open source Solidity projects are tested. We show that core development team members are the developers who usually contribute to testing files, leaving external contributions rare. In addition, our results indicate that only functional testing is practiced among the majority of Solidity projects, with Truffle and Hardhat being the tools commonly used to test Solidity smart contracts. Moreover, security testing is a practice rarely conducted, and performance testing is not conducted at all. We finally found that audits by a third party are common in several smart contracts. Future researchers and developers can use our findings to understand what characterizes successful and unsuccessful blockchain projects and be aware of the testing practices developers conduct in open source blockchain projects.},
keywords = {blockchain, Smart contracts, Testing},
pubstate = {published},
tppubtype = {mastersthesis}
}
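A minimal sketch of the classifier described above, with synthetic feature rows standing in for the 320 labelled projects; the thesis' actual metric set spans the activity, popularity, and complexity dimensions.

```python
# Sketch: a random forest over project metrics to separate successful (1) from
# unsuccessful (0) blockchain projects. Feature values below are synthetic.
from sklearn.ensemble import RandomForestClassifier

# Each row: [stars, age_in_years, commits, contributors]
X = [
    [12000, 7.0, 5400, 210],  # successful
    [30, 0.5, 40, 2],         # unsuccessful
    [4500, 5.0, 2100, 80],    # successful
    [5, 1.0, 15, 1],          # unsuccessful
]
y = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[800, 3.0, 600, 12]]))  # predicted success label
```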
Markos Viggiato; Dale Paas; Chris Buzon; Cor-Paul Bezemer
Using Natural Language Processing Techniques to Improve Manual Test Case Descriptions Inproceedings
International Conference on Software Engineering - Software Engineering in Practice (ICSE - SEIP) Track, 2022.
@inproceedings{ViggiatoSEIP2022,
title = {Using Natural Language Processing Techniques to Improve Manual Test Case Descriptions},
author = {Markos Viggiato and Dale Paas and Chris Buzon and Cor-Paul Bezemer},
year = {2022},
date = {2022-05-08},
booktitle = {International Conference on Software Engineering - Software Engineering in Practice (ICSE - SEIP) Track},
abstract = {Despite the recent advancements in test automation, software testing often remains a manual, and costly, activity in many industries. Manual test cases, often described only in natural language, consist of one or more test steps, which are instructions that must be performed to achieve the testing objective. Having different employees specifying test cases might result in redundant, unclear, or incomplete test cases. Manually reviewing and validating newly-specified test cases is time-consuming and becomes impractical in a scenario with a large test suite. Therefore, in this paper, we propose an automated framework to automatically analyze test cases that are specified in natural language and provide actionable recommendations on how to improve the test cases. Our framework consists of configurable components and modules for analysis, which are capable of recommending improvements to the following: (1) the terminology of a new test case through language modeling, (2) potentially missing test steps for a new test case through frequent itemset and association rule mining, and (3) recommendation of similar test cases that already exist in the test suite through text embedding and clustering. We thoroughly evaluated the three modules on data from our industry partner. Our framework can provide actionable recommendations, which is an important challenge given the widespread occurrence of test cases that are described only in natural language in the software industry (in particular, the game industry).},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
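To make module (2) concrete, the sketch below mines association rules over test steps with mlxtend; a rule whose antecedent appears in a new test case but whose consequent does not can then be surfaced as a potentially missing step. The example steps and thresholds are invented for illustration.

```python
# Sketch: frequent itemset and association rule mining over test steps.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

test_cases = [
    ["launch game", "open settings", "change volume"],
    ["launch game", "open settings", "change resolution"],
    ["launch game", "start new game"],
]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(test_cases).transform(test_cases),
                      columns=encoder.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)
# e.g., {open settings} -> {launch game}: suggest "launch game" if it is absent
# from a new test case that contains "open settings".
print(rules[["antecedents", "consequents", "confidence"]])
```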
Markos Viggiato; Dale Paas; Chris Buzon; Cor-Paul Bezemer
Identifying Similar Test Cases That Are Specified in Natural Language Journal Article
Transactions of Software Engineering (TSE), 2022.
Abstract | BibTeX | Tags: Game development, Testing
@article{ViggiatoTSE2022,
title = {Identifying Similar Test Cases That Are Specified in Natural Language},
author = {Markos Viggiato and Dale Paas and Chris Buzon and Cor-Paul Bezemer},
year = {2022},
date = {2022-04-21},
urldate = {2022-04-21},
journal = {Transactions of Software Engineering (TSE)},
abstract = {Software testing is still a manual process in many industries, despite the recent improvements in automated testing techniques. As a result, test cases (which consist of one or more test steps that need to be executed manually by the tester) are often specified in natural language by different employees and many redundant test cases might exist in the test suite. This increases the (already high) cost of test execution. Manually identifying similar test cases is a time-consuming and error-prone task. Therefore, in this paper, we propose an unsupervised approach to identify similar test cases. Our approach uses a combination of text embedding, text similarity and clustering techniques to identify similar test cases. We evaluate five different text embedding techniques, two text similarity metrics, and two clustering techniques to cluster similar test steps and three techniques to identify similar test cases from the test step clusters. Through an evaluation in an industrial setting, we showed that our approach achieves a high performance to cluster test steps (an F-score of 87.39%) and identify similar test cases (an F-score of 83.47%). Furthermore, a validation with developers indicates several different practical usages of our approach (such as identifying redundant test cases), which help to reduce the manual testing effort and time.},
keywords = {Game development, Testing},
pubstate = {published},
tppubtype = {article}
}
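A compact sketch of the pipeline's core, assuming a sentence-transformers encoder and cosine-distance agglomerative clustering; the distance threshold is an illustrative value, and the paper systematically compares several embedding, similarity, and clustering choices.

```python
# Sketch: embed test steps and group them with agglomerative clustering.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

steps = [
    "Click the start button",
    "Press the start button",
    "Open the inventory screen",
    "Navigate to the inventory screen",
]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(steps)
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="cosine", linkage="average")
print(clustering.fit_predict(embeddings))  # same label = likely-similar steps
```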
Mikael Sabuhi; Petr Musilek; Cor-Paul Bezemer
Studying the Performance Risks of Upgrading Docker Hub Images: A Case Study of WordPress Inproceedings
ACM/SPEC International Conference on Performance Engineering (ICPE), 2022.
@inproceedings{SabuhiICPE2022,
title = {Studying the Performance Risks of Upgrading Docker Hub Images: A Case Study of WordPress},
author = {Mikael Sabuhi and Petr Musilek and Cor-Paul Bezemer},
year = {2022},
date = {2022-04-09},
booktitle = {ACM/SPEC International Conference on Performance Engineering (ICPE)},
abstract = {The Docker Hub repository contains Docker images of applications, which allow users to do in-place upgrades to benefit from the latest released features and security patches. However, prior work showed that upgrading a Docker image not only changes the main application, but can also change many dependencies. In this paper, we present a methodology to study the performance impact of upgrading the Docker Hub image of an application, thereby focusing on changes to dependencies. We demonstrate our methodology through a case study of 90 official images of the WordPress application. Our study shows that Docker image users should be cautious and conduct a performance test before upgrading to a newer Docker image in most cases. Our methodology can assist them to better understand the performance risks of such upgrades, and helps them to decide how thorough such a performance test should be.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
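A bare-bones sketch of the kind of before/after check the paper recommends: run the same workload against containers for the current and upgraded images and compare median latencies. The ports and the image versions in the comments are assumptions for illustration.

```python
# Sketch: compare median response times of two running container versions.
import statistics
import time
import requests  # pip install requests

def median_latency(url: str, n: int = 100) -> float:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

current = median_latency("http://localhost:8080/")   # e.g., container from the current image
upgraded = median_latency("http://localhost:8081/")  # e.g., container from the upgraded image
print(f"median latency changed by {(upgraded / current - 1) * 100:+.1f}%")
```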
Mohammad Reza Taesiri; Finlay Macklon; Cor-Paul Bezemer
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning Inproceedings
International Conference on Mining Software Repositories (MSR), 2022.
Abstract | BibTeX | Tags: Bug report, Computer games, Game development, Gameplay videos, Gaming
@inproceedings{TaesiriMSR2022,
title = {CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning},
author = {Mohammad Reza Taesiri and Finlay Macklon and Cor-Paul Bezemer},
year = {2022},
date = {2022-03-24},
urldate = {2022-03-24},
booktitle = {International Conference on Mining Software Repositories (MSR)},
abstract = {Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured fashion has still remained a big challenge. In this paper, we propose a search method that accepts any English text query as input to retrieve relevant videos from large repositories of gameplay videos. Our approach does not rely on any external information (such as video metadata); it works solely based on the content of the video. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the GamePhysics dataset consisting of 26,954 videos from 1,873 games that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple queries, compound queries, and bug queries, indicating that our approach is useful for object and event detection in gameplay videos. An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs. Please visit the following link for the code and the data: https://asgaardlab.github.io/CLIPxGamePhysics/},
keywords = {Bug report, Computer games, Game development, Gameplay videos, Gaming},
pubstate = {published},
tppubtype = {inproceedings}
}
Simon Eismann; Diego Costa; Lizhi Liao; Cor-Paul Bezemer; Weiyi Shang; André van Hoorn; Samuel Kounev
A Case Study on the Stability of Performance Tests for Serverless Applications Journal Article
Journal of Systems and Software, 2022.
Abstract | BibTeX | Tags: Performance engineering, Performance regressions, Performance testing, Serverless
@article{EismannJSS2022,
title = {A Case Study on the Stability of Performance Tests for Serverless Applications},
author = {Simon Eismann and Diego Costa and Lizhi Liao and Cor-Paul Bezemer and Weiyi Shang and André van Hoorn and Samuel Kounev},
year = {2022},
date = {2022-03-17},
urldate = {2022-03-17},
journal = {Journal of Systems and Software},
abstract = {Context. While in serverless computing, application resource management and operational concerns are generally delegated to the cloud provider, ensuring that serverless applications meet their performance requirements is still a responsibility of the developers. Performance testing is a commonly used performance assessment practice; however, it traditionally requires visibility of the resource environment.
Objective. In this study, we investigate whether performance tests of serverless applications are stable, that is, if their results are reproducible, and what implications the serverless paradigm has for performance tests.
Method. We conduct a case study where we collect two datasets of performance test results: (a) repetitions of performance tests for varying memory size and load intensities and (b) three repetitions of the same performance test every day for ten months.
Results. We find that performance tests of serverless applications are comparatively stable if conducted on the same day. However, we also observe short-term performance variations and frequent long-term performance changes.
Conclusion. Performance tests for serverless applications can be stable; however, the serverless model impacts the planning, execution, and analysis of performance tests.},
keywords = {Performance engineering, Performance regressions, Performance testing, Serverless},
pubstate = {published},
tppubtype = {article}
}
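A small sketch of one way to operationalize such a stability check (an illustration, not the study's analysis): compare the latency distributions of two test repetitions with a Mann-Whitney U test.

```python
# Sketch: test whether two repetitions of the same performance test produced
# significantly different latency distributions (alpha = 0.05 is illustrative).
from scipy.stats import mannwhitneyu

day1_latencies_ms = [102, 98, 110, 105, 99, 101, 97, 108]
day2_latencies_ms = [140, 151, 138, 149, 145, 150, 142, 147]

statistic, p_value = mannwhitneyu(day1_latencies_ms, day2_latencies_ms)
if p_value < 0.05:
    print(f"distributions differ (p={p_value:.4f}): possible performance change")
else:
    print(f"no significant difference detected (p={p_value:.4f})")
```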
Luisa Palechor; Cor-Paul Bezemer
How are Solidity smart contracts tested in open source projects? An exploratory study Inproceedings
3rd IEEE/ACM International Conference on Automation of Software Test (AST), 2022.
Abstract | BibTeX | Tags: Smart contracts, Testing
@inproceedings{PalechorAST2022,
title = {How are Solidity smart contracts tested in open source projects? An exploratory study},
author = {Luisa Palechor and Cor-Paul Bezemer},
year = {2022},
date = {2022-03-10},
urldate = {2022-03-10},
booktitle = {3rd IEEE/ACM International Conference on Automation of Software Test (AST)},
abstract = {Smart contracts are self-executing programs that are stored on the blockchain. Once a smart contract is compiled and deployed on the blockchain, it cannot be modified. Therefore, having a bug-free smart contract is vital. To ensure a bug-free smart contract, it must be tested thoroughly. However, little is known about how developers test smart contracts in practice. Our study explores 139 open source smart contract projects that are written in Solidity to investigate the state of smart contract testing from three dimensions: (1) the developers working on the tests, (2) the used testing frameworks and testnets and (3) the type of tests that are conducted. We found that mostly core developers of a project are responsible for testing the contracts. Second, developers typically use only functional testing frameworks to test a smart contract, with Truffle being the most popular one. Finally, our results show that functional testing is conducted in most of the studied projects (93%), security testing is only performed in a few projects (9.4%) and traditional performance testing is conducted in none. In addition, we found 25 projects that mentioned or published external audit reports.},
keywords = {Smart contracts, Testing},
pubstate = {published},
tppubtype = {inproceedings}
}
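A heuristic sketch of one mining step, detecting a project's testing framework from well-known configuration files; the marker file names are assumptions, and the study's actual detection procedure may differ.

```python
# Sketch: infer which testing frameworks a Solidity project uses from
# well-known configuration files in the repository root.
import os

FRAMEWORK_MARKERS = {
    "Truffle": ["truffle-config.js", "truffle.js"],
    "Hardhat": ["hardhat.config.js", "hardhat.config.ts"],
    "Foundry": ["foundry.toml"],
}

def detect_frameworks(repo_path: str) -> list:
    found = []
    for framework, markers in FRAMEWORK_MARKERS.items():
        if any(os.path.exists(os.path.join(repo_path, m)) for m in markers):
            found.append(framework)
    return found

print(detect_frameworks("."))  # e.g., ['Truffle']
```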