Sara Gholami; Alireza Goli; Cor-Paul Bezemer; Hamzeh Khazaei
A Framework for Satisfying the Performance Requirements of Containerized Software Systems Through Multi-Versioning Inproceedings
ACM/SPEC International Conference on Performance Engineering (ICPE), pp. 1–11, 2019.
@inproceedings{Sara20,
title = {A Framework for Satisfying the Performance Requirements of Containerized Software Systems Through Multi-Versioning},
author = {Sara Gholami and Alireza Goli and Cor-Paul Bezemer and Hamzeh Khazaei},
year = {2019},
date = {2019-12-13},
urldate = {2019-12-13},
booktitle = {ACM/SPEC International Conference on Performance Engineering (ICPE)},
pages = {1--11},
abstract = {With the increasing popularity and complexity of containerized software systems, satisfying the performance requirements of these systems becomes more challenging as well. While a common remedy to this problem is to increase the allocated amount of resources by scaling up or out, this remedy is not necessarily cost-effective and therefore often problematic for smaller companies.
In this paper, we study an alternative, more cost-effective approach for satisfying the performance requirements of containerized software systems. In particular, we investigate how we can satisfy such requirements by applying software multi-versioning to the system’s resource-heavy containers. We present DockerMV, an open source extension of the Docker framework, to support multi-versioning of containerized software systems. We demonstrate the efficacy of multi-versioning for satisfying the performance requirements of containerized software systems through experiments on the TeaStore, a microservice reference test application, and Znn, a containerized news portal. Our DockerMV extension can be used by software developers to introduce multi-versioning in their own containerized software systems, thereby better allowing them to meet the performance requirements of their systems.},
keywords = {Containerized Software Systems, Performance Requirements, Software Multi-versioning},
pubstate = {published},
tppubtype = {inproceedings}
}
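To make the multi-versioning idea concrete, here is a minimal Python sketch of the weighted traffic splitting it relies on, in which requests for a resource-heavy service are spread over two versions of its container. The service names, addresses and weights below are invented for illustration; DockerMV itself extends the Docker framework and is not reproduced here.

import random

# Hypothetical endpoints for two versions of the same containerized service.
# In a multi-versioned deployment, a resource-heavy container runs alongside a
# lighter variant and incoming traffic is split between them.
VERSIONS = [
    {"name": "recommender:full",  "url": "http://10.0.0.2:8080", "weight": 0.3},
    {"name": "recommender:light", "url": "http://10.0.0.3:8080", "weight": 0.7},
]

def pick_version():
    """Pick a service version at random, proportionally to its weight."""
    r = random.random()
    cumulative = 0.0
    for version in VERSIONS:
        cumulative += version["weight"]
        if r <= cumulative:
            return version
    return VERSIONS[-1]

# Route 1,000 simulated requests and report the observed split.
counts = {v["name"]: 0 for v in VERSIONS}
for _ in range(1000):
    counts[pick_version()["name"]] += 1
print(counts)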
Daniel Lee; Dayi Lin; Cor-Paul Bezemer; Ahmed E. Hassan
Building the Perfect Game - An Empirical Study of Game Modifications Journal Article
Empirical Software Engineering Journal (EMSE), 2019.
@article{Daniel2019nexusmods,
title = {Building the Perfect Game - An Empirical Study of Game Modifications},
author = {Daniel Lee and Dayi Lin and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2019},
date = {2019-10-17},
urldate = {2019-10-17},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {Prior work has shown that gamer loyalty is important for the sales of a developer’s future games. Therefore, it is important for game developers to increase the longevity of their games. However, game developers cannot always meet the growing and changing needs of the gaming community, due to the often already overloaded schedules of developers. So-called modders can potentially assist game developers with addressing gamers’ needs. Modders are enthusiasts who provide modifications or completely new content for a game. By supporting modders, game developers can meet the rapidly growing and varying needs of their gamer base. Modders have the potential to play a role in extending the life expectancy of a game, thereby saving game developers time and money, and leading to a better overall gaming experience for their gamer base.
In this paper, we empirically study the metadata of 9,521 mods that were extracted from the Nexus Mods distribution platform. The Nexus Mods distribution platform was one of the largest mod distribution platforms for PC games at the time of our study. The goal of our paper is to provide useful insights about mods on the Nexus Mods distribution platform from a quantitative perspective, and to provide researchers a solid foundation to further explore game mods. To better understand the potential of mods to extend the longevity of a game we study their characteristics, and we study their release schedules and post-release support (in terms of bug reports) as a proxy for the willingness of the modding community to contribute to a game. We find that providing official support for mods can be beneficial for the perceived quality of the mods of a game: mods for games for which a modding tool is provided by the original game developer have a higher median endorsement ratio than mods for games that do not have such a tool. In addition, mod users are willing to submit bug reports for a mod. However, they often fail to do this in a systematic manner using the bug reporting tool of the Nexus Mods platform, resulting in low-quality bug reports which are difficult to resolve.
Our findings give the first insights into the characteristics, release schedule and post-release support of game mods. Our findings show that some games have a very active modding community, which contributes to those games through mods. Based on our findings, we recommend that game developers who desire an active modding community for their own games provide the modding community with an officially supported modding tool. In addition, we recommend that mod distribution platforms, such as Nexus Mods, improve their bug reporting system to receive higher quality bug reports.},
keywords = {Game development, Gaming, Nexus mod},
pubstate = {published},
tppubtype = {article}
}
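The endorsement-ratio comparison above is a simple per-mod statistic; the following Python sketch shows one way to compute it, using invented mod records (the study mined comparable metadata from Nexus Mods).

import statistics

# Invented records: (game, endorsements, unique_downloads, has_official_tool).
mods = [
    ("GameA", 950, 12000, True),
    ("GameA", 120,  3000, True),
    ("GameB",  40,  5000, False),
    ("GameB",  15,  2500, False),
]

def endorsement_ratio(endorsements, downloads):
    """Endorsements per unique download: the per-mod quality proxy used above."""
    return endorsements / downloads if downloads else 0.0

with_tool = [endorsement_ratio(e, d) for _, e, d, tool in mods if tool]
without_tool = [endorsement_ratio(e, d) for _, e, d, tool in mods if not tool]
print("median ratio, official tool:   ", statistics.median(with_tool))
print("median ratio, no official tool:", statistics.median(without_tool))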
Md. Ahasanuzzaman; Safwat Hassan; Cor-Paul Bezemer; Ahmed E. Hassan
A Longitudinal Study of Popular Ad Libraries in the Google Play Store Journal Article
Empirical Software Engineering Journal (EMSE), 2019.
@article{Ahsan2019adlibraries,
title = {A Longitudinal Study of Popular Ad Libraries in the Google Play Store},
author = {Md. Ahasanuzzaman and Safwat Hassan and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2019},
date = {2019-08-08},
urldate = {2019-08-08},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {In-app advertisements have become an integral part of the revenue model of mobile apps. To gain ad revenue, app developers integrate ad libraries into their apps. Such libraries are integrated to serve advertisements (ads) to users; developers then gain revenue based on the displayed ads and the users’ interactions with such ads. As a result, ad libraries have become an essential part of the mobile app ecosystem. However, little is known about how such ad libraries have evolved over time.
In this paper, we study the evolution of the 8 most popular ad libraries (e.g., Google AdMob and Facebook Audience Network) over a period of 33 months (from April 2016 until December 2018). In particular, we look at their evolution in terms of size, the main drivers for releasing a new version, and their architecture. To identify popular ad libraries, we collect 35,462 updates of 1,840 top free-to-download apps in the Google Play Store. Then, we identify 63 ad libraries that are integrated into the studied popular apps. We observe that an ad library represents 10% of the binary size of mobile apps, and that the proportion of the ad library size compared to the app size has grown by 10% over our study period. By taking a closer look at the 8 most popular ad libraries, we find that ad libraries are continuously evolving with a median release interval of 34 days. In addition, we observe that some libraries have grown exponentially in size (e.g., Facebook Audience Network), while other libraries have attempted to reduce their size as they evolved. The libraries that reduced their size have done so through: (1) creating a lighter version of the ad library, (2) removing parts of the ad library, and (3) redesigning their architecture into a more modular one.
To identify the main drivers for releasing a new version, we manually analyze the release notes of the eight studied ad libraries. We observe that fixing issues that are related to displaying video ads is the main driver for releasing new versions. We also observe that ad library developers are constantly updating their libraries to support a wider range of Android platforms (i.e., to ensure that more devices can use the libraries without errors). Finally, we derive a reference architecture from the studied eight ad libraries, and we study how these libraries deviated from this architecture in the study period.
Our study is important for ad library developers as it provides the first in-depth look into how the important mobile app market segment of ad libraries has evolved. Our findings and the reference architecture are valuable for ad library developers who wish to learn about how other developers built and evolved their successful ad libraries. For example, our reference architecture provides a new ad library developer with a foundation for understanding the interactions between the most important components of an ad library.},
keywords = {Ad library, Android mobile apps, Google Play Store, Longitudinal study, Software engineering},
pubstate = {published},
tppubtype = {article}
}
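The two headline measurements, release intervals and the ad-library share of an app's binary size, are easy to picture; below is a small Python sketch over invented dates and sizes (the study mined 33 months of real release histories).

from datetime import date
from statistics import median

# Invented release dates of one ad library; the study reports a median
# release interval of 34 days across the eight libraries.
releases = [date(2016, 4, 1), date(2016, 5, 6), date(2016, 6, 10), date(2016, 7, 12)]
intervals = [(later - earlier).days for earlier, later in zip(releases, releases[1:])]
print("median release interval (days):", median(intervals))

# The ad-library share of an app's binary size can be tracked the same way.
app_size_kb, ad_lib_kb = 42_000, 4_200
print("ad library share of app size: {:.0%}".format(ad_lib_kb / app_size_kb))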
Diego Costa; Cor-Paul Bezemer; Philipp Leitner; Artur Andrzejak
What's Wrong With My Benchmark Results? Studying Bad Practices in JMH Benchmarks Journal Article
IEEE Transactions on Software Engineering (TSE), 2019.
@article{diego_tse,
title = {What's Wrong With My Benchmark Results? Studying Bad Practices in JMH Benchmarks},
author = {Diego Costa and Cor-Paul Bezemer and Philipp Leitner and Artur Andrzejak},
year = {2019},
date = {2019-06-17},
urldate = {2019-06-17},
journal = {IEEE Transactions on Software Engineering (TSE)},
publisher = {IEEE},
abstract = {Microbenchmarking frameworks, such as Java’s Microbenchmark Harness (JMH), allow developers to write fine-grained performance test suites at the method or statement level. However, due to the complexities of the Java Virtual Machine, developers often struggle with writing expressive JMH benchmarks which accurately represent the performance of such methods or statements. In this paper, we empirically study bad practices of JMH benchmarks. We present a tool that leverages static analysis to identify 5 bad JMH practices. Our empirical study of 123 open source Java-based systems shows that each of these 5 bad practices is prevalent in open source software. Further, we conduct several experiments to quantify the impact of each bad practice in multiple case studies, and find that bad practices often significantly impact the benchmark results. To validate our experimental results, we constructed seven patches that fix the identified bad practices for six of the studied open source projects, of which six were merged into the main branch of the project. In this paper, we show that developers struggle with accurate Java microbenchmarking, and provide several recommendations to developers of microbenchmarking frameworks on how to improve future versions of their framework.},
keywords = {Bad practices, JMH, Microbenchmarking, Performance testing, Static analysis},
pubstate = {published},
tppubtype = {article}
}
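To give a flavour of what such a static check can look like, here is a small regex-based Python sketch that flags one suspicious pattern in JMH code: a @Benchmark method that returns void and takes no Blackhole parameter, so the JVM may dead-code-eliminate the computation being measured. This sketch is illustrative only and is not the authors' tool, which detects five bad practices.

import re

# Match a @Benchmark method header and capture its return type and parameters.
BENCHMARK = re.compile(
    r"@Benchmark\s+public\s+(?P<ret>\S+)\s+\w+\s*\((?P<params>[^)]*)\)"
)

def flag_ignored_return(java_source):
    """Flag @Benchmark methods that return void without consuming a Blackhole."""
    warnings = []
    for m in BENCHMARK.finditer(java_source):
        if m.group("ret") == "void" and "Blackhole" not in m.group("params"):
            warnings.append("possible dead-code elimination: " + m.group(0).splitlines()[-1])
    return warnings

src = """
@Benchmark
public void badSum() { int s = 0; for (int i = 0; i < 100; i++) s += i; }
"""
print(flag_ignored_return(src))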
Jiayuan Zhou; Shaowei Wang; Cor-Paul Bezemer; Ahmed E. Hassan
Bounties on Technical Q&A Sites: A Case Study of Stack Overflow Bounties Journal Article
Empirical Software Engineering Journal (EMSE), 2019.
@article{Zhou2019sobounties,
title = {Bounties on Technical Q&A Sites: A Case Study of Stack Overflow Bounties},
author = {Jiayuan Zhou and Shaowei Wang and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2019},
date = {2019-06-12},
urldate = {2019-06-12},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {Technical question and answer (Q&A) websites provide a platform for developers to communicate with each other by asking and answering questions. Stack Overflow is the most prominent of such websites. With the rapidly increasing number of questions on Stack Overflow, it is becoming difficult to get an answer to all questions and as a result, millions of questions on Stack Overflow remain unsolved. In an attempt to improve the visibility of unsolved questions, Stack Overflow introduced a bounty system to motivate users to solve such questions. In this bounty system, users can offer reputation points in an effort to encourage users to answer their question.
In this paper, we study 129,202 bounty questions that were proposed by 61,824 bounty backers. We observe that bounty questions have a higher solving-likelihood than non-bounty questions. This is particularly true for longstanding unsolved questions. For example, questions that were unsolved for 100 days for which a bounty is proposed are more likely to be solved (55%) than those without bounties (1.7%).
In addition, we studied the factors that are important for the solving-likelihood and solving-time of a bounty question. We found that: (1) Questions are likely to attract more traffic after receiving a bounty than non-bounty questions. (2) Bounties work particularly well in very large communities with a relatively low question solving-likelihood. (3) High-valued bounties are associated with a higher solving-likelihood, but we did not observe an association with expedited solutions.
Our study shows that while bounties are not a silver bullet for getting a question solved, they are associated with a higher solving-likelihood of a question in most cases. As questions that are still unsolved after two days hardly receive any traffic, we recommend that Stack Overflow users propose a bounty as soon as possible after those two days for the bounty to have the highest impact.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
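The headline comparison is a plain conditional frequency; here is a minimal Python sketch over invented question records.

# Invented records: (bounty_offered, solved).
questions = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

def solving_likelihood(records, with_bounty):
    """Fraction of solved questions within the bounty or non-bounty group."""
    group = [solved for bounty, solved in records if bounty == with_bounty]
    return sum(group) / len(group) if group else 0.0

print("bounty questions solved:     {:.0%}".format(solving_likelihood(questions, True)))
print("non-bounty questions solved: {:.0%}".format(solving_likelihood(questions, False)))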
Masanari Kondo; Cor-Paul Bezemer; Yasutaka Kamei; Ahmed E. Hassan; Osamu Mizuno
The Impact of Feature Reduction Techniques on Defect Prediction Models Journal Article
Empirical Software Engineering Journal (EMSE), 2019.
@article{Kondo2020featurereduction,
title = {The Impact of Feature Reduction Techniques on Defect Prediction Models},
author = {Masanari Kondo and Cor-Paul Bezemer and Yasutaka Kamei and Ahmed E. Hassan and Osamu Mizuno},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features that are used in a defect prediction model small. For example, using a small number of features avoids the problem of multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and reduction techniques can help to reduce the number of features in a model. Feature selection techniques reduce the number of features in a model by selecting the most important ones, while feature reduction techniques reduce the number of features by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, there do not exist large-scale studies in which the impact of multiple feature reduction techniques on defect prediction is investigated.
In this paper, we study the impact of eight feature reduction techniques on the performance and the variance in performance of five supervised learning and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with the impact of the two best-performing feature selection techniques (according to prior work).
The following findings are the highlights of our study: (1) The studied correlation and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while feature reduction techniques using neural network-based techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected/generated features perform better than those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that have a small variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
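The workflow of the study, reduce the original features and feed the new features to a defect prediction model, can be sketched with scikit-learn as follows. PCA stands in here for the eight studied reduction techniques (which also included neural network-based reducers such as restricted Boltzmann machines and autoencoders), and the data is synthetic.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a defect dataset: 20 software metrics, binary label.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce the 20 metrics to 5 combined features, then fit a supervised model,
# and evaluate with AUC as in the study.
model = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))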
Shahnaz M. Shariff; Heng Li; Cor-Paul Bezemer; Ahmed E. Hassan; Thanh H. D. Nguyen; Parminder Flora
Improving the Testing Efficiency of Selenium-based Load Tests Inproceedings
14th IEEE/ACM International Workshop on Automation of Software Test (AST), pp. 1–7, IEEE/ACM, 2019.
@inproceedings{shahnaz19,
title = {Improving the Testing Efficiency of Selenium-based Load Tests},
author = {Shahnaz M. Shariff and Heng Li and Cor-Paul Bezemer and Ahmed E. Hassan and Thanh H. D. Nguyen and Parminder Flora},
year = {2019},
date = {2019-05-27},
urldate = {2019-05-27},
booktitle = {14th IEEE/ACM International Workshop on Automation of Software Test (AST)},
pages = {1--7},
publisher = {IEEE/ACM},
abstract = {Web applications must be load tested to analyze their behavior under various load conditions. Typically, these load tests are automated using protocol-level HTTP requests (e.g., using JMETER). However, there are several disadvantages to using protocol-level requests for load tests. For example, protocol-level requests are only partially representative of the true usage of a web application, as the web application is not actually executed in a browser. It can be difficult to abstract complex behavior, such as a login sequence, into requests without executing the application. Browser-based load testing can be used as an alternative to protocol-level requests. Using a browser-based testing framework, such as SELENIUM, tests can be executed more realistically — inside a browser. Unfortunately, because a browser instance must be started to conduct a test, browser-based testing has a high performance overhead which limits its applicability for load tests. In this paper, we propose an approach for reducing the performance overhead of running SELENIUM-based load tests. Our approach shares browser instances between test user instances, thereby reducing the performance overhead that is introduced by launching many browser instances during the execution of a test. Our experimental results show that our approach can significantly increase the number of user instances that can be tested on a test machine without overloading the load driver. Our approach and the experiences that we share in this paper can help software practitioners improve the efficiency of their own SELENIUM-based load tests.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
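One way to picture the browser-instance sharing is a small pool from which simulated users borrow a browser instead of each launching their own. The Python sketch below uses the Selenium bindings; the pool size, user count and page under test are invented, and this is a simplified stand-in for the paper's approach rather than its exact mechanism.

import queue
import threading

from selenium import webdriver

POOL_SIZE = 2  # shared browser instances
USERS = 6      # simulated test users

# Fill the pool; assumes a chromedriver is available on the PATH.
browsers = queue.Queue()
for _ in range(POOL_SIZE):
    browsers.put(webdriver.Chrome())

def user_scenario(user_id):
    driver = browsers.get()  # blocks until a shared browser is free
    try:
        driver.get("https://example.com/")  # hypothetical page under test
        # ... exercise the workload of this simulated user here ...
    finally:
        browsers.put(driver)  # hand the browser to the next user

threads = [threading.Thread(target=user_scenario, args=(i,)) for i in range(USERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

while not browsers.empty():
    browsers.get().quit()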
Dayi Lin; Cor-Paul Bezemer; Ahmed E. Hassan
Identifying Gameplay Videos that Exhibit Bugs in Computer Games Journal Article
Empirical Software Engineering Journal (EMSE), 2019.
@article{Lin2019videos,
title = {Identifying Gameplay Videos that Exhibit Bugs in Computer Games},
author = {Dayi Lin and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2019},
date = {2019-05-21},
urldate = {2019-05-21},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {With the rapidly growing market and competition in the gaming industry, it is challenging to develop a successful game, making the quality of games very important. To improve the quality of games, developers commonly use gamer-submitted bug reports to locate bugs in games. Recently, gameplay videos have become popular in the gaming community. A few of these videos showcase a bug, offering developers a new opportunity to collect context-rich bug information.
In this paper, we investigate whether videos that showcase a bug can automatically be identified from the metadata of gameplay videos that are readily available online. Such bug videos could then be used as a supplemental source of bug information for game developers. We studied the number of gameplay videos on the Steam platform, one of the most popular digital game distribution platforms, and the difficulty of identifying bug videos from these gameplay videos. We show that naïve approaches such as using keywords to search for bug videos are time-consuming and imprecise. We propose an approach which uses a random forest classifier to rank gameplay videos based on their likelihood of being a bug video. Our proposed approach achieves a precision that is 43% higher than that of the naïve keyword searching approach on a manually labelled dataset of 96 videos. In addition, by evaluating 1,400 videos that are identified by our approach as bug videos, we calculated that our approach has both a mean average precision at 10 and a mean average precision at 100 of 0.91. Our study demonstrates that it is feasible to automatically identify gameplay videos that showcase a bug.},
keywords = {Bug report, Computer games, Gameplay videos, Steam},
pubstate = {published},
tppubtype = {article}
}
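A minimal sketch of the ranking step with scikit-learn follows; the metadata features and labels are invented (the study derived its features from real gameplay-video metadata and evaluated the resulting ranking with mean average precision).

from sklearn.ensemble import RandomForestClassifier

# Invented features per video, e.g. [title mentions "bug" or "glitch",
# video length in minutes, number of dislikes]; labels mark known bug videos.
X_train = [[1, 2, 30], [0, 45, 2], [1, 5, 12], [0, 60, 1]]
y_train = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rank unseen videos by their predicted likelihood of showcasing a bug.
candidates = [[1, 3, 25], [0, 50, 0]]
scores = clf.predict_proba(candidates)[:, 1]
print(sorted(zip(scores, ["video_a", "video_b"]), reverse=True))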
Yuhao Wu; Shaowei Wang; Cor-Paul Bezemer; Katsuro Inoue
How Do Developers Utilize Source Code from Stack Overflow? Journal Article
The Empirical Software Engineering Journal (EMSE), 24 (2), pp. 637–673, 2019.
@article{frank17reuse,
title = {How Do Developers Utilize Source Code from Stack Overflow?},
author = {Yuhao Wu and Shaowei Wang and Cor-Paul Bezemer and Katsuro Inoue},
year = {2019},
date = {2019-04-15},
urldate = {2019-04-15},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {24},
number = {2},
pages = {637--673},
publisher = {Springer},
abstract = {Technical question and answer (Q&A) platforms, such as Stack Overflow, provide a platform for users to ask and answer questions about a wide variety of programming topics. These platforms accumulate a large amount of knowledge, including hundreds of thousands of lines of source code. Developers can benefit from the source code that is attached to the questions and answers on Q&A platforms by copying or learning from (parts of) it. By understanding how developers utilize source code from Q&A platforms, we can provide insights for researchers which can be used to improve next-generation Q&A platforms to help developers reuse source code quickly and easily. In this paper, we first conduct an exploratory study on 289 files from 182 open-source projects, which contain source code that has an explicit reference to a Stack Overflow post. Our goal is to understand how developers utilize code from Q&A platforms and to reveal barriers that may make code reuse more difficult. In 31.5% of the studied files, developers needed to modify source code from Stack Overflow to make it work in their own projects. The degree of required modification varied from simply renaming variables to rewriting the whole algorithm. Developers sometimes chose to implement an algorithm from scratch based on the descriptions from Stack Overflow answers, even if there was an implementation readily available in the post. In 35.5% of the studied files, developers used Stack Overflow posts as an information source for later reference. To further understand the barriers of reusing code and to obtain suggestions for improving the code reuse process on Q&A platforms, we conducted a survey with 453 open-source developers who are also on Stack Overflow. We found that the top 3 barriers that make it difficult for developers to reuse code from Stack Overflow are: (1) too much code modification required to fit in their projects, (2) incomprehensible code, and (3) low code quality. We summarized and analyzed all survey responses and we identified that developers suggest improvements for future Q&A platforms along the following dimensions: code quality, information enhancement & management, data organization, license, and the human factor. For instance, developers suggest improving the code quality by adding an integrated validator that can test source code online, and an outdated code detection mechanism. Our findings can be used as a roadmap for researchers and developers to improve code reuse.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Cor-Paul Bezemer; Simon Eismann; Vincenzo Ferme; Johannes Grohmann; Robert Heinrich; Pooyan Jamshidi; Weiyi Shang; André van Hoorn; Monica Villavicencio; Jürgen Walter; Felix Willnecker
How is performance addressed in DevOps? A Survey on Industrial Practices Inproceedings
Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE), 2019.
@inproceedings{devops_survey,
title = {How is performance addressed in DevOps? A Survey on Industrial Practices},
author = {Cor-Paul Bezemer and Simon Eismann and Vincenzo Ferme and Johannes Grohmann and Robert Heinrich and Pooyan Jamshidi and Weiyi Shang and André van Hoorn and Monica Villavicencio and Jürgen Walter and Felix Willnecker},
year = {2019},
date = {2019-04-04},
urldate = {2019-04-04},
booktitle = {Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (ICPE)},
abstract = {DevOps is a modern software engineering paradigm that is gaining widespread adoption in industry. The goal of DevOps is to bring software changes into production with a high frequency and fast feedback cycles. This conflicts with software quality assurance activities, particularly with respect to performance. For instance, performance evaluation activities — such as load testing — require a considerable amount of time to get statistically significant results. We conducted an industrial survey to get insights into how performance is addressed in industrial DevOps settings. In particular, we were interested in the frequency of executing performance evaluations, the tools being used, the granularity of the obtained performance data, and the use of model-based techniques. The survey responses, which come from a wide variety of participants from different industry sectors, indicate that the complexity of performance engineering approaches and tools is a barrier for wide-spread adoption of performance analysis in DevOps. The implication of our results is that performance analysis tools need to have a short learning curve, and should be easy to integrate into the DevOps pipeline in order to be adopted by practitioners.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Hanyang Hu; Shaowei Wang; Cor-Paul Bezemer; Ahmed E. Hassan
Studying the Consistency of Star Ratings, Reviews and Releases of Top Free Hybrid Android and iOS Apps Journal Article
The Empirical Software Engineering Journal (EMSE), 24 (1), pp. 7–32, 2019.
@article{hu17hybrid,
title = {Studying the Consistency of Star Ratings, Reviews and Releases of Top Free Hybrid Android and iOS Apps},
author = {Hanyang Hu and Shaowei Wang and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2019},
date = {2019-02-15},
urldate = {2019-02-15},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {24},
number = {1},
pages = {7--32},
publisher = {Springer},
abstract = {Nowadays, many developers make their mobile apps available on multiple platforms (e.g., Android and iOS). However, maintaining several versions of a cross-platform app that is built natively (i.e., using platform-specific tools) is a complicated task. Instead, developers can choose to use hybrid development tools, such as PhoneGap, to build hybrid apps. Hybrid apps are based on a single codebase across platforms. There exist two ways to use a hybrid development tool to build a hybrid app that runs on multiple platforms: (1) using web technologies (i.e., HTML, Javascript and CSS) and (2) in a common language, which is then converted to native code.
We study whether these hybrid development tools achieve their main purpose: delivering an app that is perceived similarly by users across platforms. Prior studies show that users refer to star ratings and user reviews when deciding to download an app. Given the importance of star ratings and user reviews, we study whether the usage of a hybrid development tool assists app developers in achieving consistency in the star ratings and user reviews across multiple platforms.
We study 68 hybrid app-pairs, i.e., apps that exist both in the Google Play store and Apple App store. We find that 33 out of 68 hybrid apps do not receive consistent star ratings across platforms. We run Twitter-LDA on user reviews and find that the star ratings of the reviews that discuss the same topic could be up to three times as high across platforms. Our findings suggest that while hybrid apps are better at providing consistent star ratings and user reviews when compared to cross-platform apps that are built natively, hybrid apps do not guarantee such consistency. Hence, developers should not solely rely on hybrid development tools to achieve consistency in the star ratings and reviews that are given by users of their apps. In particular, developers should track closely the ratings and reviews of their apps across platforms, so that they can act accordingly when platform-specific issues arise.},
keywords = {Mobile apps, Star rating, Twitter-LDA, User reviews},
pubstate = {published},
tppubtype = {article}
}
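The per-topic rating comparison can be sketched as follows; standard LDA from scikit-learn stands in for the Twitter-LDA variant used in the study, and the reviews and ratings are invented.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["app crashes on login", "love the new design", "login crash again",
           "great design and layout"]
ratings = [1, 5, 2, 4]

# Assign each review its dominant topic.
X = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X).argmax(axis=1)

# Average star rating per topic; the study compared such per-topic ratings
# between the Android and iOS versions of each app.
for t in set(topics):
    topic_ratings = [r for r, tt in zip(ratings, topics) if tt == t]
    print("topic", t, "average rating:", sum(topic_ratings) / len(topic_ratings))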
Mohamed Sami Rakha; Cor-Paul Bezemer; Ahmed E. Hassan
Revisiting the Performance of Automated Approaches for the Retrieval of Duplicate Reports in Issue Tracking Systems that Perform Just-in-Time Duplicate Retrieval Journal Article
The Empirical Software Engineering Journal (EMSE), 23 (5), pp. 2597–2621, 2018.
@article{sami17jit,
title = {Revisiting the Performance of Automated Approaches for the Retrieval of Duplicate Reports in Issue Tracking Systems that Perform Just-in-Time Duplicate Retrieval},
author = {Mohamed Sami Rakha and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2018},
date = {2018-12-01},
urldate = {2018-12-01},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {23},
number = {5},
pages = {2597--2621},
publisher = {Springer},
abstract = {Issue tracking systems (ITSs) allow software end-users and developers to file issue reports and change requests. Duplicate reports are frequently filed for the same software issue. The retrieval of these duplicate issue reports is a tedious manual task. Prior research proposed several automated approaches for the retrieval of duplicate issue reports. Recent versions of ITSs added a feature that does basic retrieval of duplicate issue reports at the filing time of an issue report in an effort to avoid the filing of duplicates as early as possible.
This paper investigates the impact of this just-in-time duplicate retrieval on the duplicate reports that end up in the ITS of an open source project. In particular, we study the differences between duplicate reports for open source projects before and after the activation of this new feature. We show how the experimental results of prior research would vary given the new data after the activation of the just-in-time duplicate retrieval feature. We study duplicate issue reports from the Mozilla-Firefox, Mozilla-Core and Eclipse-Platform projects. In addition, we compare the performance of the state of the art of the automated retrieval of duplicate reports using two popular approaches (i.e., BM25F and REP).
We find that duplicate issue reports after the activation of the just-in-time duplicate retrieval feature are less textually similar, have a greater identification delay and require more discussion to be retrieved as duplicate reports than duplicates before the activation of the feature. Prior work showed that REP outperforms BM25F in terms of Recall rate and Mean average precision. We observe that the performance gap between BM25F and REP becomes even larger after the activation of the just-in-time duplicate retrieval feature. We recommend that future studies focus on duplicates that were reported after the activation of the just-in-time duplicate retrieval feature as these duplicates are more representative of future incoming issue reports and therefore, give a better representation of the future performance of proposed approaches.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
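As a simplified illustration of the retrieval step, the Python sketch below scores a new issue report against existing reports with BM25Okapi from the rank-bm25 package. BM25F, as studied in the paper, additionally weights report fields such as title and description, and REP extends it further; neither is reproduced here, and the reports are invented.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

reports = [
    "crash when opening a new tab",
    "browser freezes on startup",
    "new tab crashes the browser",
]
bm25 = BM25Okapi([r.split() for r in reports])

# Rank existing reports by similarity to an incoming report; the top-ranked
# ones are the duplicate candidates shown to the reporter.
new_report = "crash after opening tab".split()
scores = bm25.get_scores(new_report)
print(sorted(zip(scores, reports), reverse=True))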
Fabian Beck; Alexandre Bergel; Cor-Paul Bezemer; Katherine E. Isaacs
Visualizing systems and software performance - Report on the GI-Dagstuhl seminar for young researchers, July 9-13, 2018 Technical Report
PeerJ Preprints, 6, pp. e27253v1, 2018, ISSN: 2167-9843.
@techreport{vssp2018,
title = {Visualizing systems and software performance - Report on the GI-Dagstuhl seminar for young researchers, July 9-13, 2018},
author = {Fabian Beck and Alexandre Bergel and Cor-Paul Bezemer and Katherine E. Isaacs},
issn = {2167-9843},
year = {2018},
date = {2018-07-09},
urldate = {2018-07-09},
journal = {PeerJ Preprints},
volume = {6},
pages = {e27253v1},
abstract = {This GI-Dagstuhl seminar addressed the problem of visualizing performance-related data of systems and the software that they run. Due to the scale of performance-related data and the open-ended nature of analyzing it, visualization is often the only feasible way to comprehend, improve, and debug the performance behaviour of systems. The rise of cloud and big data systems, and the rapidly growing scale of the performance-related data that they generate, have led to an increased need for visualization of such data. However, the research communities behind data visualization, performance engineering, and high-performance computing are largely disjoint. The goal of this seminar was to bring together young researchers from these research areas to identify cross-community collaboration and to set the path for long-lasting collaborations towards rich and effective visualizations of performance-related data.},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Dayi Lin; Cor-Paul Bezemer; Ying Zou; Ahmed E. Hassan
An Empirical Study of Game Reviews on the Steam Platform Journal Article
Empirical Software Engineering Journal (EMSE), 2018.
@article{Lin2018reviews,
title = {An Empirical Study of Game Reviews on the Steam Platform},
author = {Dayi Lin and Cor-Paul Bezemer and Ying Zou and Ahmed E. Hassan},
year = {2018},
date = {2018-06-15},
urldate = {2018-06-15},
journal = {Empirical Software Engineering Journal (EMSE)},
abstract = {The steadily increasing popularity of computer games has led to the rise of a multi-billion dollar industry. Due to the scale of the computer game industry, developing a successful game is challenging. In addition, prior studies show that gamers are extremely hard to please, making the quality of games an important issue. Most online game stores allow users to review a game that they bought. Such reviews can make or break a game, as other potential buyers often base their purchasing decisions on the reviews of a game. Hence, studying game reviews can help game developers better understand user concerns, and further improve the user-perceived quality of games.
In this paper, we perform an empirical study of the reviews of 6,224 games on the Steam platform, one of the most popular digital game delivery platforms, to better understand if game reviews share similar characteristics with mobile app reviews, and thereby understand whether the conclusions and tools from mobile app review studies can be leveraged by game developers. In addition, new insights from game reviews could possibly open up new research directions for research of mobile app reviews. We first conduct a preliminary study to understand the number of game reviews and the complexity to read through them. In addition, we study the relation between several game-specific characteristics and the fluctuations of the number of reviews that are received on a daily basis. We then focus on the useful information that can be acquired from reviews by studying the major concerns that users express in their reviews, and the amount of play time before players post a review. We find that game reviews are different from mobile app reviews along several aspects. Additionally, the number of playing hours before posting a review is a unique and helpful attribute for developers that is not found in mobile app reviews. Future longitudinal studies should be conducted to help developers and researchers leverage this information. Although negative reviews contain more valuable information about the negative aspects of the game, such as mentioned complaints and bug reports, developers and researchers should also not ignore the potentially useful information in positive reviews. Our study on game reviews serves as a starting point for other game review researchers, and suggests that prior studies on mobile app reviews may need to be revisited.},
keywords = {Computer games, Game reviews, Steam},
pubstate = {published},
tppubtype = {article}
}
Safwat Hassan; Chakkrit Tantithamthavorn; Cor-Paul Bezemer; Ahmed E. Hassan
Studying the Dialogue Between Users and Developers of Free Apps in the Google Play Store Journal Article
The Empirical Software Engineering Journal (EMSE), 23 (3), pp. 1275–1312, 2018.
@article{safwat16replies,
title = {Studying the Dialogue Between Users and Developers of Free Apps in the Google Play Store},
author = {Safwat Hassan and Chakkrit Tantithamthavorn and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2018},
date = {2018-06-01},
urldate = {2018-06-01},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {23},
number = {3},
pages = {1275--1312},
publisher = {Springer},
abstract = {The popularity of mobile apps continues to grow over the past few years. Mobile app stores, such as the Google Play Store and Apple’s App Store provide a unique user feedback mechanism to app developers through the possibility of posting app reviews. In the Google Play Store (and soon in the Apple App Store), developers are able to respond to such user feedback.
Over the past years, mobile app reviews have been studied extensively by researchers. However, much of prior work (including our own prior work) incorrectly assumes that reviews are static in nature and that users never update their reviews. In a recent study, we started analyzing the dynamic nature of the review-response mechanism. Our previous study showed that responding to a review often has a positive effect on the rating that is given by the user to an app.
In this paper, we revisit our prior finding in more depth by studying 4.5 million reviews with 126,686 responses for 2,328 top free-to-download apps in the Google Play Store. One of the major findings of our paper is that the assumption that reviews are static is incorrect. In particular, we find that developers and users in some cases use this response mechanism as a rudimentary user support tool, where dialogues emerge between users and developers through updated reviews and responses. Even though the messages are often simple, we find instances of as many as ten user-developer back-and-forth messages that occur via the response mechanism.
Using a mixed-effect model, we identify that the likelihood of a developer responding to a review increases as the review rating gets lower or as the review content gets longer. In addition, we identify four patterns of developers: 1) developers who primarily respond to only negative reviews, 2) developers who primarily respond to negative reviews or to reviews based on their contents, 3) developers who primarily respond to reviews which are posted shortly after the latest release of their app, and 4) developers who primarily respond to reviews which are posted long after the latest release of their app.
We perform a qualitative analysis of developer responses to understand what drives developers to respond to a review. We manually analyzed a statistically representative random sample of 347 reviews with responses for the top ten apps with the highest number of developer responses. We identify seven drivers that make a developer respond to a review, of which the most important ones are to thank the users for using the app and to ask the user for more details about the reported issue.
Our findings show that it can be worthwhile for app owners to respond to reviews, as responding may lead to an increase in the given rating. In addition, our findings show that studying the dialogue between user and developer can provide valuable insights that can lead to improvements in the app store and user support process.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Over the past years, mobile app reviews have been studied extensively by researchers. However, much of the prior work (including our own prior work) incorrectly assumes that reviews are static in nature and that users never update their reviews. In a recent study, we started analyzing the dynamic nature of the review-response mechanism. Our previous study showed that responding to a review often has a positive effect on the rating that is given by the user to an app.
In this paper, we revisit our prior finding in more depth by studying 4.5 million reviews with 126,686 responses for 2,328 top free-to-download apps in the Google Play Store. One of the major findings of our paper is that the assumption that reviews are static is incorrect. In particular, we find that developers and users in some cases use this response mechanism as a rudimentary user support tool, where dialogues emerge between users and developers through updated reviews and responses. Even though the messages are often simple, we find instances of as many as ten user-developer back-and-forth messages that occur via the response mechanism.
Using a mixed-effect model, we identify that the likelihood of a developer responding to a review increases as the review rating gets lower or as the review content gets longer. In addition, we identify four patterns of developers: 1) developers who primarily respond to only negative reviews, 2) developers who primarily respond to negative reviews or to reviews based on their contents, 3) developers who primarily respond to reviews which are posted shortly after the latest release of their app, and 4) developers who primarily respond to reviews which are posted long after the latest release of their app.
We perform a qualitative analysis of developer responses to understand what drives developers to respond to a review. We manually analyzed a statistically representative random sample of 347 reviews with responses for the top ten apps with the highest number of developer responses. We identify seven drivers that make a developer respond to a review, of which the most important ones are to thank the users for using the app and to ask the user for more details about the reported issue.
Our findings show that it can be worthwhile for app owners to respond to reviews, as responding may lead to an increase in the given rating. In addition, our findings show that studying the dialogue between user and developer can provide valuable insights that can lead to improvements in the app store and user support process.
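To make the modeling step described in this abstract concrete, the sketch below fits a simple logistic regression that predicts whether a developer responds to a review from the review's rating and length. The paper uses a mixed-effect model, which can account for per-app effects; this simplified sketch omits the random effect, and the input file and column names are hypothetical assumptions.

# Simplified sketch of the response-likelihood model described above.
# The paper fits a mixed-effect model; for brevity this sketch fits a plain
# logistic regression. File and column names are hypothetical assumptions.
import pandas as pd
import statsmodels.formula.api as smf

reviews = pd.read_csv("play_store_reviews.csv")  # hypothetical input
reviews["review_length"] = reviews["review_text"].str.len()

# responded: 1 if the developer posted a response to the review, else 0.
model = smf.logit("responded ~ rating + review_length", data=reviews).fit()
print(model.summary())
# Per the abstract, expect a negative coefficient for rating (lower ratings
# are more likely to get a response) and a positive one for review_length.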
Dayi Lin; Cor-Paul Bezemer; Ahmed E. Hassan
An Empirical Study of Early Access Games on the Steam Platform Journal Article
The Empirical Software Engineering Journal (EMSE), 23 (2), pp. 771–799, 2018.
Abstract | BibTeX | Tags: Computer games, Early access games, Steam
@article{Lin16eag,
title = {An Empirical Study of Early Access Games on the Steam Platform},
author = {Dayi Lin and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2018},
date = {2018-04-01},
urldate = {2018-04-01},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {23},
number = {2},
pages = {771--799},
publisher = {Springer},
abstract = {“Early access” is a release strategy for software that allows consumers to purchase an unfinished version of the software. In turn, consumers can influence the software development process by giving developers early feedback. This early access model has become increasingly popular through digital distribution platforms, such as Steam, which is the most popular distribution platform for games. The plethora of options offered by Steam for communication between developers and game players contributes to the popularity of the early access model. The model is considered a success by the game development community, as several games using this approach have gained a large user base (i.e., owners) and high sales. On the other hand, the benefits of the early access model have been questioned as well.
In this paper, we conduct an empirical study on 1,182 Early Access Games (EAGs) on the Steam platform to understand the characteristics, advantages and limitations of the early access model. We find that 15% of the games on Steam make use of the early access model, with the most popular EAG having as many as 29 million owners. 88% of the EAGs are classified by their developers as so-called “indie” games, indicating that most EAGs are developed by individual developers or small studios.
We study the interaction between players and developers of EAGs and the Steam platform. We observe that on the one hand, developers update their games more frequently in the early access stage. On the other hand, the percentage of players that review a game during its early access stage is lower than the percentage of players that review the game after it leaves the early access stage. However, the average rating of the reviews is much higher during the early access stage, suggesting that players are more tolerant of imperfections in the early access stage. The positive review rate does not correlate with the length or the game update frequency of the early access stage.
Based on our findings, we suggest that game developers use the early access model as a method for eliciting early feedback and more positive reviews to attract additional new players. In addition, our findings suggest that developers can determine their release schedule without worrying about the length of the early access stage or the game update frequency during that stage.},
keywords = {Computer games, Early access games, Steam},
pubstate = {published},
tppubtype = {article}
}
In this paper, we conduct an empirical study on 1,182 Early Access Games (EAGs) on the Steam platform to understand the characteristics, advantages and limitations of the early access model. We find that 15% of the games on Steam make use of the early access model, with the most popular EAG having as many as 29 million owners. 88% of the EAGs are classified by their developers as so-called “indie” games, indicating that most EAGs are developed by individual developers or small studios.
We study the interaction between players and developers of EAGs and the Steam platform. We observe that on the one hand, developers update their games more frequently in the early access stage. On the other hand, the percentage of players that review a game during its early access stage is lower than the percentage of players that review the game after it leaves the early access stage. However, the average rating of the reviews is much higher during the early access stage, suggesting that players are more tolerant of imperfections in the early access stage. The positive review rate does not correlate with the length or the game update frequency of the early access stage.
Based on our findings, we suggest that game developers use the early access model as a method for eliciting early feedback and more positive reviews to attract additional new players. In addition, our findings suggest that developers can determine their release schedule without worrying about the length of the early access stage or the game update frequency during that stage.
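The correlation claim in this abstract (positive review rate versus early access length and update frequency) can be checked with a rank correlation, as in the hedged sketch below. The input file and column names are hypothetical assumptions, not the paper's artifacts.

# Sketch of the correlation check described above, using Spearman rank
# correlation. The CSV file and column names are hypothetical assumptions.
import pandas as pd
from scipy.stats import spearmanr

eags = pd.read_csv("early_access_games.csv")  # hypothetical input

for col in ["early_access_days", "updates_per_month"]:
    rho, p = spearmanr(eags["positive_review_rate"], eags[col])
    print(f"{col}: rho={rho:.2f}, p={p:.3f}")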
Hanyang Hu; Cor-Paul Bezemer; Ahmed E. Hassan
Studying the Consistency of Star Ratings and Complaints in 1 & 2-Star User Reviews for Top Free Cross-Platform Android and iOS Apps Journal Article
The Empirical Software Engineering (EMSE) journal, 23 (6), pp. 3442–3475, 2018.
Abstract | BibTeX | Tags: Mobile apps, Star rating, User reviews
@article{hu16crossplatform,
title = {Studying the Consistency of Star Ratings and Complaints in 1 & 2-Star User Reviews for Top Free Cross-Platform Android and iOS Apps},
author = {Hanyang Hu and Cor-Paul Bezemer and Ahmed E. Hassan},
year = {2018},
date = {2018-03-22},
urldate = {2018-03-22},
journal = {The Empirical Software Engineering (EMSE) journal},
volume = {23},
number = {6},
pages = {3442--3475},
publisher = {Springer},
abstract = {How users rate a mobile app via star ratings and user reviews is of utmost importance for the success of an app. Recent studies and surveys show that users rely heavily on star ratings and user reviews that are provided by other users, for deciding which app to download. However, understanding star ratings and user reviews is a complicated matter, since they are influenced by many factors such as the actual quality of the app and how the user perceives such quality relative to their expectations, which are in turn influenced by their prior experiences and expectations relative to other apps on the platform (e.g., iOS versus Android). Nevertheless, star ratings and user reviews provide developers with valuable information for improving the software quality of their app.
In an effort to expand their revenue and reach more users, app developers commonly build cross-platform apps, i.e., apps that are available on multiple platforms. As star ratings and user reviews are of such importance in the mobile app industry, it is essential for developers of cross-platform apps to maintain a consistent level of star ratings and user reviews for their apps across the various platforms on which they are available.
In this paper, we investigate whether cross-platform apps achieve a consistent level of star ratings and user reviews. We manually identify 19 cross-platform apps and conduct an empirical study on their star ratings and user reviews. By manually tagging 9,902 1 & 2-star reviews of the studied cross-platform apps, we discover that the distribution of the frequency of complaint types varies across platforms. Finally, we study the negative impact ratio of complaint types and find that for some apps, users have higher expectations on one platform. All our proposed techniques and our methodologies are generic and can be used for any app. Our findings show that at least 68% of the studied cross-platform apps do not have consistent star ratings, which suggests that different quality assurance efforts need to be considered by developers for the different platforms that they wish to support.},
keywords = {Mobile apps, Star rating, User reviews},
pubstate = {published},
tppubtype = {article}
}
In an effort to expand their revenue and reach more users, app developers commonly build cross-platform apps, i.e., apps that are available on multiple platforms. As star ratings and user reviews are of such importance in the mobile app industry, it is essential for developers of cross-platform apps to maintain a consistent level of star ratings and user reviews for their apps across the various platforms on which they are available.
In this paper, we investigate whether cross-platform apps achieve a consistent level of star ratings and user reviews. We manually identify 19 cross-platform apps and conduct an empirical study on their star ratings and user reviews. By manually tagging 9,902 1 & 2-star reviews of the studied cross-platform apps, we discover that the distribution of the frequency of complaint types varies across platforms. Finally, we study the negative impact ratio of complaint types and find that for some apps, users have higher expectations on one platform. All our proposed techniques and our methodologies are generic and can be used for any app. Our findings show that at least 68% of the studied cross-platform apps do not have consistent star ratings, which suggests that different quality assurance efforts need to be considered by developers for the different platforms that they wish to support.
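One straightforward way to test whether complaint-type frequency distributions differ across platforms, as this abstract reports, is a chi-squared test on a complaint-type-by-platform contingency table. The sketch below illustrates that idea; it is not the paper's methodology, and the input file and column names are hypothetical assumptions.

# Sketch: compare the distribution of complaint types across platforms with
# a chi-squared test. Not the paper's methodology; file and column names
# are hypothetical assumptions (one row per manually tagged 1/2-star review).
import pandas as pd
from scipy.stats import chi2_contingency

tagged = pd.read_csv("tagged_reviews.csv")
table = pd.crosstab(tagged["complaint_type"], tagged["platform"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4f}")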
Suhas Kabinna; Cor-Paul Bezemer; Weiyi Shang; Mark D. Syer; Ahmed E. Hassan
Examining the Stability of Logging Statements Journal Article
The Empirical Software Engineering (EMSE) journal, 23 (1), pp. 290–333, 2018.
Abstract | BibTeX | Tags: Log file stability, Log processing tools, Logging statements
@article{kabinna16emse,
title = {Examining the Stability of Logging Statements},
author = {Suhas Kabinna and Cor-Paul Bezemer and Weiyi Shang and Mark D. Syer and Ahmed E. Hassan},
year = {2018},
date = {2018-02-01},
urldate = {2018-02-01},
journal = {The Empirical Software Engineering (EMSE) journal},
volume = {23},
number = {1},
pages = {290--333},
publisher = {Springer},
abstract = {Logging statements (embedded in the source code) produce logs that assist in understanding system behavior, monitoring choke-points and debugging. Prior work showcases the importance of logging statements in operating, understanding and improving software systems. The wide dependence on logs has led to a new market of log processing and management tools. However, logs are often unstable, i.e., the logging statements that generate logs are often changed without the consideration of other stakeholders, causing sudden failures of log processing tools and increasing the maintenance costs of such tools. We examine the stability of logging statements in four open source applications, namely Liferay, ActiveMQ, Camel and CloudStack. We find that 20–45% of their logging statements change throughout their lifetime. The median number of days between the introduction of a logging statement and the first change to that statement is between 1 and 17 in our studied applications. These numbers show that in order to reduce maintenance effort, developers of log processing tools must be careful when selecting the logging statements on which their tools depend.
In order to effectively mitigate the issues that are caused by unstable logging statements, we make an important first step towards determining whether a logging statement is likely to remain unchanged in the future. First, we use a random forest classifier to determine whether a just-introduced logging statement will change in the future, based solely on metrics that are calculated when it is introduced. Second, we examine whether a long-lived logging statement is likely to change based on its change history. We leverage Cox proportional hazards models (Cox models) to determine the change risk of long-lived logging statements in the source code. Through our case study on four open source applications, we show that our random forest classifier achieves an 83–91% precision, a 65–85% recall and a 0.95–0.96 AUC. We find that file ownership, developer experience, log density and SLOC are important metrics in our studied projects for determining the stability of logging statements in both our random forest classifiers and Cox models. Developers can use our approach to determine the risk of a logging statement changing in their own projects, and to construct more robust log processing tools by ensuring that these tools depend on logs that are generated by more stable logging statements.},
keywords = {Log file stability, Log processing tools, Logging statements},
pubstate = {published},
tppubtype = {article}
}
In order to effectively mitigate the issues that are caused by unstable logging statements, we make an important first step towards determining whether a logging statement is likely to remain unchanged in the future. First, we use a random forest classifier to determine whether a just-introduced logging statement will change in the future, based solely on metrics that are calculated when it is introduced. Second, we examine whether a long-lived logging statement is likely to change based on its change history. We leverage Cox proportional hazards models (Cox models) to determine the change risk of long-lived logging statements in the source code. Through our case study on four open source applications, we show that our random forest classifier achieves an 83–91% precision, a 65–85% recall and a 0.95–0.96 AUC. We find that file ownership, developer experience, log density and SLOC are important metrics in our studied projects for determining the stability of logging statements in both our random forest classifiers and Cox models. Developers can use our approach to determine the risk of a logging statement changing in their own projects, and to construct more robust log processing tools by ensuring that these tools depend on logs that are generated by more stable logging statements.
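The two modeling steps described in this abstract map naturally onto common Python libraries. The sketch below shows the general shape of such an analysis with scikit-learn (random forest) and lifelines (Cox model); it is an illustration under assumed file, feature and column names, not the paper's replication package.

# Sketch of the two models described above. File, feature and column names
# are hypothetical assumptions; this is not the paper's replication package.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from lifelines import CoxPHFitter

stmts = pd.read_csv("logging_statements.csv")  # hypothetical input
features = ["file_ownership", "developer_experience", "log_density", "sloc"]

# Step 1: will a just-introduced logging statement change in the future?
X_train, X_test, y_train, y_test = train_test_split(
    stmts[features], stmts["changed"], test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("AUC:      ", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# Step 2: change risk of long-lived logging statements via a Cox model.
surv = stmts[features + ["days_until_change", "changed"]]
cph = CoxPHFitter()
cph.fit(surv, duration_col="days_until_change", event_col="changed")
cph.print_summary()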
Cor-Paul Bezemer; Shane McIntosh; Bram Adams; Daniel M. German; Ahmed E. Hassan
An Empirical Study of Unspecified Dependencies in Make-Based Build Systems Journal Article
The Empirical Software Engineering Journal (EMSE), 22 (6), pp. 3117–3148, 2017.
Abstract | BibTeX | Tags: Build systems, Unspecified dependencies
@article{bezemer16unspecified,
title = {An Empirical Study of Unspecified Dependencies in Make-Based Build Systems},
author = {Cor-Paul Bezemer and Shane McIntosh and Bram Adams and Daniel M. German and Ahmed E. Hassan},
year = {2017},
date = {2017-12-01},
urldate = {2017-12-01},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {22},
number = {6},
pages = {3117--3148},
publisher = {Springer},
abstract = {Software developers rely on a build system to compile their source code changes and produce deliverables for testing and deployment. Since the full build of large software systems can take hours, the incremental build is a cornerstone of modern build systems. Incremental builds should only recompile deliverables whose dependencies have been changed by a developer. However, in many organizations, such dependencies are still identified by build rules that are specified and maintained (mostly) manually, typically using technologies like make. Incomplete rules lead to unspecified dependencies that can prevent certain deliverables from being rebuilt, yielding incomplete results, which leave sources and deliverables out-of-sync.
In this paper, we present a case study on unspecified dependencies in the make-based build systems of the GLIB, OPENLDAP, LINUX and QT open source projects. To uncover unspecified dependencies in make-based build systems, we use an approach that combines a conceptual model of the dependencies specified in the build system with a concrete model of the files and processes that are actually exercised during the build. Our approach provides an overview of the dependencies that are used throughout the build system and reveals unspecified dependencies that are not yet expressed in the build system rules. During our analysis, we find that unspecified dependencies are common. We identify 6 common causes in more than 1.2 million unspecified dependencies.},
keywords = {Build systems, Unspecified dependencies},
pubstate = {published},
tppubtype = {article}
}
In this paper, we present a case study on unspecified dependencies in the make-based build systems of the GLIB, OPENLDAP, LINUX and QT open source projects. To uncover unspecified dependencies in make-based build systems, we use an approach that combines a conceptual model of the dependencies specified in the build system with a concrete model of the files and processes that are actually exercised during the build. Our approach provides an overview of the dependencies that are used throughout the build system and reveals unspecified dependencies that are not yet expressed in the build system rules. During our analysis, we find that unspecified dependencies are common. We identify 6 common causes in more than 1.2 million unspecified dependencies.
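At its core, the comparison described in this abstract contrasts what the build system declares with what the build actually touches. The heavily simplified sketch below illustrates that set difference, assuming the declared prerequisites (e.g., extracted from `make -p` output) and the files actually read during a traced build (e.g., recovered from an strace log) have already been collected into two text files; the file names and formats are assumptions.

# Heavily simplified sketch of the core comparison described above: files
# that the build actually reads but that are not declared as prerequisites
# are candidate unspecified dependencies. The paper's approach builds much
# richer conceptual and concrete models; input files here are assumptions.
def load_paths(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

specified = load_paths("specified_deps.txt")  # one declared prerequisite per line
accessed = load_paths("accessed_files.txt")   # one traced file access per line

for candidate in sorted(accessed - specified):
    print("possibly unspecified dependency:", candidate)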
Mojtaba Bagherzadeh; Nafiseh Kahani; Cor-Paul Bezemer; Ahmed E. Hassan; Juergen Dingel; James R. Cordy
Analyzing a Decade of Linux System Calls Journal Article
The Empirical Software Engineering Journal (EMSE), 23 (3), pp. 1519–1551, 2017.
@article{moji16systemcalls,
title = {Analyzing a Decade of Linux System Calls},
author = {Mojtaba Bagherzadeh and Nafiseh Kahani and Cor-Paul Bezemer and Ahmed E. Hassan and Juergen Dingel and James R. Cordy},
year = {2017},
date = {2017-10-13},
urldate = {2017-10-13},
journal = {The Empirical Software Engineering Journal (EMSE)},
volume = {23},
number = {3},
pages = {1519--1551},
abstract = {Over the past 25 years, thousands of developers have contributed more than 18 million lines of code (LOC) to the Linux kernel. As the Linux kernel forms the central part of various operating systems that are used by millions of users, the kernel must be continuously adapted to the changing demands and expectations of these users. The Linux kernel provides its services to an application through system calls. The combined set of all system calls forms the essential Application Programming Interface (API) through which an application interacts with the kernel.
In this paper, we conduct an empirical study of 8,770 changes that were made to Linux system calls during the last decade (i.e., from April 2005 to December 2014). In particular, we study the size of the changes, and we manually identify the type of changes and bug fixes that were made.
Our analysis provides an overview of the evolution of the Linux system calls over the last decade. We find that there was a considerable amount of technical debt in the kernel, which was addressed by adding a number of sibling calls (i.e., 26% of all system calls). In addition, we find that, by far, the ptrace() and signal handling system calls are the most challenging to maintain.
Our study can be used by developers who want to improve the design and ensure the successful evolution of their own kernel APIs.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
In this paper, we conduct an empirical study of 8,770 changes that were made to Linux system calls during the last decade (i.e., from April 2005 to December 2014). In particular, we study the size of the changes, and we manually identify the type of changes and bug fixes that were made.
Our analysis provides an overview of the evolution of the Linux system calls over the last decade. We find that there was a considerable amount of technical debt in the kernel, which was addressed by adding a number of sibling calls (i.e., 26% of all system calls). In addition, we find that, by far, the ptrace() and signal handling system calls are the most challenging to maintain.
Our study can be used by developers who want to improve the design and ensure the successful evolution of their own kernel APIs.
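To make the notion of a sibling call concrete: a well-known example from the studied decade is openat(2) (added in Linux 2.6.16, 2006), which extends open(2) with a directory file descriptor so that relative paths can be resolved without race conditions. The sketch below (ours, not the paper's) shows the pair through Python's os module, which exposes openat via the dir_fd parameter, and assumes a typical Linux system where /etc/hostname exists.

# Illustration (ours, not the paper's) of a sibling system call: openat(2)
# extends open(2) with a directory file descriptor. Python's os.open maps
# dir_fd onto openat. Assumes a Linux system where /etc/hostname exists.
import os

dir_fd = os.open("/etc", os.O_RDONLY)  # plain open(2) on the directory
try:
    # Resolved relative to dir_fd via openat(2), not the working directory.
    fd = os.open("hostname", os.O_RDONLY, dir_fd=dir_fd)
    try:
        print(os.read(fd, 64).decode().strip())
    finally:
        os.close(fd)
finally:
    os.close(dir_fd)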