Simon Eismann; Diego Costa; Lizhi Liao; Cor-Paul Bezemer; Weiyi Shang; André van Hoorn; Samuel Kounev
A Case Study on the Stability of Performance Tests for Serverless Applications Journal Article
Journal of Systems and Software, 2022.
Abstract | BibTeX | Tags: Performance engineering, Performance regressions, Performance testing, Serverless
@article{EismannJSS2022,
title = {A Case Study on the Stability of Performance Tests for Serverless Applications},
author = {Simon Eismann and Diego Costa and Lizhi Liao and Cor-Paul Bezemer and Weiyi Shang and André van Hoorn and Samuel Kounev},
year = {2022},
date = {2022-03-17},
urldate = {2022-03-17},
journal = {Journal of Systems and Software},
abstract = {Context. While in serverless computing, application resource management and operational concerns are generally delegated to the cloud provider, ensuring that serverless applications meet their performance requirements is still a responsibility of the developers. Performance testing is a commonly used performance assessment practice; however, it traditionally requires visibility of the resource environment.
Objective. In this study, we investigate whether performance tests of serverless applications are stable, that is, if their results are reproducible, and what implications the serverless paradigm has for performance tests.
Method. We conduct a case study where we collect two datasets of performance test results: (a) repetitions of performance tests for varying memory size and load intensities and (b) three repetitions of the same performance test every day for ten months.
Results. We find that performance tests of serverless applications are comparatively stable if conducted on the same day. However, we also observe short-term performance variations and frequent long-term performance changes.
Conclusion. Performance tests for serverless applications can be stable; however, the serverless model impacts the planning, execution, and analysis of performance tests.},
keywords = {Performance engineering, Performance regressions, Performance testing, Serverless},
pubstate = {published},
tppubtype = {article}
}
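The stability analysis described in this abstract can be illustrated with a small sketch: comparing the spread of a summary metric across repeated runs of the same performance test. The class name, the choice of mean response time as the metric, and the example numbers below are illustrative assumptions and are not taken from the paper or its dataset.

import java.util.List;

// Illustrative sketch (not from the paper's replication package): one common way to
// quantify how stable repeated performance test results are is the coefficient of
// variation (CV) of a summary metric, e.g. the mean response time of each repetition.
// A low CV across same-day repetitions and a higher CV across days would mirror the
// paper's observation of short-term stability but frequent long-term changes.
public class StabilityCheck {

    // meanResponseTimesMs: mean response time (ms) of each repeated test run
    static double coefficientOfVariation(List<Double> meanResponseTimesMs) {
        double mean = meanResponseTimesMs.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = meanResponseTimesMs.stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        return Math.sqrt(variance) / mean; // relative spread of the repetitions
    }

    public static void main(String[] args) {
        // Hypothetical numbers: three same-day repetitions vs. runs from different days
        List<Double> sameDay = List.of(212.0, 215.0, 210.0);
        List<Double> acrossDays = List.of(212.0, 198.0, 241.0);
        System.out.printf("same-day CV:    %.3f%n", coefficientOfVariation(sameDay));
        System.out.printf("across-days CV: %.3f%n", coefficientOfVariation(acrossDays));
    }
}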
Hammam M. AlGhamdi; Cor-Paul Bezemer; Weiyi Shang; Ahmed E. Hassan; Parminder Flora
Towards Reducing the Time Needed for Load Testing Journal Article
Journal of Software Evolution and Process (JSEP), 2020.
Abstract | BibTeX | Tags: Load testing, Performance analysis, Performance testing
@article{AlGhamdi2020loadtests,
title = {Towards Reducing the Time Needed for Load Testing},
author = {Hammam M. AlGhamdi and Cor-Paul Bezemer and Weiyi Shang and Ahmed E. Hassan and Parminder Flora},
year = {2020},
date = {2020-05-12},
urldate = {2020-05-12},
journal = {Journal of Software Evolution and Process (JSEP)},
abstract = {The performance of large-scale systems must be thoroughly tested under various levels of workload, as load-related issues can have a disastrous impact on the system. However, load tests often require a large amount of time, running from hours to even days, to execute. Nowadays, with the increased popularity of rapid releases and continuous deployment, testing time is at a premium and should be minimized while still delivering a complete test of the system. In our prior work, we proposed to reduce the execution time of a load test by detecting repetitiveness in individual performance metric values, such as CPU utilization or memory usage, that are observed during the test. However, as we explain in this paper, disregarding combinations of performance metrics may miss important information about the load-related behaviour of a system.
Therefore, in this paper we revisit our prior approach, by proposing a new approach that reduces the execution time of a load test by detecting whether a test no longer exercises new combinations of the observed performance metrics. We conduct an experimental case study on three open source systems (CloudStore, PetClinic, and Dell DVD Store 2), in which we use our new and prior approaches to reduce the execution time of a 24-hour load test. We show that our new approach is capable of reducing the execution time of the test to less than 8.5 hours, while preserving a coverage of at least 95% of the combinations that are observed between the performance metrics during the 24-hour tests. In addition, we show that our prior approach recommends a stopping time that is too early for two of the three studied systems. Finally, we discuss the challenges of applying our approach to an industrial setting, and we call upon the community to help us to address these challenges.},
keywords = {Load testing, Performance analysis, Performance testing},
pubstate = {published},
tppubtype = {article}
}
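The stopping criterion sketched in this abstract, namely stopping a load test once it no longer exercises new combinations of the observed performance metrics, can be illustrated as follows. This is a minimal sketch of the idea, not the authors' implementation; the metric names, the 10%-wide buckets, and the stopping window are made-up parameters.

import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the idea described in the abstract (not the authors' tool):
// discretize the performance metrics observed in each monitoring interval into coarse
// buckets, remember which bucket combinations have been seen, and recommend stopping
// the load test once no new combination has appeared for a configurable number of
// intervals.
public class CombinationCoverageStopper {
    private final Set<String> seenCombinations = new HashSet<>();
    private final int stopAfterIntervalsWithoutNew; // e.g. 60 intervals without a new combination
    private int intervalsWithoutNew = 0;

    CombinationCoverageStopper(int stopAfterIntervalsWithoutNew) {
        this.stopAfterIntervalsWithoutNew = stopAfterIntervalsWithoutNew;
    }

    // Returns true once the test can be stopped. Metrics are bucketed into 10%-wide bins.
    boolean observe(double cpuUtilization, double memoryUsage, double diskIo) {
        String combination = bucket(cpuUtilization) + "/" + bucket(memoryUsage) + "/" + bucket(diskIo);
        if (seenCombinations.add(combination)) {
            intervalsWithoutNew = 0;    // a new combination of metric values was exercised
        } else {
            intervalsWithoutNew++;      // nothing new observed in this interval
        }
        return intervalsWithoutNew >= stopAfterIntervalsWithoutNew;
    }

    private static int bucket(double percentage) {
        return (int) (percentage / 10); // 0-100% mapped to buckets 0..10
    }
}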
Diego Costa; Cor-Paul Bezemer; Philipp Leitner; Artur Andrzejak
What's Wrong With My Benchmark Results? Studying Bad Practices in JMH Benchmarks Journal Article
IEEE Transactions on Software Engineering (TSE), 2019.
Abstract | BibTeX | Tags: Bad practices, JMH, Microbenchmarking, Performance testing, Static analysis
@article{diego_tse,
title = {What's Wrong With My Benchmark Results? Studying Bad Practices in JMH Benchmarks},
author = {Diego Costa and Cor-Paul Bezemer and Philipp Leitner and Artur Andrzejak},
year = {2019},
date = {2019-06-17},
urldate = {2019-06-17},
journal = {IEEE Transactions on Software Engineering (TSE)},
publisher = {IEEE},
abstract = {Microbenchmarking frameworks, such as Java’s Microbenchmark Harness (JMH), allow developers to write fine-grained performance test suites at the method or statement level. However, due to the complexities of the Java Virtual Machine, developers often struggle with writing expressive JMH benchmarks which accurately represent the performance of such methods or statements. In this paper, we empirically study bad practices of JMH benchmarks. We present a tool that leverages static analysis to identify 5 bad JMH practices. Our empirical study of 123 open source Java-based systems shows that each of these 5 bad practices is prevalent in open source software. Further, we conduct several experiments to quantify the impact of each bad practice in multiple case studies, and find that bad practices often significantly impact the benchmark results. To validate our experimental results, we constructed seven patches that fix the identified bad practices for six of the studied open source projects, of which six were merged into the main branch of the project. In this paper, we show that developers struggle with accurate Java microbenchmarking, and provide several recommendations to developers of microbenchmarking frameworks on how to improve future versions of their framework.},
keywords = {Bad practices, JMH, Microbenchmarking, Performance testing, Static analysis},
pubstate = {published},
tppubtype = {article}
}
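As an illustration of the kind of issue such a static analysis can flag, the snippet below shows one well-known JMH pitfall: a benchmark whose computed value is never consumed and may therefore be removed by the JIT compiler, together with two standard fixes. The benchmark class and the measured operation are illustrative assumptions and are not taken from the paper's studied projects.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Illustrative JMH benchmark (not from the paper's dataset): a computed value that is
// never consumed can be dead-code eliminated, so the benchmark measures almost nothing.
@State(Scope.Benchmark)
public class LogBenchmark {
    private double x = 42.0;

    @Benchmark
    public void badIgnoresResult() {
        Math.log(x);                 // result discarded: the call may be optimized away
    }

    @Benchmark
    public double goodReturnsResult() {
        return Math.log(x);          // returning the value forces JMH to consume it
    }

    @Benchmark
    public void goodUsesBlackhole(Blackhole bh) {
        bh.consume(Math.log(x));     // explicitly sinking the value has the same effect
    }
}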
Philipp Leitner; Cor-Paul Bezemer
An Exploratory Study of the State of Practice of Performance Testing in Java-based Open Source Projects Inproceedings
The International Conference on Performance Engineering (ICPE), pp. 373–384, ACM/SPEC, 2017.
Abstract | BibTeX | Tags: Empirical software engineering, Mining software repositories, Open source, Performance engineering, Performance testing
@inproceedings{leitner16oss,
title = {An Exploratory Study of the State of Practice of Performance Testing in Java-based Open Source Projects},
author = {Philipp Leitner and Cor-Paul Bezemer},
year = {2017},
date = {2017-04-22},
urldate = {2017-04-22},
booktitle = {The International Conference on Performance Engineering (ICPE)},
pages = {373--384},
publisher = {ACM/SPEC},
abstract = {The usage of open source (OS) software is nowadays widespread across many industries and domains. While the functional quality of OS projects is considered to be up to par with that of closed-source software, much is unknown about the quality in terms of non-functional attributes, such as performance. One challenge for OS developers is that, unlike for functional testing, there is a lack of accepted best practices for performance testing.
To reveal the state of practice of performance testing in OS projects, we conduct an exploratory study on 111 Java-based OS projects from GitHub. We study the performance tests of these projects from five perspectives: (1) the developers, (2) the size, (3) the organization, (4) the types of performance tests, and (5) the tooling used for performance testing.
First, in a quantitative study we show that writing performance tests is not a popular task in OS projects: performance tests form only a small portion of the test suite, are rarely updated, and are usually maintained by a small group of core project developers. Second, we show through a qualitative study that even though many projects are aware that they need performance tests, developers appear to struggle implementing them. We argue that future performance testing frameworks should provide better support for low-friction testing, for instance via non-parameterized methods or performance test generation, as well as focus on a tight integration with standard continuous integration tooling.},
keywords = {Empirical software engineering, Mining software repositories, Open source, Performance engineering, Performance testing},
pubstate = {published},
tppubtype = {inproceedings}
}