1,208 results for "Harman, Mark"
Search Results
2. Borges Translates Joyce Who Translates Himself
- Author
- Harman, Mark
- Published
- 2022
3. Enhancing Testing at Meta with Rich-State Simulated Populations
- Author
- Alshahwan, Nadia, Blasi, Arianna, Bojarczuk, Kinga, Ciancone, Andrea, Gucevska, Natalija, Harman, Mark, Schellaert, Simon, Harper, Inna, Jia, Yue, Królikowski, Michał, Lewis, Will, Martac, Dragos, Rojas, Rubmary, and Ustiuzhanina, Kate
- Subjects
- Computer Science - Software Engineering
- Abstract
This paper reports the results of the deployment of Rich-State Simulated Populations at Meta for both automated and manual testing. We use simulated users (aka test users) to mimic user interactions and acquire state in much the same way that real user accounts acquire state. For automated testing, we present empirical results from deployment on the Facebook, Messenger, and Instagram apps for iOS and Android Platforms. These apps consist of tens of millions of lines of code, communicating with hundreds of millions of lines of backend code, and are used by over 2 billion people every day. Our results reveal that rich state increases average code coverage by 38%, and endpoint coverage by 61%. More importantly, it also yields an average increase of 115% in the faults found by automated testing. The rich-state test user populations are also deployed in a (continually evolving) Test Universe; a web-enabled simulation platform for privacy-safe manual testing, which has been used by over 21,000 Meta engineers since its deployment in November 2022., Comment: ICSE 2024
- Published
- 2024
4. Automated Unit Test Improvement using Large Language Models at Meta
- Author
- Alshahwan, Nadia, Chheda, Jubin, Finegenova, Anastasia, Gokkaya, Beliz, Harman, Mark, Harper, Inna, Marginean, Alexandru, Sengupta, Shubho, and Wang, Eddy
- Subjects
- Computer Science - Software Engineering
- Abstract
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement., Comment: 12 pages, 8 figures, 32nd ACM Symposium on the Foundations of Software Engineering (FSE 24)
- Published
- 2024
5. Observation-based unit test generation at Meta
- Author
- Alshahwan, Nadia, Harman, Mark, Marginean, Alexandru, Tal, Rotem, and Wang, Eddy
- Subjects
- Computer Science - Software Engineering
- Abstract
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. We describe the development and deployment of TestGen at Meta. In particular, we focus on the scalability challenges overcome during development in order to deploy observation-based test carving at scale in industry. So far, TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Meta is currently in the process of more widespread deployment. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests. Testing on 16 Kotlin Instagram app-launch-blocking tasks demonstrated that the TestGen tests would have trapped 13 of these before they became launch blocking., Comment: 12 pages, 8 figures, FSE 2024, Mon 15 - Fri 19 July 2024, Porto de Galinhas, Brazil
- Published
- 2024
6. Assured LLM-Based Software Engineering
- Author
- Alshahwan, Nadia, Harman, Mark, Harper, Inna, Marginean, Alexandru, Sengupta, Shubho, and Wang, Eddy
- Subjects
- Computer Science - Software Engineering
- Abstract
In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code (i) does not regress the properties of the original code and (ii) improves the original in a verifiable and measurable way? To address this question, we advocate Assured LLM-Based Software Engineering; a generate-and-test approach, inspired by Genetic Improvement. Assured LLMSE applies a series of semantic filters that discard code that fails to meet these twin guarantees. This overcomes the potential problem of LLMs' propensity to hallucinate. It allows us to generate code using LLMs, independently of any human. The human plays the role only of final code reviewer, as they would do with code generated by other human engineers. This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal., Comment: 6 pages, 1 figure, InteNSE 24: ACM International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, April, 2024, Lisbon, Portugal
- Published
- 2024
7. Large Language Models for Software Engineering: Survey and Open Problems
- Author
- Fan, Angela, Gokkaya, Beliz, Harman, Mark, Lyubarskiy, Mitya, Sengupta, Shubho, Yoo, Shin, and Zhang, Jie M.
- Subjects
- Computer Science - Software Engineering
- Abstract
This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.
- Published
- 2023
8. Large Language Models in Fault Localisation
- Author
- Wu, Yonghao, Li, Zheng, Zhang, Jie M., Papadakis, Mike, Harman, Mark, and Liu, Yong
- Subjects
- Computer Science - Software Engineering
- Abstract
Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, program repair, code summarisation, and test generation. Fault localisation is instrumental in enabling automated debugging and repair of programs and was prominently featured as a highlight during the launch event of ChatGPT-4. Nevertheless, the performance of LLMs compared to state-of-the-art methods, as well as the impact of prompt design and context length on their efficacy, remains unclear. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted large-scale Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the consistency of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness. Our findings demonstrate that within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and consistency, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL on the Defects4J dataset in terms of TOP-1 metric. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop, with 49.9% lower accuracy than SmartFL under TOP-1 metric. These observations indicate that although ChatGPT can effectively localise faults under specific conditions, limitations are evident. Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
- Published
- 2023
9. COCO: Testing Code Generation Systems via Concretized Instructions
- Author
- Yan, Ming, Chen, Junjie, Zhang, Jie M., Cao, Xuejie, Yang, Chen, and Harman, Mark
- Subjects
- Computer Science - Software Engineering
- Abstract
Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly different instructions can result in significantly different code semantics. Robustness is critical for code generation systems, as it can have significant impacts on software development, software quality, and trust in the generated code. Although existing testing techniques for general text-to-text software can detect some robustness issues, they are limited in effectiveness due to ignoring the characteristics of code generation systems. In this work, we propose a novel technique COCO to test the robustness of code generation systems. It exploits the usage scenario of code generation systems to make the original programming instruction more concrete by incorporating features known to be contained in the original code. A robust system should maintain code semantics for the concretized instruction, and COCO detects robustness inconsistencies when it does not. We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT, using two widely-used datasets. Our results demonstrate the effectiveness of COCO in testing the robustness of code generation systems, outperforming two techniques adopted from general text-to-text software testing by 466.66% and 104.02%, respectively. Furthermore, concretized instructions generated by COCO can help reduce robustness inconsistencies by 18.35% to 53.91% through fine-tuning.
- Published
- 2023
10. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation
- Author
- Ouyang, Shuyin, Zhang, Jie M., Harman, Mark, and Wang, Meng
- Subjects
- Computer Science - Software Engineering
- Abstract
There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of coding tasks with zero equal test output across different requests is 72.73%, 60.40%, and 65.85% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it indeed brings less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take into account non-determinism in drawing their conclusions.
- Published
- 2023
11. Fairness Improvement with Multiple Protected Attributes: How Far Are We?
- Author
- Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, and Harman, Mark
- Subjects
- Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Software Engineering
- Abstract
Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate., Comment: Accepted by the 46th International Conference on Software Engineering (ICSE 2024). Please include ICSE in any citations
- Published
- 2023
12. Simulation-Driven Automated End-to-End Test and Oracle Inference
- Author
- Tuli, Shreshth, Bojarczuk, Kinga, Gucevska, Natalija, Harman, Mark, Wang, Xiao-Yu, and Wright, Graham
- Subjects
- Computer Science - Software Engineering
- Abstract
This is the first work to report on inferential testing at scale in industry. Specifically, it reports the experience of automated testing of integrity systems at Meta. We built an internal tool called ALPACAS for automated inference of end-to-end integrity tests. Integrity tests are designed to keep users safe online by checking that interventions take place when harmful behaviour occurs on a platform. ALPACAS infers not only the test input, but also the oracle, by observing production interventions to prevent harmful behaviour. This approach allows Meta to automate the process of generating integrity tests for its platforms, such as Facebook and Instagram, which consist of hundreds of millions of lines of production code. We outline the design and deployment of ALPACAS, and report results for its coverage, number of tests produced at each stage of the test inference process, and their pass rates. Specifically, we demonstrate that using ALPACAS significantly improves coverage from a manual test design for the particular aspect of integrity end-to-end testing it was applied to. Further, from a pool of 3 million data points, ALPACAS automatically yields 39 production-ready end-to-end integrity tests. We also report that the ALPACAS-inferred test suite enjoys exceptionally low flakiness for end-to-end testing with its average in-production pass rate of 99.84%., Comment: Accepted in ICSE 2023 (SEIP Track)
- Published
- 2023
13. The German Joyce by Robert K. Weninger (review)
- Author
- Harman, Mark
- Published
- 2014
14. Robert Walser: Writing On The Periphery
- Author
- Harman, Mark
- Published
- 2008
15. Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants
- Author
- Ojdanic, Milos, Papadakis, Mike, and Harman, Mark
- Subjects
- Computer Science - Software Engineering
- Abstract
Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester's tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code change means that a mutant suite's relevance will naturally degrade over time. We measure this degradation in relevance for 143,500 mutants in 4 non-trivial systems finding that, on average, 52% degrade. We introduce a mutant brittleness measure and use it to audit software systems and their mutation suites. We also demonstrate how consistent-by-construction long-standing mutant suites can be identified with a 10x improvement in mutant relevance over an arbitrary test suite. Our results indicate that the research community should avoid the re-computation of mutant suites and focus, instead, on long-standing mutants, thereby improving the consistency and relevance of mutation testing.
- Published
- 2022
16. Fairness Testing: A Comprehensive Survey and Analysis of Trends
- Author
- Chen, Zhenpeng, Zhang, Jie M., Hort, Max, Harman, Mark, and Sarro, Federica
- Subjects
- Computer Science - Software Engineering
- Abstract
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely-adopted datasets and open-source tools for fairness testing., Comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM 2024). Please include TOSEM in any citations
- Published
- 2022
17. Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
- Author
- Hort, Max, Chen, Zhenpeng, Zhang, Jie M., Harman, Mark, and Sarro, Federica
- Subjects
- Computer Science - Machine Learning
- Abstract
This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods., Comment: 52 pages, 7 figures
- Published
- 2022
18. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers
- Author
- Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, and Harman, Mark
- Subjects
- Computer Science - Software Engineering, Computer Science - Artificial Intelligence
- Abstract
Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42% and 66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24% and 59% according to different fairness metrics); (3) the bias mitigation methods even lead to decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best trade-off in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s)., Comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM 2023). Please include TOSEM in any citations
- Published
- 2022
19. "Digging the Pit of Babel": Retranslating Franz Kafka's Castle
- Author
- Harman, Mark
- Published
- 1996
20. Search-based Automatic Repair for Fairness and Accuracy in Decision-making Software
- Author
- Hort, Max, Zhang, Jie M., Sarro, Federica, and Harman, Mark
- Published
- 2024
21. Mutation analysis for evaluating code translation
- Author
- Guizzo, Giovani, Zhang, Jie M., Sarro, Federica, Treude, Christoph, and Harman, Mark
- Published
- 2024
22. Leveraging Automated Unit Tests for Unsupervised Code Translation
- Author
- Roziere, Baptiste, Zhang, Jie M., Charton, Francois, Harman, Mark, Synnaeve, Gabriel, and Lample, Guillaume
- Subjects
- Computer Science - Software Engineering, Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
- Published
- 2021
23. An Empirical Study on Failed Error Propagation in Java Programs with Real Faults
- Author
- Jahangirova, Gunel, Clark, David, Harman, Mark, and Tonella, Paolo
- Subjects
- Computer Science - Software Engineering
- Abstract
During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted. In such a case, the oracle is subject to failed error propagation. We conducted an empirical study to measure failed error propagation on Defects4J, the reference benchmark for Java programs with real faults, considering all 6 projects available (386 real bugs and 459 fixed methods). Our results indicate that the prevalence of failed error propagation is negligible when testing is performed at the unit level. However, when system-level inputs are provided, the prevalence of failed error propagation increases substantially. This indicates that it is enough for method postconditions to predicate only on the externally observable state/data and that intermediate steps should be checked when testing at system level.
- Published
- 2020
24. FrUITeR: A Framework for Evaluating UI Test Reuse
- Author
- Zhao, Yixue, Chen, Justin, Sejfia, Adriana, Laser, Marcelo Schmitt, Zhang, Jie, Sarro, Federica, Harman, Mark, and Medvidovic, Nenad
- Subjects
- Computer Science - Software Engineering
- Abstract
UI testing is tedious and time-consuming due to the manual effort required. Recent research has explored opportunities for reusing existing UI tests from an app to automatically generate new tests for other apps. However, the evaluation of such techniques currently remains manual, unscalable, and unreproducible, which can waste effort and impede progress in this emerging area. We introduce FrUITeR, a framework that automatically evaluates UI test reuse in a reproducible way. We apply FrUITeR to existing test-reuse techniques on a uniform benchmark we established, resulting in 11,917 test reuse cases from 20 apps. We report several key findings aimed at improving UI test reuse that are missed by existing work., Comment: ESEC/FSE 2020
- Published
- 2020
25. Ownership at Large -- Open Problems and Challenges in Ownership Management
- Author
- Ahlgren, John, Berezin, Maria Eugenia, Bojarczuk, Kinga, Dulskyte, Elena, Dvortsova, Inna, George, Johann, Gucevska, Natalija, Harman, Mark, He, Shan, Lämmel, Ralf, Meijer, Erik, Sapora, Silvia, and Spahr-Summers, Justin
- Subjects
- Computer Science - Software Engineering, Computer Science - Information Retrieval, Computer Science - Machine Learning
- Abstract
Software-intensive organizations rely on large numbers of software assets of different types, e.g., source-code files, tables in the data warehouse, and software configurations. Who is the most suitable owner of a given asset changes over time, e.g., due to reorganization and individual function changes. New forms of automation can help suggest more suitable owners for any given asset at a given point in time. By such efforts on ownership health, accountability of ownership is increased. The problem of finding the most suitable owners for an asset is essentially a program comprehension problem: how do we automatically determine who would be best placed to understand, maintain, evolve (and thereby assume ownership of) a given asset. This paper introduces the Facebook Ownesty system, which uses a combination of ultra large scale data mining and machine learning and has been deployed at Facebook as part of the company's ownership management approach. Ownesty processes many millions of software assets (e.g., source-code files) and it takes into account workflow and organizational aspects. The paper sets out open problems and challenges on ownership for the research community with advances expected from the fields of software engineering, programming languages, and machine learning., Comment: Author order is alphabetical. Contact author: Ralf Lämmel (rlaemmel@acm.org). The subject of the paper is covered by the contact author's keynote at the same conference
- Published
- 2020
26. WES: Agent-based User Interaction Simulation on Real Infrastructure
- Author
- Ahlgren, John, Berezin, Maria Eugenia, Bojarczuk, Kinga, Dulskyte, Elena, Dvortsova, Inna, George, Johann, Gucevska, Natalija, Harman, Mark, Lämmel, Ralf, Meijer, Erik, Sapora, Silvia, and Spahr-Summers, Justin
- Subjects
- Computer Science - Software Engineering, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Social and Information Networks
- Abstract
We introduce the Web-Enabled Simulation (WES) research agenda, and describe FACEBOOK's WW system. We describe the application of WW to reliability, integrity and privacy at FACEBOOK, where it is used to simulate social media interactions on an infrastructure consisting of hundreds of millions of lines of code. The WES agenda draws on research from many areas of study, including Search Based Software Engineering, Machine Learning, Programming Languages, Multi Agent Systems, Graph Theory, Game AI, and AI Assisted Game Play. We conclude with a set of open problems and research challenges to motivate wider investigation., Comment: Author order is alphabetical. Correspondence to Mark Harman (markharman@fb.com). This paper appears in GI 2020: 8th International Workshop on Genetic Improvement
- Published
- 2020
27. Inferring test models from user bug reports using multi-objective search
- Author
- Guizzo, Giovani, Califano, Francesco, Sarro, Federica, Ferrucci, Filomena, and Harman, Mark
- Published
- 2023
28. FlakiMe: Laboratory-Controlled Test Flakiness Impact Assessment. A Case Study on Mutation Testing and Program Repair
- Author
- Cordy, Maxime, Rwemalika, Renaud, Papadakis, Mike, and Harman, Mark
- Subjects
- Computer Science - Software Engineering
- Abstract
Much research on software testing makes an implicit assumption that test failures are deterministic such that they always witness the presence of the same defects. However, this assumption is not always true because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. Unfortunately, flaky tests have major implications for testing and test-dependent activities such as mutation testing and automated program repair. To deal with this issue, we introduce a test flakiness assessment and experimentation platform, called FlakiMe, that supports the seeding of a (controllable) degree of flakiness into the behaviour of a given test suite. Thereby, FlakiMe equips researchers with ways to investigate the impact of test flakiness on their techniques under laboratory-controlled conditions. We use FlakiMe to report results and insights from case studies that assess the impact of flakiness on mutation testing and program repair. These results indicate that 5% of flakiness failures is enough to affect the mutation score, but the effect size is modest (2% to 4%), while it completely annihilates the ability of program repair to patch 50% of the subject programs. We also observe that flakiness has case-specific effects, which mainly disrupts the repair of bugs that are covered by many tests. Moreover, we find that a minimal amount of user feedback is sufficient for alleviating the effects of flakiness.
- Published
- 2019
29. Automatic Testing and Improvement of Machine Translation
- Author
- Sun, Zeyu, Zhang, Jie M., Harman, Mark, Papadakis, Mike, and Zhang, Lu
- Subjects
- Computer Science - Software Engineering
- Abstract
This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has a high precision (99%) on generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs. Black-box repair fixes 28% and 19% bugs on average for Google Translate and Transformer. Grey-box repair fixes 30% bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of the cases (worse in 8%).
- Published
- 2019
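The metamorphic core of TransRepair, as the abstract above describes it, is that translating a sentence and a slightly mutated variant of it should yield consistent output. A minimal sketch of that check follows, with a crude word-overlap score standing in for the paper's consistency metrics; all names here are illustrative, not TransRepair's API.

```python
def consistency(translation_a, translation_b):
    """Jaccard word overlap between two translations -- a crude
    stand-in for TransRepair's automatic consistency metrics."""
    a = set(translation_a.lower().split())
    b = set(translation_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def metamorphic_check(translate, sentence, word, replacement, threshold=0.8):
    """Return True when translating a sentence and a one-word variant
    yields sufficiently consistent output; False flags a suspected
    inconsistency bug, with no human oracle needed."""
    original = translate(sentence)
    mutated = translate(sentence.replace(word, replacement))
    return consistency(original, mutated) >= threshold
```

The key design point is that the mutated sentence acts as its own oracle: no reference translation is required, only agreement between the two outputs.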
30. A Survey of Constrained Combinatorial Testing
- Author
-
Wu, Huayao, Nie, Changhai, Petke, Justyna, Jia, Yue, and Harman, Mark
- Subjects
Computer Science - Software Engineering - Abstract
Combinatorial Testing (CT) is a potentially powerful testing technique, but its failure-revealing ability can be dramatically reduced if constraints are not handled adequately and efficiently. To ensure the wider applicability of CT in constrained problem domains, large and diverse efforts have been invested in the techniques and applications of constrained combinatorial testing. In this paper, we provide a comprehensive survey of the representations, influences, and techniques that pertain to constraints in CT, covering 129 papers published between 1987 and 2018. This survey not only categorises the various constraint handling techniques, but also reviews the comparatively less well-studied, yet potentially important, constraint identification and maintenance techniques. Since real-world programs are usually constrained, this survey should interest researchers and practitioners looking to use and study constrained combinatorial testing techniques.
- Published
- 2019
31. Machine Learning Testing: Survey, Landscapes and Horizons
- Author
-
Zhang, Jie M., Harman, Mark, Ma, Lei, and Liu, Yang
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Software Engineering, Statistics - Machine Learning - Abstract
This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving and machine translation). The paper also analyses the datasets used, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
- Published
- 2019
32. Sub-Turing Islands in the Wild
- Author
-
Barr, Earl T., Binkley, David W., Harman, Mark, and Seghir, Mohamed Nassim
- Subjects
Computer Science - Programming Languages - Abstract
Recently, there has been growing debate as to whether or not static analysis can be truly sound. In spite of this concern, research on techniques seeking to at least partially answer undecidable questions has a long history. However, little attention has been given to the more empirical question of how often an exact answer might be given to a question despite that question being, in theory, undecidable. This paper investigates the issue by exploring sub-Turing islands -- regions of code for which a question of interest is decidable. We define such islands and then consider how to identify them. We implemented Cook, a prototype for finding sub-Turing islands, and applied it to a corpus of 1100 Android applications containing over 2 million methods. Results reveal that 55% of all methods are sub-Turing. Our results also provide empirical, scientific evidence for the scalability of sub-Turing island identification. Sub-Turing identification has many downstream applications, because islands are so amenable to static analysis. We illustrate two downstream uses of the analysis. In the first, we found that over 37% of the verification conditions associated with runtime exceptions fall within sub-Turing islands and are thus statically decidable. A second use of our analysis is during code review, where it provides guidance to developers. The sub-Turing islands in our study turn out to contain significantly fewer bugs than `the swamp' (non-sub-Turing methods). The greater bug density in the swamp is unsurprising; the fact that bugs remain prevalent in islands is, however, surprising: these are bugs whose repair can be fully automated.
- Published
- 2019
33. Model Validation Using Mutated Training Labels: An Exploratory Study
- Author
-
Zhang, Jie M., Harman, Mark, Guedj, Benjamin, Barr, Earl T., and Shawe-Taylor, John
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning - Abstract
We introduce an exploratory study on Mutation Validation (MV), a model validation method using mutated training labels for supervised learning. MV mutates training data labels, retrains the model against the mutated data, then uses the metamorphic relation that captures the consequent training performance changes to assess model fit. It does not use a validation set or test set. The intuition underpinning MV is that overfitting models tend to fit noise in the training data. We explore 8 different learning algorithms, 18 datasets, and 5 types of hyperparameter tuning tasks. Our results demonstrate that MV is accurate in model selection: the model recommendation hit rate is 92% for MV and less than 60% for out-of-sample validation. MV also provides more stable hyperparameter tuning results than out-of-sample validation across different runs.
- Published
- 2019
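The MV idea in the abstract above, mutate labels, retrain, and watch how training performance responds, can be sketched in a few lines. Everything below is illustrative: a toy 1-D nearest-centroid learner and a memorising learner stand in for the paper's eight algorithms, and the drop-in-training-accuracy score is a simplification of MV's actual metamorphic relation.

```python
import random

def mutate_labels(labels, rate, rng):
    """Flip a fraction `rate` of binary training labels."""
    mutated = list(labels)
    for i in rng.sample(range(len(mutated)), int(rate * len(mutated))):
        mutated[i] = 1 - mutated[i]
    return mutated

def training_accuracy(fit, xs, ys):
    """Fit a model on (xs, ys) and score it on the same data."""
    predict = fit(xs, ys)
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

def mv_drop(fit, xs, ys, rate=0.2, seed=0):
    """Drop in training accuracy after label mutation. A learner that
    fits genuine structure loses roughly `rate`; a memoriser that fits
    noise loses nothing, which signals overfitting."""
    rng = random.Random(seed)
    clean = training_accuracy(fit, xs, ys)
    noisy = training_accuracy(fit, xs, mutate_labels(ys, rate, rng))
    return clean - noisy

def centroid_fit(xs, ys):
    """Toy 1-D nearest-centroid classifier."""
    c0 = [x for x, y in zip(xs, ys) if y == 0]
    c1 = [x for x, y in zip(xs, ys) if y == 1]
    m0, m1 = sum(c0) / len(c0), sum(c1) / len(c1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def memoriser_fit(xs, ys):
    """Degenerate learner that memorises every training label."""
    table = dict(zip(xs, ys))
    return lambda x: table[x]
```

On well-separated data, `mv_drop` is positive for the centroid learner but exactly zero for the memoriser, which fits the mutated noise perfectly; no validation or test set is consulted, matching MV's premise.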
34. Model validation using mutated training labels: An exploratory study
- Author
-
Zhang, Jie M., Harman, Mark, Guedj, Benjamin, Barr, Earl T., and Shawe-Taylor, John
- Published
- 2023
- Full Text
- View/download PDF
35. Fairness Improvement with Multiple Protected Attributes: How Far Are We?
- Author
-
Chen, Zhenpeng, primary, Zhang, Jie M., additional, Sarro, Federica, additional, and Harman, Mark, additional
- Published
- 2024
- Full Text
- View/download PDF
36. Selected Stories
- Author
-
Kafka, Franz, and Harman, Mark, Translated and edited by
- Published
- 2024
37. Indexing Operators to Extend the Reach of Symbolic Execution
- Author
-
Barr, Earl T., Clark, David, Harman, Mark, and Marginean, Alexandru
- Subjects
Computer Science - Software Engineering - Abstract
Traditional program analysis analyses a programming language, that is, all programs that can be written in the language. There is a difference, however, between all possible programs that could be written and the corpus of actual programs written in a language. We seek to exploit this difference: for a given program, we apply a bespoke program transformation, Indexify, to convert expressions that current SMT solvers do not, in general, handle, such as constraints on strings, into equisatisfiable expressions that they do handle. To this end, Indexify replaces operators in hard-to-handle expressions with homomorphic versions that behave the same on a finite subset of the domain of the original operator, and return bottom, denoting unknown, outside that subset. By focusing on the literals and expressions most useful for analysing a given program, Indexify constructs a small, finite theory that extends the power of a solver on the expressions a target program builds. Indexify's bespoke nature necessarily means that its evaluation must be experimental, resting on a demonstration of its effectiveness in practice. We have developed a tool for Indexify and demonstrate its utility and effectiveness by applying it to two real-world benchmarks --- string expressions in coreutils and floats in fdlibm53. Indexify reduces time-to-completion on coreutils from Klee's average of 49.5m to 6.0m. It increases branch coverage on coreutils from 30.10% for Klee and 14.79% for Zesti to 66.83%. When indexifying floats in fdlibm53, Indexify increases branch coverage from 34.45% to 71.56% over Klee. For a restricted class of inputs, Indexify permits the symbolic execution of program paths unreachable with previous techniques: it covers more than twice as many branches in coreutils as Klee.
- Published
- 2018
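The core move the Indexify abstract above describes, replacing an operator with a version that agrees with it on a finite sub-domain and returns bottom elsewhere, can be sketched directly. The names below are illustrative: the paper's transformation operates on symbolic-execution constraints fed to an SMT solver, not on concrete Python values.

```python
BOTTOM = object()  # "unknown": the value outside the indexed sub-domain

def indexify(op, domain):
    """Return a partial version of `op` that matches it whenever all
    arguments fall inside the finite `domain`, and yields BOTTOM
    otherwise, so reasoning stays within a small, finite theory."""
    def indexed(*args):
        if all(arg in domain for arg in args):
            return op(*args)
        return BOTTOM
    return indexed
```

Indexing string concatenation over just the literals a target program actually manipulates, for instance, keeps the resulting string constraints finite and therefore solvable.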
38. A Study of Bug Resolution Characteristics in Popular Programming Languages
- Author
-
Zhang, Jie M., Li, Feng, Hao, Dan, Wang, Meng, Tang, Hao, Zhang, Lu, and Harman, Mark
- Subjects
Computer Science - Software Engineering - Abstract
This paper presents a large-scale study that investigates bug resolution characteristics among popular GitHub projects written in different programming languages. We explore correlations but, of course, cannot infer causation. Specifically, we analyse bug resolution data from approximately 70 million source lines of code, drawn from 3 million commits to 600 GitHub projects written primarily in 10 programming languages. We find notable variations in apparent bug resolution time and patch (fix) size. While interpreting results from such large-scale empirical studies is inherently difficult, we believe the differences in medians are sufficiently large to warrant further investigation, replication, re-analysis, and follow-up research. For example, in our corpus, the median apparent bug resolution time (elapsed time from raise to resolve) for Ruby was 4X that for Go and 2.5X that for Java. We also found that patches tend to touch more files in the corpora of strongly typed and statically typed programs. However, we also found evidence of lower elapsed resolution time for bugs in projects written in statically typed languages. These findings, if replicated in follow-on studies, may shed further empirical light on the debate about the importance of static typing.
- Published
- 2018
39. Multi-objective software performance optimisation at the architecture level using randomised search rules
- Author
-
Ni, Youcong, Du, Xin, Ye, Peng, Minku, Leandro L., Yao, Xin, Harman, Mark, and Xiao, Ruliang
- Published
- 2021
- Full Text
- View/download PDF
40. Kafka and the Muirs
- Author
-
Harman, Mark
- Subjects
Literature/writing - Abstract
In his characteristically incisive commentary on Willa and Edwin Muir (August 11), Ritchie Robertson chides me for being unduly critical of the Muirs' Kafka translations. In fact I greatly admire [...]
- Published
- 2023
41. Fairness Testing: A Comprehensive Survey and Analysis of Trends.
- Author
-
Chen, Zhenpeng, Zhang, Jie M., Hort, Max, Harman, Mark, and Sarro, Federica
- Subjects
TREND analysis, FAIRNESS, SOFTWARE engineers, MACHINE learning - Abstract
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this article offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely adopted datasets and open-source tools for fairness testing.
- Published
- 2024
- Full Text
- View/download PDF
42. Mutation analysis for evaluating code translation
- Author
-
Guizzo, Giovani, primary, Zhang, Jie M., additional, Sarro, Federica, additional, Treude, Christoph, additional, and Harman, Mark, additional
- Published
- 2023
- Full Text
- View/download PDF
43. Deploying Search Based Software Engineering with Sapienz at Facebook
- Author
-
Alshahwan, Nadia, Gao, Xinbo, Harman, Mark, Jia, Yue, Mao, Ke, Mols, Alexander, Tei, Taijin, Zorin, Ilya, Colanzi, Thelma Elita, editor, and McMinn, Phil, editor
- Published
- 2018
- Full Text
- View/download PDF
44. We Need a Testability Transformation Semantics
- Author
-
Harman, Mark, Johnsen, Einar Broch, editor, and Schaefer, Ina, editor
- Published
- 2018
- Full Text
- View/download PDF
45. “A great stress among students” - mental health nurses' views of medication education: A qualitative descriptive study
- Author
-
Goodwin, John, Kilty, Caroline, Harman, Mark, and Horgan, Aine
- Published
- 2019
- Full Text
- View/download PDF
46. Who Judges the Judge: An Empirical Study on Online Judge Tests
- Author
-
Liu, Kaibo, primary, Han, Yudong, additional, Zhang, Jie M., additional, Chen, Zhenpeng, additional, Sarro, Federica, additional, Harman, Mark, additional, Huang, Gang, additional, and Ma, Yun, additional
- Published
- 2023
- Full Text
- View/download PDF
47. Large Language Models for Software Engineering: Survey and Open Problems
- Author
-
Fan, Angela, primary, Gokkaya, Beliz, additional, Harman, Mark, additional, Lyubarskiy, Mitya, additional, Sengupta, Shubho, additional, Yoo, Shin, additional, and Zhang, Jie M., additional
- Published
- 2023
- Full Text
- View/download PDF
48. Software Testing Research Challenges: An Industrial Perspective
- Author
-
Alshahwan, Nadia, primary, Harman, Mark, additional, and Marginean, Alexandru, additional
- Published
- 2023
- Full Text
- View/download PDF
49. API-Constrained Genetic Improvement
- Author
-
Langdon, William B., White, David R., Harman, Mark, Jia, Yue, Petke, Justyna, Sarro, Federica, editor, and Deb, Kalyanmoy, editor
- Published
- 2016
- Full Text
- View/download PDF
50. HOMI: Searching Higher Order Mutants for Software Improvement
- Author
-
Wu, Fan, Harman, Mark, Jia, Yue, Krinke, Jens, Sarro, Federica, editor, and Deb, Kalyanmoy, editor
- Published
- 2016
- Full Text
- View/download PDF