Author: "Zhang, Jie M." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Jie M."' showing total 98 results

Start Over Author "Zhang, Jie M."

98 results on '"Zhang, Jie M."'

1. Benchmarking Bias in Large Language Models during Role-Playing

Author: Li, Xinyue, Chen, Zhenpeng, Zhang, Jie M., Lou, Yiling, Li, Tianlin, Sun, Weisong, Liu, Yang, and Liu, Xuanzhe
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
Published: 2024

2. Personality-Guided Code Generation Using Large Language Models

Author: Guo, Yaoqi, Chen, Zhenpeng, Zhang, Jie M., Liu, Yang, and Ma, Yun
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.
Published: 2024

3. Using Protected Attributes to Consider Fairness in Multi-Agent Systems

Author: La Malfa, Gabriele, Zhang, Jie M., Luck, Michael, and Black, Elizabeth
Subjects: Computer Science - Multiagent Systems, Computer Science - Artificial Intelligence
Abstract: Fairness in Multi-Agent Systems (MAS) has been extensively studied, particularly in reward distribution among agents in scenarios such as goods allocation, resource division, lotteries, and bargaining systems. Fairness in MAS depends on various factors, including the system's governing rules, the behaviour of the agents, and their characteristics. Yet, fairness in human society often involves evaluating disparities between disadvantaged and privileged groups, guided by principles of Equality, Diversity, and Inclusion (EDI). Taking inspiration from the work on algorithmic fairness, which addresses bias in machine learning-based decision-making, we define protected attributes for MAS as characteristics that should not disadvantage an agent in terms of its expected rewards. We adapt fairness metrics from the algorithmic fairness literature -- namely, demographic parity, counterfactual fairness, and conditional statistical parity -- to the multi-agent setting, where self-interested agents interact within an environment. These metrics allow us to evaluate the fairness of MAS, with the ultimate aim of designing MAS that do not disadvantage agents based on protected attributes.
Published: 2024

4. Effi-Code: Unleashing Code Efficiency in Language Models

Author: Huang, Dong, Zeng, Guangtao, Dai, Jianbo, Luo, Meng, Weng, Han, Qing, Yuhao, Cui, Heming, Guo, Zhijiang, and Zhang, Jie M.
Subjects: Computer Science - Computation and Language, Computer Science - Software Engineering
Abstract: As the use of large language models (LLMs) for code generation becomes more prevalent in software development, it is critical to enhance both the efficiency and correctness of the generated code. Existing methods and models primarily focus on the correctness of LLM-generated code, ignoring efficiency. In this work, we present Effi-Code, an approach to enhancing code generation in LLMs that can improve both efficiency and correctness. We introduce a Self-Optimization process based on Overhead Profiling that leverages open-source LLMs to generate a high-quality dataset of correct and efficient code samples. This dataset is then used to fine-tune various LLMs. Our method involves the iterative refinement of generated code, guided by runtime performance metrics and correctness checks. Extensive experiments demonstrate that models fine-tuned on the Effi-Code show significant improvements in both code correctness and efficiency across task types. For example, the pass@1 of DeepSeek-Coder-6.7B-Instruct generated code increases from \textbf{43.3\%} to \textbf{76.8\%}, and the average execution time for the same correct tasks decreases by \textbf{30.5\%}. Effi-Code offers a scalable and generalizable approach to improving code generation in AI systems, with potential applications in software development, algorithm design, and computational problem-solving. The source code of Effi-Code was released in \url{https://github.com/huangd1999/Effi-Code}., Comment: Under Review
Published: 2024

5. Rethinking the Influence of Source Code on Test Case Generation

Author: Huang, Dong, Zhang, Jie M., Du, Mingzhe, Harman, Mark, and Cui, Heming
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs resilience against incorrect code in generating reliable and bug-revealing tests., Comment: 23 pages
Published: 2024

6. An Exploratory Study on Using Large Language Models for Mutation Testing

Author: Wang, Bo, Chen, Mingda, Lin, Youfang, Papadakis, Mike, and Zhang, Jie M.
Subjects: Computer Science - Software Engineering, D.2.5
Abstract: Mutation testing is a foundation approach in the software testing field, based on automatically seeded small syntactic changes, known as mutations. The question of how to generate high-utility mutations, to be used for testing purposes, forms a key challenge in mutation testing literature. Large Language Models (LLMs) have shown great potential in code-related tasks but their utility in mutation testing remains unexplored. To this end, we systematically investigate the performance of LLMs in generating effective mutations w.r.t. to their usability, fault detection potential, and relationship with real bugs. In particular, we perform a large-scale empirical study involving six LLMs, including both state-of-the-art open- and closed-source models, and 851 real bugs on two Java benchmarks (i.e., 605 bugs from 12 projects of Defects4J 2.0 and 246 bugs of ConDefects). We find that compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs, which leads to approximately 19% higher fault detection than current approaches (i.e., 93% vs. 74%). Nevertheless, the mutants generated by LLMs have worse compilability rate, useless mutation rate, and equivalent mutation rate than those generated by rule-based approaches. This paper also examines alternative prompt engineering strategies and identifies the root causes of uncompilable mutations, providing insights for researchers to further enhance the performance of LLMs in mutation testing., Comment: 23 pages, 3 figures
Published: 2024

7. EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization

Author: Huang, Dong, Dai, Jianbo, Weng, Han, Wu, Puzhen, Qing, Yuhao, Cui, Heming, Guo, Zhijiang, and Zhang, Jie M.
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: Large language models (LLMs) have shown remarkable progress in code generation, but their generated code often suffers from inefficiency, resulting in longer execution times and higher memory consumption. To address this issue, we propose \textbf{EffiLearner}, a self-optimization framework that utilizes execution overhead profiles to improve the efficiency of LLM-generated code. EffiLearner first generates code using an LLM, then executes it locally to capture execution time and memory usage profiles. These profiles are fed back to the LLM, which then revises the code to reduce overhead. To evaluate the effectiveness of EffiLearner, we conduct extensive experiments on the EffiBench, HumanEval, and MBPP with 16 open-source and 6 closed-source models. Our evaluation results demonstrate that through iterative self-optimization, EffiLearner significantly enhances the efficiency of LLM-generated code. For example, the execution time (ET) of StarCoder2-15B for the EffiBench decreases from 0.93 (s) to 0.12 (s) which reduces 87.1% the execution time requirement compared with the initial code. The total memory usage (TMU) of StarCoder2-15B also decreases from 22.02 (Mb*s) to 2.03 (Mb*s), which decreases 90.8% of total memory consumption during the execution process. The source code of EffiLearner was released in \url{https://github.com/huangd1999/EffiLearner}., Comment: Accepted by NeurIPS 2024
Published: 2024

8. LLM-Powered Test Case Generation for Detecting Tricky Bugs

Author: Liu, Kaibo, Liu, Yiyang, Chen, Zhenpeng, Zhang, Jie M., Han, Yudong, Ma, Yun, Li, Ge, and Huang, Gang
Subjects: Computer Science - Software Engineering, Computer Science - Machine Learning
Abstract: Conventional automated test generation tools struggle to generate test oracles and tricky bug-revealing test inputs. Large Language Models (LLMs) can be prompted to produce test inputs and oracles for a program directly, but the precision of the tests can be very low for complex scenarios (only 6.3% based on our experiments). To fill this gap, this paper proposes AID, which combines LLMs with differential testing to generate fault-revealing test inputs and oracles targeting plausibly correct programs (i.e., programs that have passed all the existing tests). In particular, AID selects test inputs that yield diverse outputs on a set of program variants generated by LLMs, then constructs the test oracle based on the outputs. We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus, and compare it with three state-of-the-art baselines. The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
Published: 2024

9. Research Artifacts in Software Engineering Publications: Status and Trends

Author: Liu, Mugeng, Huang, Xiaolong, He, Wei, Xie, Yibing, Zhang, Jie M., Jing, Xiang, Chen, Zhenpeng, and Ma, Yun
Subjects: Computer Science - Software Engineering
Abstract: The Software Engineering (SE) community has been embracing the open science policy and encouraging researchers to disclose artifacts in their publications. However, the status and trends of artifact practice and quality remain unclear, lacking insights on further improvement. In this paper, we present an empirical study to characterize the research artifacts in SE publications. Specifically, we manually collect 1,487 artifacts from all 2,196 papers published in top-tier SE conferences (ASE, FSE, ICSE, and ISSTA) from 2017 to 2022. We investigate the common practices (e.g., URL location and format, storage websites), maintenance activities (e.g., last update time and URL validity), popularity (e.g., the number of stars on GitHub and characteristics), and quality (e.g., documentation and code smell) of these artifacts. Based on our analysis, we reveal a rise in publications providing artifacts. The usage of Zenodo for sharing artifacts has significantly increased. However, artifacts stored in GitHub tend to receive few stars, indicating a limited influence on real-world SE applications. We summarize the results and provide suggestions to different stakeholders in conjunction with current guidelines., Comment: Accepted by Journal of Systems and Software (JSS 2024). Please include JSS in any citations
Published: 2024

10. EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Author: Huang, Dong, Qing, Yuhao, Shang, Weiyi, Cui, Heming, and Zhang, Jie M.
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect that plays a pivotal role in green computing and sustainability efforts has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4 generated code has an average \textbf{3.12} times execution time that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 generated code are \textbf{13.89} and \textbf{43.92} times that of the canonical solutions. The source code of EffiBench is released on https://github.com/huangd1999/EffiBench. We also provide the LeaderBoard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard., Comment: Camera Ready for NeurIPS 2024
Published: 2024

11. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Author: Huang, Dong, Zhang, Jie M., Luck, Michael, Bu, Qingwen, Qing, Yuhao, and Cui, Heming
Subjects: Computer Science - Computation and Language
Abstract: The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3\% and 91.8\% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2\% and 78.9\% pass@1 with an overall token overhead of 138.2K and 206.5K., Comment: 24 pages, 12 figures
Published: 2023

12. ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair

Author: Wu, Yonghao, Li, Zheng, Zhang, Jie M., and Liu, Yong
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: With the growing interest on Large Language Models (LLMs) for fault localization and program repair, ensuring the integrity and generalizability of the LLM-based methods becomes paramount. The code in existing widely-adopted benchmarks for these tasks was written before the the bloom of LLMs and may be included in the training data of existing popular LLMs, thereby suffering from the threat of data leakage, leading to misleadingly optimistic performance metrics. To address this issue, we introduce "ConDefects", a novel dataset of real faults meticulously curated to eliminate such overlap. ConDefects contains 1,254 Java faulty programs and 1,625 Python faulty programs. All these programs are sourced from the online competition platform AtCoder and were produced between October 2021 and September 2023. We pair each fault with fault locations and the corresponding repaired code versions, making it tailored for in fault localization and program repair related research. We also provide interfaces for selecting subsets based on different time windows and coding task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted for benchmarking ALL types of fault localization and program repair methods. The dataset is publicly available, and a demo video can be found at https://www.youtube.com/watch?v=22j15Hj5ONk., Comment: 5pages, 3 figures
Published: 2023

13. Large Language Models for Software Engineering: Survey and Open Problems

Author: Fan, Angela, Gokkaya, Beliz, Harman, Mark, Lyubarskiy, Mitya, Sengupta, Shubho, Yoo, Shin, and Zhang, Jie M.
Subjects: Computer Science - Software Engineering
Abstract: This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.
Published: 2023

14. Large Language Models in Fault Localisation

Author: Wu, Yonghao, Li, Zheng, Zhang, Jie M., Papadakis, Mike, Harman, Mark, and Liu, Yong
Subjects: Computer Science - Software Engineering
Abstract: Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, program repair, code summarisation, and test generation. Fault localisation is instrumental in enabling automated debugging and repair of programs and was prominently featured as a highlight during the launch event of ChatGPT-4. Nevertheless, the performance of LLMs compared to state-of-the-art methods, as well as the impact of prompt design and context length on their efficacy, remains unclear. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted large-scale Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the consistency of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness. Our findings demonstrate that within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and consistency, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL on the Defects4J dataset in terms of TOP-1 metric. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop, with 49.9% lower accuracy than SmartFL under TOP-1 metric. These observations indicate that although ChatGPT can effectively localise faults under specific conditions, limitations are evident. Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
Published: 2023

15. COCO: Testing Code Generation Systems via Concretized Instructions

Author: Yan, Ming, Chen, Junjie, Zhang, Jie M., Cao, Xuejie, Yang, Chen, and Harman, Mark
Subjects: Computer Science - Software Engineering
Abstract: Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly different instructions can result in significantly different code semantics. Robustness is critical for code generation systems, as it can have significant impacts on software development, software quality, and trust in the generated code. Although existing testing techniques for general text-to-text software can detect some robustness issues, they are limited in effectiveness due to ignoring the characteristics of code generation systems. In this work, we propose a novel technique COCO to test the robustness of code generation systems. It exploits the usage scenario of code generation systems to make the original programming instruction more concrete by incorporating features known to be contained in the original code. A robust system should maintain code semantics for the concretized instruction, and COCO detects robustness inconsistencies when it does not. We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT, using two widely-used datasets. Our results demonstrate the effectiveness of COCO in testing the robustness of code generation systems, outperforming two techniques adopted from general text-to-text software testing by 466.66% and 104.02%, respectively. Furthermore, concretized instructions generated by COCO can help reduce robustness inconsistencies by 18.35% to 53.91% through fine-tuning.
Published: 2023

16. An Empirical Study of the Non-determinism of ChatGPT in Code Generation

Author: Ouyang, Shuyin, Zhang, Jie M., Harman, Mark, and Wang, Meng
Subjects: Computer Science - Software Engineering
Abstract: There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of coding tasks with zero equal test output across different requests is 75.76%, 51.00%, and 47.56% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it indeed brings less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take into account non-determinism in drawing their conclusions.
Published: 2023
Full Text: View/download PDF

17. Bias Behind the Wheel: Fairness Testing of Autonomous Driving Systems

Author: Li, Xinyue, Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, Zhang, Ying, and Liu, Xuanzhe
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Software Engineering
Abstract: This paper conducts fairness testing of automated pedestrian detection, a crucial but under-explored issue in autonomous driving systems. We evaluate eight state-of-the-art deep learning-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable thorough fairness testing, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues, particularly related to age. The proportion of undetected children is 20.14% higher compared to adults. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that pedestrian detectors demonstrate significant gender biases during night time, potentially exacerbating the prevalent societal issue of female safety concerns during nighttime out. Moreover, we observe that pedestrian detectors can demonstrate both enhanced fairness and superior performance under specific driving conditions, which challenges the fairness-performance trade-off theory widely acknowledged in the fairness literature. We publicly release the code, data, and results to support future research on fairness in autonomous driving., Comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)
Published: 2023

18. Fairness Improvement with Multiple Protected Attributes: How Far Are We?

Author: Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, and Harman, Mark
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Software Engineering
Abstract: Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate., Comment: Accepted by the 46th International Conference on Software Engineering (ICSE 2024). Please include ICSE in any citations
Published: 2023

19. Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning

Author: Wen, Xin-Cheng, Chen, Yupan, Gao, Cuiyun, Zhang, Hongyu, Zhang, Jie M., and Liao, Qing
Subjects: Computer Science - Software Engineering
Abstract: Prior studies have demonstrated the effectiveness of Deep Learning (DL) in automated software vulnerability detection. Graph Neural Networks (GNNs) have proven effective in learning the graph representations of source code and are commonly adopted by existing DL-based vulnerability detection methods. However, the existing methods are still limited by the fact that GNNs are essentially difficult to handle the connections between long-distance nodes in a code structure graph. Besides, they do not well exploit the multiple types of edges in a code structure graph (such as edges representing data flow and control flow). Consequently, despite achieving state-of-the-art performance, the existing GNN-based methods tend to fail to capture global information (i.e., long-range dependencies among nodes) of code graphs. To mitigate these issues, in this paper, we propose a novel vulnerability detection framework with grAph siMplification and enhanced graph rePresentation LEarning, named AMPLE. AMPLE mainly contains two parts: 1) graph simplification, which aims at reducing the distances between nodes by shrinking the node sizes of code structure graphs; 2) enhanced graph representation learning, which involves one edge-aware graph convolutional network module for fusing heterogeneous edge information into node representations and one kernel-scaled representation module for well capturing the relations between distant graph nodes. Experiments on three public benchmark datasets show that AMPLE outperforms the state-of-the-art methods by 0.39%-35.32% and 7.64%-199.81% with respect to the accuracy and F1 score metrics, respectively. The results demonstrate the effectiveness of AMPLE in learning global information of code graphs for vulnerability detection., Comment: 13 pages, 8 figures, Accepted for publication in the ICSE 23 Technical Track
Published: 2023

20. Stealthy Backdoor Attack for Code Models

Author: Yang, Zhou, Xu, Bowen, Zhang, Jie M., Kang, Hong Jin, Shi, Jieke, He, Junda, and Lo, David
Subjects: Computer Science - Cryptography and Security, Computer Science - Software Engineering
Abstract: Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code and play a vital role in supporting downstream automated software engineering tasks. Most recently, code models were revealed to be vulnerable to backdoor attacks. A code model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs on examples injected with triggers that activate the backdoors. Existing backdoor attacks on code models use unstealthy and easy-to-detect triggers. This paper aims to investigate the vulnerability of code models with stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging adversarial perturbations to inject adaptive triggers into different inputs. We evaluate AFRAIDOOR on three widely adopted code models (CodeBERT, PLBART and CodeT5) and two downstream tasks (code summarization and method name prediction). We find that around 85% of adaptive triggers in AFRAIDOOR bypass the detection in the defense process. By contrast, only less than 12% of the triggers from previous work bypass the defense. When the defense method is not applied, both AFRAIDOOR and baselines have almost perfect attack success rates. However, once a defense is applied, the success rates of baselines decrease dramatically to 10.47% and 12.06%, while the success rate of AFRAIDOOR are 77.05% and 92.98% on the two tasks. Our finding exposes security weaknesses in code models under stealthy backdoor attacks and shows that the state-of-the-art defense method cannot provide sufficient protection. We call for more research efforts in understanding security threats to code models and developing more effective countermeasures., Comment: 18 pages, Under review of IEEE Transactions on Software Engineering
Published: 2023

21. Fairness Testing: A Comprehensive Survey and Analysis of Trends

Author: Chen, Zhenpeng, Zhang, Jie M., Hort, Max, Harman, Mark, and Sarro, Federica
Subjects: Computer Science - Software Engineering
Abstract: Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely-adopted datasets and open-source tools for fairness testing., Comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM 2024). Please include TOSEM in any citations
Published: 2022

22. Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey

Author: Hort, Max, Chen, Zhenpeng, Zhang, Jie M., Harman, Mark, and Sarro, Federica
Subjects: Computer Science - Machine Learning
Abstract: This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods., Comment: 52 pages, 7 figures
Published: 2022

23. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers

Author: Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, and Harman, Mark
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42%~66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24%~59% according to different fairness metrics); (3) the bias mitigation methods even lead to decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best trade-off in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s)., Comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM 2023). Please include TOSEM in any citations
Published: 2022

24. Leveraging Automated Unit Tests for Unsupervised Code Translation

Author: Roziere, Baptiste, Zhang, Jie M., Charton, Francois, Harman, Mark, Synnaeve, Gabriel, and Lample, Guillaume
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
Published: 2021

25. Search-based Automatic Repair for Fairness and Accuracy in Decision-making Software

Author: Hort, Max, Zhang, Jie M., Sarro, Federica, and Harman, Mark
Published: 2024
Full Text: View/download PDF

26. Mutation analysis for evaluating code translation

Author: Guizzo, Giovani, Zhang, Jie M., Sarro, Federica, Treude, Christoph, and Harman, Mark
Published: 2024
Full Text: View/download PDF

27. Automatic Testing and Improvement of Machine Translation

Author: Sun, Zeyu, Zhang, Jie M., Harman, Mark, Papadakis, Mike, and Zhang, Lu
Subjects: Computer Science - Software Engineering
Abstract: This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has a high precision (99%) on generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs. Black-box repair fixes 28% and 19% bugs on average for Google Translate and Transformer. Grey-box repair fixes 30% bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of the cases (worse in 8%).
Published: 2019

28. Machine Learning Testing: Survey, Landscapes and Horizons

Author: Zhang, Jie M., Harman, Mark, Ma, Lei, and Liu, Yang
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Software Engineering, Statistics - Machine Learning
Abstract: This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
Published: 2019

29. Model Validation Using Mutated Training Labels: An Exploratory Study

Author: Zhang, Jie M., Harman, Mark, Guedj, Benjamin, Barr, Earl T., and Shawe-Taylor, John
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We introduce an exploratory study on Mutation Validation (MV), a model validation method using mutated training labels for supervised learning. MV mutates training data labels, retrains the model against the mutated data, then uses the metamorphic relation that captures the consequent training performance changes to assess model fit. It does not use a validation set or test set. The intuition underpinning MV is that overfitting models tend to fit noise in the training data. We explore 8 different learning algorithms, 18 datasets, and 5 types of hyperparameter tuning tasks. Our results demonstrate that MV is accurate in model selection: the model recommendation hit rate is 92\% for MV and less than 60\% for out-of-sample-validation. MV also provides more stable hyperparameter tuning results than out-of-sample-validation across different runs.
Published: 2019

30. Model validation using mutated training labels: An exploratory study

Author: Zhang, Jie M., Harman, Mark, Guedj, Benjamin, Barr, Earl T., and Shawe-Taylor, John
Published: 2023
Full Text: View/download PDF

31. A Study of Bug Resolution Characteristics in Popular Programming Languages

Author: Zhang, Jie M., Li, Feng, Hao, Dan, Wang, Meng, Tang, Hao, Zhang, Lu, and Harman, Mark
Subjects: Computer Science - Software Engineering
Abstract: This paper presents a large-scale study that investigates the bug resolution characteristics among popular Github projects written in different programming languages. We explore correlations but, of course, we cannot infer causation. Specifically, we analyse bug resolution data from approximately 70 million Source Line of Code, drawn from 3 million commits to 600 GitHub projects, primarily written in 10 programming languages. We find notable variations in apparent bug resolution time and patch (fix) size. While interpretation of results from such large-scale empirical studies is inherently difficult, we believe that the differences in medians are sufficiently large to warrant further investigation, replication, re-analysis and follow up research. For example, in our corpus, the median apparent bug resolution time (elapsed time from raise to resolve) for Ruby was 4X that for Go and 2.5X for Java. We also found that patches tend to touch more files for the corpus of strongly typed and for statically typed programs. However, we also found evidence for a lower elapsed resolution time for bug resolution committed to projects constructed from statically typed languages. These findings, if replicated in subsequent follow on studies, may shed further empirical light on the debate about the importance of static typing.
Published: 2018

32. Fairness Improvement with Multiple Protected Attributes: How Far Are We?

Author: Chen, Zhenpeng, primary, Zhang, Jie M., additional, Sarro, Federica, additional, and Harman, Mark, additional
Published: 2024
Full Text: View/download PDF

33. SOAP: Enhancing Efficiency of Generated Code via Self-Optimization

Author: Huang, Dong, Dai, Jianbo, Weng, Han, Wu, Puzhen, Qing, Yuhao, Zhang, Jie M., Cui, Heming, Guo, Zhijiang, Huang, Dong, Dai, Jianbo, Weng, Han, Wu, Puzhen, Qing, Yuhao, Zhang, Jie M., Cui, Heming, and Guo, Zhijiang
Abstract: Large language models (LLMs) have shown remarkable progress in code generation, but their generated code often suffers from inefficiency, resulting in longer execution times and higher memory consumption. To address this issue, we propose Self Optimization based on OverheAd Profile (SOAP), a self-optimization framework that utilizes execution overhead profiles to improve the efficiency of LLM-generated code. SOAP first generates code using an LLM, then executes it locally to capture execution time and memory usage profiles. These profiles are fed back to the LLM, which then revises the code to reduce overhead. To evaluate the effectiveness of SOAP, we conduct extensive experiments on the EffiBench, HumanEval, and MBPP with 16 open-source and 6 closed-source models. Our evaluation results demonstrate that through iterative self-optimization, SOAP significantly enhances the efficiency of LLM-generated code. For example, the execution time (ET) of StarCoder2-15B for the EffiBench decreases from 0.93 (s) to 0.12 (s) which reduces 87.1% execution time requirement compared with the initial code. The total memory usage (TMU) of StarCoder2-15B also decreases from 22.02 (Mb*s) to 2.03 (Mb*s), which decreases 90.8% total memory consumption during the execution process. The source code of SOAP was released in https://github.com/huangd1999/SOAP., Comment: 31 pages, 18 figures, and 8 tables
Published: 2024

34. Fairness Testing: A Comprehensive Survey and Analysis of Trends.

Author: Chen, Zhenpeng, Zhang, Jie M., Hort, Max, Harman, Mark, and Sarro, Federica
Subjects: TREND analysis, FAIRNESS, SOFTWARE engineers, MACHINE learning
Abstract: Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this article offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely adopted datasets and open-source tools for fairness testing. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

35. Mutation analysis for evaluating code translation

Author: Guizzo, Giovani, primary, Zhang, Jie M., additional, Sarro, Federica, additional, Treude, Christoph, additional, and Harman, Mark, additional
Published: 2023
Full Text: View/download PDF

36. Who Judges the Judge: An Empirical Study on Online Judge Tests

Author: Liu, Kaibo, primary, Han, Yudong, additional, Zhang, Jie M., additional, Chen, Zhenpeng, additional, Sarro, Federica, additional, Harman, Mark, additional, Huang, Gang, additional, and Ma, Yun, additional
Published: 2023
Full Text: View/download PDF

37. Large Language Models for Software Engineering: Survey and Open Problems

Author: Fan, Angela, primary, Gokkaya, Beliz, additional, Harman, Mark, additional, Lyubarskiy, Mitya, additional, Sengupta, Shubho, additional, Yoo, Shin, additional, and Zhang, Jie M., additional
Published: 2023
Full Text: View/download PDF

38. Bias Behind the Wheel: Fairness Analysis of Autonomous Driving Systems

Author: Li, Xinyue, Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, Zhang, Ying, Liu, Xuanzhe, Li, Xinyue, Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, Zhang, Ying, and Liu, Xuanzhe
Abstract: This paper analyzes fairness in automated pedestrian detection, a crucial but under-explored issue in autonomous driving systems. We evaluate eight state-of-the-art deep learning-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable thorough fairness testing, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues, particularly related to age. The undetected proportions for children are 20.14% higher compared to adults. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that pedestrian detectors demonstrate significant gender biases during night time, potentially exacerbating the prevalent societal issue of female safety concerns during nighttime out. Moreover, we observe that pedestrian detectors can demonstrate both enhanced fairness and superior performance under specific driving conditions, which challenges the fairness-performance trade-off theory widely acknowledged in the fairness literature. We publicly release the code, data, and results to support future research on fairness in autonomous driving.
Published: 2023

39. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation

Author: Ouyang, Shuyin, Zhang, Jie M., Harman, Mark, Wang, Meng, Ouyang, Shuyin, Zhang, Jie M., Harman, Mark, and Wang, Meng
Abstract: There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of coding tasks with zero equal test output across different requests is 72.73%, 60.40%, and 65.85% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it indeed brings less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take into account non-determinism in drawing their conclusions.
Published: 2023

40. Stealthy Backdoor Attack for Code Models

Author: Yang, Zhou, Xu, Bowen, Zhang, Jie M., Kang, Hong Jin, Shi, Jieke, He, Junda, and Lo, David
Abstract: Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code and play a vital role in supporting downstream automated software engineering tasks. Most recently, code models were revealed to be vulnerable to backdoor attacks. A code model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs on examples injected with triggers that activate the backdoors. Existing backdoor attacks on code models use unstealthy and easy-to-detect triggers. This paper aims to investigate the vulnerability of code models with stealthy backdoor attacks. To this end, we propose Afraidoor (Adversarial Feature as Adaptive Backdoor). Afraidoor achieves stealthiness by leveraging adversarial perturbations to inject adaptive triggers into different inputs. We apply Afraidoor to three widely adopted code models (CodeBERT, PLBART, and CodeT5) and two downstream tasks (code summarization and method name prediction). We evaluate three widely used defense methods and find that Afraidoor is more unlikely to be detected by the defense methods than by baseline methods. More specifically, when using spectral signature as defense, around 85% of adaptive triggers in Afraidoor bypass the detection in the defense process. By contrast, only less than 12% of the triggers from previous work bypass the defense. When the defense method is not applied, both Afraidoor and baselines have almost perfect attack success rates. However, once a defense is applied, the attack success rates of baselines decrease dramatically, while the success rate of Afraidoor remains high. Our finding exposes security weaknesses in code models under stealthy backdoor attacks and shows that state-of-the-art defense methods cannot provide sufficient protection. We call for more research efforts in understanding security threats to code models and developing more effective countermeasures.
Published: 2024
Full Text: View/download PDF

41. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers

Author: Chen, Zhenpeng, primary, Zhang, Jie M., additional, Sarro, Federica, additional, and Harman, Mark, additional
Published: 2023
Full Text: View/download PDF

42. MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software

Author: Chen, Zhenpeng, primary, Zhang, Jie M., additional, Sarro, Federica, additional, and Harman, Mark, additional
Published: 2022
Full Text: View/download PDF

43. Natural Test Generation for Precise Testing of Question Answering Software

Author: Shen, Qingchao, primary, Chen, Junjie, additional, Zhang, Jie M., additional, Wang, Haoyu, additional, Liu, Shuang, additional, and Tian, Menghan, additional
Published: 2022
Full Text: View/download PDF

44. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers.

Author: ZHENPENG CHEN, ZHANG, JIE M., SARRO, FEDERICA, and HARMAN, MARK
Abstract: Software bias is an increasingly important operational concern for software engineers. We present a largescale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance tradeoff assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance tradeoff measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42%∼66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24%∼59% according to different fairness metrics); (3) the bias mitigation methods even lead to decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best tradeoff in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s). [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

45. Mutation analysis and its industrial applications

Author: Gopinath, Rahul, primary, Zhang, Jie M., additional, Kintis, Marinos, additional, and Papadakis, Mike, additional
Published: 2022
Full Text: View/download PDF

46. GraphCode2Vec: generic code embedding via lexical and program dependence analyses

Author: Ma, Wei, Zhao, Mengjie, Soremekun, Ezekiel, Hu, Qiang, Zhang, Jie M., Papadakis, Mike, Cordy, Maxime, Xie, Xiaofei, Traon, Yves Le, Ma, Wei, Zhao, Mengjie, Soremekun, Ezekiel, Hu, Qiang, Zhang, Jie M., Papadakis, Mike, Cordy, Maxime, Xie, Xiaofei, and Traon, Yves Le
Abstract: Code embedding is a keystone in the application of machine learn- ing on several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is generic. To this end, we propose the first self-supervised pre-training approach (called GraphCode2Vec) which produces task-agnostic embedding of lexical and program dependence features. GraphCode2Vec achieves this via a synergistic combination of code analysis and Graph Neural Networks. GraphCode2Vec is generic, it allows pre-training, and it is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four (4) tasks (method name prediction, solution classification, mutation testing and overfitted patch classification), and compare it with four (4) similarly generic code embedding baselines (Code2Seq, Code2Vec, CodeBERT, Graph- CodeBERT) and seven (7) task-specific, learning-based methods. In particular, GraphCode2Vec is more effective than both generic and task-specific learning-based baselines. It is also complementary and comparable to GraphCodeBERT (a larger and more complex model). We also demonstrate through a probing and ablation study that GraphCode2Vec learns lexical and program dependence features and that self-supervised pre-training improves effectiveness.
Published: 2022

47. GraphCode2Vec

Author: Ma, Wei, primary, Zhao, Mengjie, additional, Soremekun, Ezekiel, additional, Hu, Qiang, additional, Zhang, Jie M., additional, Papadakis, Mike, additional, Cordy, Maxime, additional, Xie, Xiaofei, additional, and Traon, Yves Le, additional
Published: 2022
Full Text: View/download PDF

48. Improving machine translation systems via isotopic replacement

Author: Sun, Zeyu, primary, Zhang, Jie M., additional, Xiong, Yingfei, additional, Harman, Mark, additional, Papadakis, Mike, additional, and Zhang, Lu, additional
Published: 2022
Full Text: View/download PDF

49. Machine Learning Testing: Survey, Landscapes and Horizons

Author: Zhang, Jie M., primary, Harman, Mark, additional, Ma, Lei, additional, and Liu, Yang, additional
Published: 2022
Full Text: View/download PDF

50. A Study of Bug Resolution Characteristics in Popular Programming Languages

Author: Zhang, Jie M., primary, Li, Feng, additional, Hao, Dan, additional, Wang, Meng, additional, Tang, Hao, additional, Zhang, Lu, additional, and Harman, Mark, additional
Published: 2021
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

98 results on '"Zhang, Jie M."'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources