1. Binary level toolchain provenance identification with graph neural networks
- Author
Tristan Benoit, Jean-Yves Marion, Sébastien Bardin
- Affiliations
Carbone (CARBONE), Department of Formal Methods (LORIA - FM), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria), Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS); CEA-Saclay (CEA), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)
- Funding
This work is supported by (i) a public grant overseen by the French National Research Agency (ANR) as part of the 'Investissements d'Avenir' French PIA project 'Lorraine Université d'Excellence', reference ANR-15-IDEX-04-LUE, and (ii) funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 830927 (Concordia). Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).
Projects: GRID5000; IMPACT-DIGITRUST; ANR-15-IDEX-0004, LUE, Isite LUE (2015); European Project: 830927, CONCORDIA (2019)
- Subjects
Theoretical computer science; Artificial neural network; Computer science; Binary number; graph neural networks; Software engineering; Engineering and technology; toolchain provenance; Toolchain; [INFO.INFO-CR] Computer Science [cs]/Cryptography and Security [cs.CR]; [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]; Scalability; Electrical engineering, electronic engineering, information engineering; Graph reduction; ACM: D.: Software/D.2: SOFTWARE ENGINEERING/D.2.7: Distribution, Maintenance, and Enhancement/D.2.7.5: Restructuring, reverse engineering, and reengineering; ACM: D.: Software/D.2: SOFTWARE ENGINEERING/D.2.5: Testing and Debugging/D.2.5.2: Diagnostics; Control flow graph; binary code analysis; Artificial intelligence & image processing; Binary code; Compiler; ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.6: Learning
- Abstract
We consider the problem of recovering the compilation toolchain used to generate a given stripped binary. We present a Graph Neural Network framework that operates at the binary level and takes into account the shallow semantics provided by the structured control flow graph (CFG) of the binary code. We introduce a Graph Neural Network, called Site Neural Network (SNN), dedicated to this problem. To attain scalability at the binary level, feature extraction is simplified by discarding almost everything in a CFG except control-transfer instructions and by performing a parametric graph reduction. Our experiments show that our method recovers the compiler family with a very high F1-score of 0.9950, while the optimization level is recovered with a moderately high F1-score of 0.7517. On the compiler version prediction task, the F1-score is about 0.8167 excluding the clang family. A comparison with previous work demonstrates the accuracy and performance of this framework.
- Published
- 2021