Automated Program Repair (APR) is awell-established research area that enhances software reliability and security by automatically fixing bugs, reducing manual effort, and accelerating debugging. Despite progress in publishing benchmarks to evaluate APR tools, datasets specifically targeting Android are lacking.To address this gap, we introduce a dataset of 272 real-world violations of Google's Android Security Best Practices, identified by statically analyzing 113 real-world Android apps. In addition to the faulty code, we manually crafted repairs based on Google's guidelines, covering 176 Java-based and 96 XML-based violations from Android Java classes and Manifest files, respectively. Additionally, we leveraged our novel dataset to evaluate Large Language Models (LLMs) as they are the latest promising APR tools. In particular, we evaluated GPT-4o, Gemini 1.5 Flash and Gemini in Android Studio and we found that GPT-4o outperforms Google's models, demonstrating higher accuracy and robustness across a range of violations types. Hence, with this dataset, we aim to provide valuable insights for advancing APR research and improving tools for Android security.

A Dataset for Evaluating LLMs Vulnerability Repair Performance in Android Applications: Data/Toolset paper

Braconaro E.;Losiouk E.
2025

Abstract

Automated Program Repair (APR) is awell-established research area that enhances software reliability and security by automatically fixing bugs, reducing manual effort, and accelerating debugging. Despite progress in publishing benchmarks to evaluate APR tools, datasets specifically targeting Android are lacking.To address this gap, we introduce a dataset of 272 real-world violations of Google's Android Security Best Practices, identified by statically analyzing 113 real-world Android apps. In addition to the faulty code, we manually crafted repairs based on Google's guidelines, covering 176 Java-based and 96 XML-based violations from Android Java classes and Manifest files, respectively. Additionally, we leveraged our novel dataset to evaluate Large Language Models (LLMs) as they are the latest promising APR tools. In particular, we evaluated GPT-4o, Gemini 1.5 Flash and Gemini in Android Studio and we found that GPT-4o outperforms Google's models, demonstrating higher accuracy and robustness across a range of violations types. Hence, with this dataset, we aim to provide valuable insights for advancing APR research and improving tools for Android security.
2025
Codaspy 2025 Proceedings of the 15th ACM Conference on Data and Application Security and Privacy
The 15th ACM Conference on Data and Application Security and Privacy
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3560229
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
  • OpenAlex ND
social impact