Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data

Testolin A.
2024

Abstract

This paper describes a machine learning system designed to identify sensitive data within Italian text documents, in line with the definitions and regulations set out in the General Data Protection Regulation (GDPR). To overcome the lack of suitable training datasets, which would require the disclosure of sensitive data from real users, the proposed system exploits a Large Language Model (LLM) to generate synthetic documents that can be used to train supervised classifiers to detect the target sensitive data. We show that “artificial” sensitive data can be generated with both proprietary and open-source LLMs, demonstrating that the proposed approach can be implemented either through external services or with locally runnable models. We focus on the detection of six key domains of sensitive data by training supervised classifiers based on the BERT Transformer architecture, adapted to carry out text classification and Named-Entity Recognition (NER) tasks. We evaluate the performance of the system using fine-grained metrics and show that the NER model achieves remarkable detection performance (over 90% F1 score), confirming the quality of the synthetic datasets generated with both proprietary and open-source LLMs. The dataset generated with the open-source model is made publicly available for download.
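The abstract does not include training details, but the NER component it describes can be approximated with standard tooling. Below is a minimal, hypothetical sketch of fine-tuning an Italian BERT checkpoint for token classification on LLM-generated synthetic documents, using the Hugging Face transformers and datasets libraries. The checkpoint name, the BIO label set (loosely inspired by the GDPR Art. 9 special categories, since the abstract does not list the six domains), and the toy example standing in for the synthetic corpus are all assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch: fine-tune a BERT encoder for NER over synthetic,
# LLM-generated Italian documents annotated with GDPR-sensitive entities.
# Checkpoint, label set, and data are illustrative assumptions only.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)
from datasets import Dataset

# Assumed BIO label set over six sensitive-data domains (not the paper's list).
LABELS = ["O",
          "B-HEALTH", "I-HEALTH",
          "B-POLITICS", "I-POLITICS",
          "B-RELIGION", "I-RELIGION",
          "B-SEXUALITY", "I-SEXUALITY",
          "B-ETHNICITY", "I-ETHNICITY",
          "B-UNION", "I-UNION"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

checkpoint = "dbmdz/bert-base-italian-cased"  # assumed Italian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

# Tiny toy example standing in for the LLM-generated synthetic corpus.
toy = Dataset.from_dict({
    "tokens": [["Il", "paziente", "soffre", "di", "diabete", "."]],
    "ner_tags": [[0, 0, 0, 0, label2id["B-HEALTH"], 0]],
})

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level tags to sub-tokens.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        labels, prev = [], None
        for wid in word_ids:
            if wid is None:
                labels.append(-100)       # ignore special tokens in the loss
            elif wid != prev:
                labels.append(tags[wid])  # label only the first sub-token
            else:
                labels.append(-100)
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

train = toy.map(tokenize_and_align, batched=True, remove_columns=toy.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-sensitive", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In practice the toy dataset would be replaced by the synthetic corpus described in the abstract, with one BIO sequence per LLM-generated document; the same encoder could also be fine-tuned with a sequence-classification head for the document-level classification task mentioned alongside NER.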
CEUR Workshop Proceedings
20th Conference on Information and Research Science Connecting to Digital and Library Science, IRCDL 2024

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3524608
Citations
  • Scopus: 0
  • PMC, Web of Science (ISI), OpenAlex: not available