Exploiting Curated Databases to Train Relation Extraction Models for Gene-Disease Associations

Marchesin S.; Silvello G.
2022

Abstract

Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to shift these expensive and time-consuming processes to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this, few resources have been devoted to training – and evaluating – models for GDA extraction. Besides, such resources are limited in size, preventing models from scaling effectively to large amounts of data. To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The dataset and models are publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.
Proceedings of the 30th Italian Symposium on Advanced Database Systems (SEBD 2022), June 19–22, 2022, Pisa, Italy
File: paper16.pdf (open access; published publisher's version; Creative Commons license; Adobe PDF, 215.44 kB)
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3508118