Large language models in public health: opportunity or threat? The case of button battery injuries
Lorenzoni, Giulia;Gregori, Dario
2025
Abstract
Background: Button battery (BB) injuries in children represent a severe and growing public health burden. The literature on the topic is extensive; however, there is a notable lack of structured public health initiatives addressing the problem. The present study aimed to test the feasibility of using large language models (LLMs) to draft recommendations for preventing and managing BB ingestion in children. Methods: A set of questions was generated and submitted to ChatGPT-4o and ChatGPT-o1-preview. Questions were based on statements and websites of scientific societies and not-for-profit organisations and were developed to produce comprehensive recommendations providing information on BB risks, primary and secondary prevention, clinical management and follow-up, and general public health initiatives. Two independent reviewers rated the accuracy and readability of the responses provided by the LLMs. Accuracy was rated on a four-level scale, while readability was assessed using two established readability tools, the Flesch Reading Ease (FRE) and the Flesch-Kincaid Grade Level (FKGL). Results: None of the answers provided by the LLMs was rated as completely or partially incorrect. ChatGPT-o1-preview outperformed ChatGPT-4o in accuracy, with eight answers rated as accurate and complete. Both models showed similar readability levels, with FKGL and FRE scores indicating college-level reading difficulty. Discussion: The LLMs demonstrated strong performance in this study, with no responses rated as completely or partially incorrect, showing their potential and feasibility for use in public health. Conclusions: The present findings suggest the potential feasibility of LLMs in public health for preventing paediatric injuries.
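For reference, the two readability tools cited in the abstract are standard formulas over word, sentence, and syllable counts. A minimal sketch of both metrics is given below; the function names are illustrative, and the counts are assumed to come from a separate text-analysis step (the study itself does not describe its counting procedure).

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (FRE): higher scores indicate easier text.
    College-level material typically falls in the 30-50 range."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade
    needed to understand the text; 13+ corresponds to college level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Note that the two scales run in opposite directions: harder text lowers the FRE score but raises the FKGL grade, which is why a college-level rating corresponds to a low FRE together with a high FKGL.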
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.