Evaluation of the Performance of Large Language Models in Clinical Decision-Making in Endodontics
dc.authorscopusid | 57204425170 | |
dc.authorscopusid | 59758654100 | |
dc.authorscopusid | 59658096800 | |
dc.contributor.author | Ozbay, Yagiz | |
dc.contributor.author | Erdogan, Deniz | |
dc.contributor.author | Dincer, Gozde Akbal | |
dc.date.accessioned | 2025-05-31T20:20:55Z | |
dc.date.available | 2025-05-31T20:20:55Z | |
dc.date.issued | 2025 | |
dc.department | Okan University | en_US |
dc.department-temp | [Ozbay, Yagiz] Karabuk Univ, Fac Dent, Dept Endodont, Karabuk, Turkiye; [Dincer, Gozde Akbal] Okan Univ, Fac Dent, Dept Endodont, Istanbul, Turkiye | en_US |
dc.description.abstract | Background Artificial intelligence (AI) chatbots excel at generating natural language. The growing use of generative AI large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy, and their potential to support clinicians' decision-making in endodontics warrants evaluation. This study aimed to comparatively evaluate the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions in endodontics. Methods Forty open-ended questions covering different areas of endodontics were prepared and submitted to Google Bard, ChatGPT-3.5, and ChatGPT-4. The validity of the questions was assessed using the Lawshe Content Validity Index. Two experienced endodontists, blinded to the chatbots, scored the answers on a 3-point Likert scale. All responses deemed to contain factually wrong information were noted, and a misinformation rate was calculated for each LLM (number of answers containing wrong information / total number of questions). One-way analysis of variance and the post hoc Tukey test were used to analyze the data, with significance set at p < 0.05. Results ChatGPT-4 achieved the highest score and the lowest misinformation rate (p = 0.008), followed by ChatGPT-3.5 and Google Bard, respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (p = 0.004). Conclusion ChatGPT-4 provided the most accurate and informative answers in endodontics; however, all LLMs produced varying levels of incomplete or incorrect answers. | en_US
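The abstract describes a simple analysis pipeline: a per-model misinformation rate (wrong answers / total questions), then a one-way ANOVA with Tukey post hoc comparisons of the Likert scores. The sketch below is not the authors' code; the model names are taken from the abstract, but the scores and wrong-answer counts are made-up placeholders used only to illustrate how such a pipeline could be run with scipy and statsmodels.

```python
# Illustrative sketch of the analysis described in the abstract (hypothetical data).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical 3-point Likert scores for 40 questions per model (placeholder values).
scores = {
    "ChatGPT-4": rng.integers(2, 4, 40),
    "ChatGPT-3.5": rng.integers(1, 4, 40),
    "Google Bard": rng.integers(1, 3, 40),
}

# Misinformation rate = answers flagged as factually wrong / total number of questions.
wrong_counts = {"ChatGPT-4": 3, "ChatGPT-3.5": 7, "Google Bard": 11}  # hypothetical counts
for model, wrong in wrong_counts.items():
    print(f"{model}: misinformation rate = {wrong / 40:.2%}")

# One-way ANOVA across the three models' score distributions.
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post hoc test for pairwise comparisons (e.g., ChatGPT-4 vs. Google Bard).
groups = np.repeat(list(scores.keys()), 40)
values = np.concatenate(list(scores.values()))
print(pairwise_tukeyhsd(values, groups, alpha=0.05).summary())
```

The Tukey summary table indicates which pairwise differences are significant at alpha = 0.05, mirroring the abstract's reported ChatGPT-4 vs. Google Bard comparison.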
dc.description.woscitationindex | Science Citation Index Expanded | |
dc.identifier.doi | 10.1186/s12903-025-06050-x | |
dc.identifier.issn | 1472-6831 | |
dc.identifier.issue | 1 | en_US |
dc.identifier.pmid | 40296000 | |
dc.identifier.scopus | 2-s2.0-105003802131 | |
dc.identifier.scopusquality | Q2 | |
dc.identifier.uri | https://doi.org/10.1186/s12903-025-06050-x | |
dc.identifier.uri | https://hdl.handle.net/20.500.14517/7909 | |
dc.identifier.volume | 25 | en_US |
dc.identifier.wos | WOS:001478412500002 | |
dc.identifier.wosquality | Q2 | |
dc.language.iso | en | en_US |
dc.publisher | BMC | en_US |
dc.relation.publicationcategory | Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı | en_US |
dc.rights | info:eu-repo/semantics/closedAccess | en_US |
dc.subject | ChatGPT | en_US
dc.subject | Chatbot | en_US |
dc.subject | Large Language Model | en_US |
dc.subject | Endodontics | en_US |
dc.subject | Endodontology | en_US |
dc.title | Evaluation of the Performance of Large Language Models in Clinical Decision-Making in Endodontics | en_US |
dc.type | Article | en_US |