
LLMs still struggle to cite medical sources reliably: Stanford researchers introduce SourceCheckup to evaluate factual support in AI-generated responses

As LLMs become increasingly prominent in healthcare settings, ensuring that their outputs are backed by reliable sources is increasingly important. Although no LLM has been approved by the FDA for clinical decision-making, top models such as GPT-4o, Claude, and Med-PaLM outperform clinicians on standardized examinations such as the USMLE. These models are already being used in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate, generating unverified or inaccurate statements, poses serious risks, especially in medical settings where misinformation can lead to harm. This issue has become a major concern for clinicians, many of whom cite the lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators such as the FDA have also emphasized transparency and accountability, underscoring the need for reliable source attribution in healthcare AI tools.

Recent advances, such as instruction fine-tuning and retrieval-augmented generation (RAG), allow LLMs to cite sources when prompted. But even when a reference points to a legitimate website, it is often unclear whether that source actually supports the model's claims. Previous studies have introduced datasets such as WebGPT, ExpertQA, and HAGRID to evaluate LLM source attribution; however, these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches use LLMs themselves to assess attribution quality, as in ALCE, AttributedQA, and FActScore. Although tools such as ChatGPT can assist in evaluating citation accuracy, research shows that such models still struggle to ensure reliable attribution in their outputs, highlighting the need for continued work in this area.
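
For illustration, here is a minimal sketch of the LLM-as-judge idea described above: asking a model whether a cited source supports a given claim. The prompt, function name, and model choice are assumptions for this example rather than details from the cited works, and it presumes the OpenAI Python SDK with an API key set in the environment.

```python
# Minimal sketch (illustrative, not from the cited works): use an LLM to judge
# whether a source passage supports a claim. Assumes the OpenAI Python SDK and
# an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def source_supports_claim(claim: str, source_text: str, model: str = "gpt-4o") -> bool:
    """Ask the model whether `source_text` fully supports `claim`."""
    prompt = (
        "Does the source text below fully support the claim?\n\n"
        f"Claim: {claim}\n\n"
        f"Source:\n{source_text}\n\n"
        "Answer with exactly one word: Yes or No."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```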

Researchers at Stanford University and other institutions have developed SourceCheckup, an automated tool that evaluates how well LLMs support their medical responses with relevant sources. Analyzing 800 questions and more than 58,000 statement-source pairs, they found that 50%–90% of LLM-generated answers were not fully supported by the sources they cited, with GPT-4 producing unsupported claims in about 30% of cases. Even LLMs with web access struggled to consistently provide source-supported responses. Validated against medical experts, SourceCheckup exposed substantial gaps in the reliability of LLM-generated references, raising critical questions about their readiness for use in clinical decision-making.

The study evaluated the source attribution performance of several top-performing proprietary and open-source LLMs using a custom pipeline called SourceCheckup. The process involved generating 800 medical questions, half drawn from Reddit's r/AskDocs and half created by GPT-4o from MayoClinic texts, and then evaluating each LLM's answers for factual accuracy and citation quality. Each response was broken down into verifiable statements, matched against its cited sources, and scored for support using GPT-4. The framework reports metrics including URL validity and statement-level and response-level support rates. Medical experts validated all components, and the results were cross-validated with Claude Sonnet 3.5 to assess potential bias from using GPT-4 as the judge.
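
The sketch below outlines how the statement-level and response-level metrics described above could be aggregated. It is not the authors' released code: parsing answers into (statement, URL) pairs, fetching each URL, and judging support are passed in as hypothetical callables standing in for the GPT-4 and web-scraping steps.

```python
# Illustrative aggregation of SourceCheckup-style metrics; helper callables are
# placeholders for the parsing, scraping, and support-judging steps assumed here.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class StatementResult:
    statement: str
    url: str
    url_valid: bool
    supported: bool

def evaluate_response(
    statement_url_pairs: Iterable[tuple[str, str]],
    fetch_url_text: Callable[[str], Optional[str]],   # returns None for invalid/unreachable URLs
    supports: Callable[[str, str], bool],              # e.g., an LLM-as-judge check
) -> dict:
    """Score one LLM response given its parsed (statement, cited URL) pairs."""
    results = []
    for statement, url in statement_url_pairs:
        text = fetch_url_text(url)
        supported = text is not None and supports(statement, text)
        results.append(StatementResult(statement, url, text is not None, supported))
    n = max(len(results), 1)
    return {
        "url_validity_rate": sum(r.url_valid for r in results) / n,
        "statement_support_rate": sum(r.supported for r in results) / n,
        "response_fully_supported": bool(results) and all(r.supported for r in results),
    }
```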

The study validates how well LLMs verify and cite medical information and introduces a comprehensive evaluation system called SourceCheckup. Human experts confirmed that the model-generated questions were relevant and answerable, and that the parsed statements closely matched the original responses. In source verification, the pipeline's accuracy nearly matched that of expert physicians, with no statistically significant difference between model and expert judgments. Claude Sonnet 3.5 and GPT-4o showed comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron underperformed significantly, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than the other models thanks to its internet access, produced responses that were fully supported by reliable sources only 55% of the time, and similar limitations were observed across all models.
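
As a rough illustration of how automated support judgments can be checked against expert annotations, the snippet below computes simple percentage agreement over statement-level labels; the paper's actual validation statistics may differ.

```python
# Illustrative only: simple percentage agreement between automated and expert
# statement-level support labels (the paper's statistical tests may differ).
def percent_agreement(model_labels: list[bool], expert_labels: list[bool]) -> float:
    """Fraction of statements on which the model and the expert agree."""
    assert model_labels and len(model_labels) == len(expert_labels)
    return sum(m == e for m, e in zip(model_labels, expert_labels)) / len(model_labels)

# Hypothetical example: automated judgments vs. physician annotations.
print(percent_agreement([True, True, False, True, False],
                        [True, True, False, False, False]))  # 0.8
```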

These findings highlight persistent challenges in ensuring the factual accuracy of LLM responses to open-ended medical questions. Many models, even those augmented with retrieval, failed to tie their claims to reliable evidence, especially for questions from community platforms such as Reddit, which tend to be more ambiguous. Both human and automated assessments consistently showed low response-level support rates, underscoring the gap between current model capabilities and the standards required in clinical settings. To increase trust, the research suggests that models should be trained or fine-tuned for accurate citation and verification. In addition, automated tools such as SourceCleanup show promise for editing unsupported statements to improve factual grounding, offering a scalable way to improve the citation reliability of LLM output.
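
To make the SourceCleanup idea concrete, here is a hedged sketch of rewriting an unsupported statement so that it keeps only claims backed by its cited source. The prompt and function name are assumptions for this example, not the authors' implementation, and it again presumes the OpenAI Python SDK.

```python
# Hedged sketch of the SourceCleanup idea: rewrite a statement so that every
# claim it makes is backed by the cited source. Illustrative prompt only, not
# the authors' released implementation.
from openai import OpenAI

client = OpenAI()

def ground_statement(statement: str, source_text: str, model: str = "gpt-4o") -> str:
    """Return a minimally edited revision of `statement` supported by `source_text`."""
    prompt = (
        "Rewrite the statement below, changing as little as possible, so that every "
        "claim it makes is explicitly supported by the source text. Remove any claim "
        "the source does not support.\n\n"
        f"Statement: {statement}\n\n"
        f"Source:\n{source_text}\n\n"
        "Revised statement:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```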


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
