Given the rising popularity of large language model (LLM) chatbots, the type of artificial intelligence (AI) that underlies ChatGPT, Google Bard, and BingAI, it is critical to examine the veracity of the musculoskeletal health information they provide. Three new studies evaluated how accurately chatbots present research advancements and support clinical decision making, and analyzed the validity of the information they provide to patients about specific orthopaedic procedures. The studies were presented at the American Academy of Orthopaedic Surgeons (AAOS) 2024 Annual Meeting.
Although the investigations found that some chatbots can succinctly summarize a variety of orthopaedic conditions, their accuracy varied by category, and the researchers agree that orthopaedic surgeons remain the most trustworthy source of information. The results will help professionals understand how effective these AI tools are, whether their use by patients or non-specialist colleagues could introduce bias or misunderstandings, and how future improvements could make chatbots a useful tool for both patients and physicians.
Study overviews and results
Potential misconceptions and risks related to using LLM chatbots in a clinical setting
Branden Sosa, a fourth-year medical student at Weill Cornell Medicine, led this study, which evaluated how well the OpenAI ChatGPT 4.0, Google Bard, and BingAI chatbots answer patient inquiries, integrate clinical data, and explain fundamental orthopaedic concepts. Each chatbot responded to 45 orthopaedic questions in the categories “Bone Physiology,” “Referring Physician,” and “Patient Query” and was then evaluated for accuracy. Two independent, blinded reviewers scored the responses for accuracy, completeness, and usability on a 4-point rating scale. The analysis examined the strengths and weaknesses of the chatbots in each category. The study team observed the following trends:
OpenAI ChatGPT, Google Bard, and BingAI gave accurate responses that addressed the most salient points in 76.7%, 33%, and 16.7% of the orthopaedic questions posed, respectively.
All chatbots showed notable limitations in offering clinical management recommendations. They departed from recommended courses of therapy and skipped important workup steps, such as recommending antibiotics before cultures were obtained or omitting key studies from the diagnostic workup.
ChatGPT and Google Bard answered less complicated patient questions with a high degree of accuracy, but they frequently missed important medical history information needed to answer the question completely.
A thorough examination of the citations produced by the chatbots identified ten broken links among the references supplied that either did not work or directed users to the wrong articles.
Is ChatGPT prepared for a major role? Analyzing how well AI responds to frequently asked queries from patients undergoing arthroplasty
To determine how accurately ChatGPT 4.0 responds to patient inquiries, researchers led by Jenna A. Bernstein, MD, an orthopaedic surgeon at Connecticut Orthopaedics, compiled a list of 80 frequently asked patient questions about knee and hip replacements. Each question was posed to ChatGPT twice: once as written, and again with a prompt asking ChatGPT to respond to the patient’s question “as an orthopaedic surgeon.” The two surgeons on the team each graded the accuracy of every set of responses on a scale of one to four. Agreement between the two surgeons’ grades was measured with Cohen’s kappa, and the relationship between response accuracy and the question prompt was evaluated with the Wilcoxon signed-rank test (an illustrative sketch of this type of analysis follows the findings below). Among the findings:
When ChatGPT was queried without the prompt, 26% of responses (21 of 80) received an average grade of three (partially accurate but incomplete) or below; with the prompt, 8% of responses (six of 80) received an average grade below three. The researchers therefore concluded that ChatGPT is not yet a sufficient resource for patient questions and that more work is needed to develop an accurate chatbot with an orthopaedic focus.
When appropriately prompted to respond “as an orthopaedic surgeon,” ChatGPT answered patient questions with 92% accuracy, a substantial improvement.
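As a purely illustrative aside, the sketch below shows how this kind of two-rater, two-condition grading analysis can be computed in Python with scipy and scikit-learn. The grades and variable names are hypothetical placeholders (the study’s data are not public); the sketch only demonstrates the mechanics of the two statistical methods named above.

```python
# Minimal, hypothetical sketch of the grading analysis described above.
# The grades below are invented; they only illustrate how Cohen's kappa
# (inter-rater agreement) and the Wilcoxon signed-rank test (paired
# comparison of unprompted vs. prompted accuracy) could be applied.
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-4 accuracy grades from two surgeon reviewers for ten questions.
surgeon_a_unprompted = [3, 2, 3, 3, 2, 3, 3, 2, 4, 3]
surgeon_b_unprompted = [3, 3, 4, 3, 2, 4, 2, 3, 3, 3]
surgeon_a_prompted   = [4, 3, 4, 4, 3, 4, 4, 3, 4, 4]
surgeon_b_prompted   = [4, 3, 4, 4, 4, 4, 3, 4, 4, 4]

# Inter-rater agreement within each prompting condition.
kappa_unprompted = cohen_kappa_score(surgeon_a_unprompted, surgeon_b_unprompted)
kappa_prompted = cohen_kappa_score(surgeon_a_prompted, surgeon_b_prompted)

# Average the two reviewers' grades per question, then compare the paired
# unprompted vs. prompted averages with the Wilcoxon signed-rank test.
mean_unprompted = [(a + b) / 2 for a, b in zip(surgeon_a_unprompted, surgeon_b_unprompted)]
mean_prompted = [(a + b) / 2 for a, b in zip(surgeon_a_prompted, surgeon_b_prompted)]
statistic, p_value = wilcoxon(mean_unprompted, mean_prompted)

print(f"kappa (unprompted) = {kappa_unprompted:.2f}, kappa (prompted) = {kappa_prompted:.2f}")
print(f"Wilcoxon statistic = {statistic:.1f}, p = {p_value:.3f}")
```

In such an analysis, higher kappa values indicate stronger agreement between the two reviewers, and a small Wilcoxon p-value suggests that prompting genuinely shifted the accuracy grades rather than the difference arising by chance.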
Can ChatGPT 4.0 answer patients’ questions about the Latarjet procedure for anterior shoulder instability?
Under the direction of Kyle Kunze, MD, researchers at the Hospital for Special Surgery in New York evaluated the suitability of ChatGPT 4.0 for educating patients with anterior shoulder instability about the Latarjet procedure. The study’s main objective was to determine whether the chatbot can function as a clinical adjunct, benefiting patients and providers by supplying reliable medical information.
To answer this question, the team first used the Google search term “Latarjet” to retrieve the top ten frequently asked questions (FAQs) and associated resources about the procedure. They then instructed ChatGPT to perform the same FAQ search and identified the questions and sources the chatbot supplied. Highlights of the findings included:
ChatGPT offered a wide range of clinically relevant questions and answers and consistently drew on academic sources. This contrasted with Google, which surfaced information from individual surgeons’ websites and larger medical practices, with only a small fraction coming from academic resources.
Technical details were the most frequently asked question category on both ChatGPT (40%) and Google (20%). ChatGPT also provided information on risks and consequences (30%), recovery time (20%), and surgical evaluation (10%).