The challenges of making AI for mental health care work

Source: Unsplash

Benchmarks have become something of a holy grail in the AI field. When we’re dealing with systems designed to automate high-level knowledge work, the need for some method of measurement is clear. But the problem, as we’ve often discussed in this newsletter, is that benchmarks are rarely a good indicator of real-world performance, especially when it comes to generative AI. A simple lack of transparency crossed with the consistent black-box nature of large language models (LLMs) means that researchers don’t really know if a model is exhibiting genuine performance, or if it was just trained on the information in the benchmark test.

It’s the difference between, for example, a student studying for and passing a test, and a student who was given the test to study, then memorized the answers and recalled them later. In some realms, and for some people, the difference might not matter. But it’s a piece of nuance that gets more important as these models get integrated into higher-stakes environments, as it relates heavily to levels of user trust.

This is especially true of GenAI integrations in healthcare fields, an integration that is happening in full force, today. Many GenAI systems in healthcare are back-end systems, employed by researchers to speed up drug development, for instance. But increasingly, we’re seeing the rise of AI-powered clinical assistants and notetakers designed to help out nurses and reduce the administrative burden faced by doctors. Ethics and challenges aside, it’s an integration that requires a robust reliability calculus; an effective benchmark. Researchers at Stanford just proposed one for mental health, a field that has not been spared the unrelenting push of AI integration.

The details: Current benchmarks, according to the paper, are built to mimic exams, and come complete with multiple-choice answer options.
The problem with this is that, “even for humans … success in these standardized tests only weakly correlates with clinicians’ real-world performance, a disconnect that can be especially problematic in psychiatry, where diagnosis and management hinge on subjective judgments and interpersonal nuances.” This is even worse with the benchmarks that assess LLMs, according to the researchers, which “over-simplify the complexities of day-to-day clinical practice tasks.”

The proposed benchmark, which the researchers made openly available on GitHub under an MIT license, was curated by a diverse group of experts across five major domains: diagnosis, treatment, monitoring, triage and documentation. It focuses on real-world ambiguity, assessing open-ended clinical responses, and was built without the use of an LLM.

The researchers assessed a number of off-the-shelf models from developers including OpenAI, Anthropic and Meta. The models, which aren’t designed or intended for clinical applications, performed well in the diagnosis category, collectively achieving an 80% accuracy rate. But they averaged accuracy rates of less than 50% for triage and documentation, and reached only 67% for monitoring and 76% for treatment. According to the researchers, the models “perform well on structured tasks,” but “struggle significantly with ambiguous real-world tasks … underscoring the limitations of current AI models in handling uncertainty.”

Under the impression, laid out so elegantly by Psychology Today, that “like it or not, (AI) is here to stay,” we’re going to need many more benchmarks of this nature: expert-curated and as ambiguous and nuanced as humanly possible. They’re harder to train for, and they are absolute necessities when clinicians are considering the adoption of these tools. Experts in domains beyond computer science absolutely need to be made aware of just how reliable, and just how trustworthy, generative AI models are (or are not).
Even 80% accuracy rates ought to be pretty unacceptable in a bunch of fields.

Of course, as the paper mentions, this doesn’t address several key problems that surround this integration, which include data security and privacy, consent, reliability and algorithmic bias. Taken together, it adds up to the risk that “AI systems could be prematurely deployed in psychiatric care, potentially leading to harmful, biased or unreliable clinical decisions.” But I guess you can’t have everything. This is the start of a start.