Deep View: The Challenges of Making AI for Mental Health Work

  
The challenges of making AI for mental health care work. (Source: Unsplash)

Benchmarks have become something of a holy grail in the AI field. When we’re dealing with systems designed to automate high-level knowledge work, the need for some method of measurement is clear. But the problem, as we’ve often discussed in this newsletter, is that benchmarks are rarely a good indicator of real-world performance, especially when it comes to generative AI. 

A lack of transparency around training data, combined with the black-box nature of large language models (LLMs), means that researchers often can't tell whether a model is exhibiting genuine capability or was simply trained on the contents of the benchmark itself. It's the difference between a student who studies for and passes a test, and a student who was handed the test in advance, memorized the answers and recalled them later.

In some realms, and for some people, the difference might not matter. 

But it's a piece of nuance that becomes more important as these models get integrated into higher-stakes environments, since it bears heavily on user trust. This is especially true of generative AI integrations in healthcare, an integration that is happening in full force today.

Many GenAI systems in healthcare are back-end systems, employed by researchers to speed up drug development, for instance. But increasingly, we're seeing the rise of AI-powered clinical assistants and notetakers designed to help nurses and reduce the administrative burden on doctors. Ethics and challenges aside, it's an integration that requires a robust reliability calculus: an effective benchmark.

Researchers at Stanford just proposed one for mental health, a field that has not been spared the unrelenting push of AI integration. 

The details: Current benchmarks, according to the paper, are built to mimic exams, complete with multiple-choice answer options.

The problem with this is that, “even for humans … success in these standardized tests only weakly correlates with clinicians’ real-world performance, a disconnect that can be especially problematic in psychiatry, where diagnosis and management hinge on subjective judgments and interpersonal nuances.” 

The problem is even worse with the benchmarks used to assess LLMs, which, according to the researchers, "over-simplify the complexities of day-to-day clinical practice tasks."

The proposed benchmark, which the researchers made openly available on GitHub under an MIT license, was curated by a diverse group of experts across five major domains: diagnosis, treatment, monitoring, triage and documentation. It focuses on real-world ambiguity, assesses open-ended clinical responses and was built without the use of an LLM.

The researchers evaluated a number of off-the-shelf models from developers including OpenAI, Anthropic and Meta. The models, which aren't designed or intended for clinical applications, performed well in the diagnosis category, collectively achieving an 80% accuracy rate. But they averaged less than 50% accuracy on triage and documentation, managing only 67% for monitoring and 76% for treatment.
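
For a concrete sense of what "accuracy per domain" means here, the sketch below tallies expert pass/fail judgments of open-ended model responses into per-domain accuracy rates. This is a hypothetical illustration of the general scoring idea, not the benchmark's actual code or schema; the function name and the data are invented for the example.

```python
from collections import defaultdict

def domain_accuracy(graded_responses):
    """Compute per-domain accuracy from expert-graded results.

    graded_responses: iterable of (domain, is_correct) pairs, where
    is_correct is an expert's judgment of one open-ended model response.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in graded_responses:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    # Accuracy = correct responses / total responses, per domain.
    return {domain: correct[domain] / totals[domain] for domain in totals}

# Hypothetical graded responses across the paper's five domains.
graded = [
    ("diagnosis", True), ("diagnosis", True), ("diagnosis", False),
    ("treatment", True), ("treatment", True),
    ("monitoring", True), ("monitoring", False),
    ("triage", False), ("documentation", False),
]
print(domain_accuracy(graded))
# e.g. {'diagnosis': 0.666..., 'treatment': 1.0, 'monitoring': 0.5, ...}
```

The hard part, of course, isn't the arithmetic; it's the expert grading of ambiguous, open-ended responses that feeds into it, which is exactly what multiple-choice benchmarks skip.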

According to the researchers, the models "perform well on structured tasks," but "struggle significantly with ambiguous real-world tasks … underscoring the limitations of current AI models in handling uncertainty."

Under the impression, laid out so elegantly by Psychology Today, that "like it or not, (AI) is here to stay," we're going to need many more benchmarks of this nature: expert-curated and as ambiguous and nuanced as humanly possible. They're harder to train for, and they are absolute necessities when clinicians are considering the adoption of these tools. Experts in domains beyond computer science need to be made aware of just how reliable, and just how trustworthy, generative AI models are (or are not). Even an 80% accuracy rate ought to be unacceptable in plenty of fields.

Of course, as the paper mentions, this doesn't address several key problems that surround this integration, including data security and privacy, consent, reliability and algorithmic bias. Taken together, it adds up to the risk that "AI systems could be prematurely deployed in psychiatric care, potentially leading to harmful, biased or unreliable clinical decisions." But I guess you can't have everything. This is the start of a start.

