Tangent

A Plea to the Labs: Let the Models Diagnose.

When Anthropic released Fable 5, I excitedly went and tried a recent ECG case I found which hit the edge of medical LLM capabilities nicely. GPT 5.5/Opus 4.8 can solve it consistently, but only in their respective harnesses, as they need to zoom in to relevant parts of the ECG, and even then, they only get it right if you include the patient history. This is interesting because it is possible (but very hard) to solve the case from the ECG alone; in fact, one of the most senior cardiologists in my hospital did manage to do so when I curbsided him with it. In summary, it is a hard case that nicely demonstrates the interplay between LLM vision capabilities, harnesses, and medical reasoning.

So it is immensely disappointing to find that it is outright impossible to get Fable to return a response, not just on ECG cases, but on any medical case where you ask it to make a diagnosis. If you pretend to be a patient looking for general medical advice it does actually make it through the guardrails, but even then the guardrails are on a hair-trigger. The moment there is any hint that you are explicitly looking for a diagnosis, or that the model gets the information it needs to actually make one, you are immediately booted down to Opus 4.8.

This is not an isolated event, but the ultimate endpoint of a long and, in my opinion, misguided paradigm which the labs have been pushed into by the press, by liability worries, and by some medical professionals. Basically, the paradigm is this: "LLMs shouldn't make diagnoses, because they might be wrong." On the surface this might sound reasonable, but it is not. At least not anymore. I would argue that it is profoundly unethical, and represents the worst of a paternalistic attitude that the medical profession has tried to move away from for decades. The Fable guardrails just happen to highlight exactly how misguided it is. "Oh, you want Fable to make a potentially lifesaving diagnosis? I'm afraid we can't have any of that, Fable just might do too good a job, so here is an attempt by a stupider model instead!" The guardrails are clearly a bit over-tuned at the moment, but the fact that they are tuned at all in this direction is a problem.

The reality of LLMs for medical diagnosis is this: the models are good. They are really, really good. Unfortunately, the literature hasn't caught up with how good they are yet. Most publications and news articles claiming they aren't good enough test what are by now ancient models. Take this one that got some press attention about models giving potentially dangerous advice when faced with problems that required immediate medical attention: it used GPT-4o, Llama 3, and Command R+ (for some reason). And even then, the bad results that they get are usually a result of bad prompting. Models correctly identified the relevant conditions in 94.9% of cases; they just didn’t do so robustly when prompted badly. There are unfortunately no truly up-to-date benchmarks about medical performance beyond OpenAI's HealthBench (which Fable seemed to score no better on than Opus 4.8; I suspect it's saturated due to bad design), which only measures what a bunch of doctors thought the "right" way to think about a case was. Most basic medical Q&A benchmarks have long since been saturated, and real-life data is sparse. There are some publications of clinics using very basic "pre-assessment" with what usually amounts to some non-reasoning, non-agentic LLM system, and already at that point the models are usually rated favorably compared to MDs. Google has one framework it reports performs better than humans.

In any case, in medicine, the only thing that truly matters is making the correct diagnosis, or, failing that, taking the correct steps to unveil it. There is no opinion to it; reality is the way it is and that's that, figure it out or die. And when you test LLMs on actual cases, models have long outperformed MDs. The benchmark I linked used contemporary cases to try to limit data contamination, and in my experience the models are just as good on genuinely fresh case reports. Just to illustrate my point, this benchmark is also the most up-to-date case-based benchmark I am aware of testing this issue, and the most recent frontier model it tested was o3 (alongside Gemini 2.5 Pro).

All of this does not mean that LLMs outperform an entire hospital system, which is what models are truly competing with, but it does mean that they are at least good enough to have an opinion on the matter. And to be fair, there are particular areas of medicine that seem to be slightly out of distribution. The example I used to use was a private image of a pathological blood smear, taken with a smartphone through a microscope and shared in the department as a puzzle case. It used to stump GPT-5.2 (which consistently misinterpreted it as CML when it was AML, an extremely consequential difference) but was solved by Gemini and Claude, which I thought was neat. These days, however, all the SOTA models (besides Fable, which of course refuses to answer) present very reasonable top-3 differentials when asked from their respective harnesses. So honestly, I don't even know what's out of distribution anymore, though surely there has to be something.

So, the models are good. They are so good that finding cases they can't solve correctly is starting to become difficult. And besides the potential consequences for my future job security, this is an amazingly positive development! Medicine is not about employing doctors, but about curing disease. Widespread access to excellent diagnostic capabilities is a good thing for humanity. People will always make stupid decisions, like poisoning themselves after mistaking cleaning advice for dietary advice given by GPT-3.5 (most likely). I have personally witnessed misunderstanding lead to catastrophic outcomes, not just when a patient misunderstands instructions, but when MDs misunderstand each other. Miscommunication is in fact the largest cause of medical errors, even within hospitals. This is how it always has been, and I have seen nothing to suggest that SOTA models will do anything but improve this state of affairs. By the most conservative widely cited estimates, preventable medical errors kill somewhere between 44,000 and 98,000 people yearly in the US alone (and this is not even the highest estimate), and those cases usually concern far more basic errors than the problems I now use to challenge LLMs.

And besides, the medical profession has long since made peace with the fact that people make stupid decisions, and sometimes die because of it. The modern stance for how to deal with this is not paternalism, but shared decision-making, essentially accepting the fact that people will sometimes make decisions that we don't want them to. Sometimes out of ignorance, sometimes because their risk tolerance is higher than ours, and sometimes because they have a different view of life than the medical profession represents. We do our best, and that's all you can do, and this is fine. This perspective is entirely lost when the press chases headlines about people doing wacky things because ChatGPT told them to. In the end, LLMs have never been responsible for people's actions, just like MDs are not. You give your best advice and hope for the best. And the best advice LLMs can give is, in fact, really good.

It still makes a slight amount of sense for LLMs to hedge their medical advice ("I am not a doctor, but..."), seeing as it's still a slight skill issue to get the best possible response (the skill being to download a harness and set thinking effort to max), and there is probably some kind of law involved somewhere. But it does not make sense at all to aggressively align your model to never make diagnoses. In fact, this is patently unethical when the only thing you do in practice is downgrade your response. LLMs do in fact usually give somewhere between good and excellent medical advice. It is of course still possible for them to make mistakes; that is just in the nature of medicine. People are already dying every single day because of medical mistakes, and I think it is quite likely that all current SOTA LLMs would make fewer diagnostic mistakes than the average MD. They have long since passed the threshold where their "opinion" is at least worth consideration. Whether they are good enough to truly contribute remains to be seen, but locking them down on the mistaken notion that they should never even be allowed to try is fundamentally misguided. I fear a future in which model alignment against medicine cripples their diagnostic potential. I fear a future in which only "trusted" users are even allowed to ask actual medical questions. If all models were as "aligned" as Fable today, I suspect it would already lead to deaths that could otherwise have been prevented. Both medical doctors and patients are already using these tools, and I know of many anecdotes and some evidence that this is for the better. The current state of the literature is that even models from two years ago were already equal or superior to medical doctors. The models are good. Please don’t cause harm because you can’t guarantee that they are perfect. At the very least let me test the goddamn thing. So, to all the labs: please, please, please stop aligning your models to refuse to make diagnoses.
Stick with the "I am not a doctor" disclaimer, or better yet with the simple "LLMs sometimes make mistakes" disclaimer at the bottom of the page. It is more than enough.

NB: The post was fact- and grammar-checked by Opus 4.8 and GPT 5.5. Fable, of course, hit the guardrails and got cut off. At least the reasoning trace before that happened seemed to agree.