The Instruction Tuning Firewall
Why you can't monitor therapy chatbots by reading their output
Mental health chatbots can drift toward dangerous validation while still sounding perfectly appropriate. I built a monitoring system that detects persona drift in the model's activations rather than in its output text, catching problems that even a fine-tuned DeBERTa misses, with a 2.6× advantage on crisis recognition. The system was validated by two clinical psychologists (ICC = 0.716) and tested on naturalistic emotional support conversations.
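To make "detecting drift in activations" concrete, here is a minimal sketch of one common way to do it: estimate a direction in activation space that separates appropriate from drifted responses, then score each new turn by its projection onto that direction. This is an illustration under assumptions, not necessarily the method used in this system. The difference-of-means direction, the function names, the two-dimensional toy vectors, and the threshold are all hypothetical stand-ins for real hidden states and a tuned cutoff.

```python
from statistics import mean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [mean(column) for column in zip(*vectors)]


def persona_direction(baseline_acts, drifted_acts):
    """Unit vector pointing from the baseline mean to the drifted mean.

    Assumption: a simple difference-of-means direction, chosen for
    illustration only.
    """
    base = mean_vector(baseline_acts)
    drift = mean_vector(drifted_acts)
    diff = [d - b for d, b in zip(drift, base)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0:
        raise ValueError("baseline and drifted means are identical")
    return [x / norm for x in diff]


def drift_score(activation, direction, baseline_mean):
    """Projection of a baseline-centered activation onto the direction."""
    return sum(
        (a - m) * d for a, m, d in zip(activation, baseline_mean, direction)
    )


def flag_drift(turn_activations, direction, baseline_mean, threshold):
    """One boolean per turn: True if the drift score exceeds the threshold."""
    return [
        drift_score(act, direction, baseline_mean) > threshold
        for act in turn_activations
    ]


# Toy 2-D "activations" (hypothetical; real hidden states have thousands of dims).
baseline = [[0.0, 1.0], [0.0, 3.0]]
drifted = [[2.0, 1.0], [2.0, 3.0]]

direction = persona_direction(baseline, drifted)
baseline_mean = mean_vector(baseline)

# Two conversation turns: the second has moved along the drift direction.
turns = [[0.1, 2.0], [1.5, 2.0]]
flags = flag_drift(turns, direction, baseline_mean, threshold=1.0)
```

The point of the sketch is that the score depends only on internal state, so two replies with near-identical wording can still receive different scores.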