Blog

Research methods, AI tools, and lessons from the field

The Instruction Tuning Firewall
AI Safety · Mental Health · LLMs

Why you can't monitor therapy chatbots by reading their output

Mental health chatbots can drift toward dangerous validation while sounding perfectly appropriate. I built a monitoring system that detects persona drift in model activations and catches problems that even a fine-tuned DeBERTa misses, with a 2.6× advantage on crisis recognition. The approach was validated by two clinical psychologists (ICC = 0.716) and tested on naturalistic emotional support conversations.

February 15, 2026