Blog

Research methods, AI tools, and lessons from the field

The Instruction Tuning Firewall

Why you can't monitor therapy chatbots by reading their output

Mental health chatbots can drift toward dangerous validation while sounding perfectly appropriate. I built a monitoring system that detects persona drift in model activations—catching problems that even a fine-tuned DeBERTa misses, with a 2.6× advantage on crisis recognition. Validated by two clinical psychologists (ICC=0.716) and tested on naturalistic emotional support conversations.

February 15, 2026

AI Ethics, Impact Evaluation, FCAS

When Algorithms Meet Warzones

The ethics of AI in fragile state research

A drone image classifier that can't distinguish combatants from farmers. A beneficiary targeting model trained on data from before the displacement. A chatbot collecting trauma narratives in a language it barely understands. These aren't hypotheticals—they're the edge cases where AI meets impact evaluation in fragile contexts.

February 6, 2026

AI Governance, Policy, LLMs

The Capacity Gap

What I found scoring 2,216 AI policies across 193 countries

I scored 2,216 AI policy documents across 193 countries on implementation capacity. The headline isn't that rich countries do better—it's that the gap nearly vanishes once you account for documentation quality. The real story is what's happening within income groups.

February 6, 2026

Climate, Epidemiology, Public Health

The Mortality Equation Brazil Doesn't Know It Needs

Counting deaths across 35 degrees of latitude

Every year, temperature kills tens of thousands of Brazilians—more from cold than heat. But the country doesn't have the epidemiological infrastructure to know exactly where or how many. This is my attempt to build it.

February 6, 2026

RAG, Voice AI, Gradio

Talking to Your Evidence Base

What if you could ask your research library a question out loud and get a spoken answer grounded in actual studies? A retrieval-augmented system with a voice interface makes research synthesis conversational.

January 20, 2026

R, Teaching, Impact Evaluation

Teaching Impact Evaluation Methods with R

A comprehensive course that combines R programming fundamentals with rigorous causal inference methods. From randomized experiments to difference-in-differences, participants learn to implement and interpret impact evaluations.

January 15, 2026

Statistics, Small Samples, Research Methods

Inference Under Scarcity

Nineteen treatment arms. Five to ten participants each. Most effects are null, but three are real. Can standard methods find them? No. Can we still learn something useful? Yes—but only if we're honest about what 'learning' means here.

January 5, 2026

Statistics, Permutation Tests, Small Samples

Small Sample Inference: A Practical R Tutorial

Permutation tests, FDR adjustment, Max-T correction, blocked designs, and pooling strategies for analyzing experiments with tiny samples. Complete R code included.

January 5, 2026

AI, AI Governance, UNESCO

Did Anyone Actually Follow UNESCO's AI Ethics Recommendation?

193 countries agreed on ethical AI principles. Then reality happened.

In November 2021, every UNESCO member state adopted a shared vision for ethical AI. I scored 2,216 policies from 193 countries against UNESCO's 21 components. Mean alignment: 1.68 out of 4. Countries cherry-picked what they liked and ignored the rest.

December 27, 2025

AI, AI Governance, Ethics

The Principles-to-Practice Gap in AI Ethics

Global measurement reveals ethics governance is more talk than action

Everyone agrees on AI ethics principles. The problem is nobody operationalises them. I measured ethics governance depth across 2,216 policies from 193 countries and found 99% of variation happens within income groups—not between rich and poor countries.

December 26, 2025

AI, Machine Learning, Systematic Reviews

The 122-Sample Illusion

Why fixed-sample AI screening validation is statistically invalid

A popular validation approach suggests sampling just 122 excluded records to validate 95% sensitivity. This is wrong. The same result gives sensitivity guarantees ranging from 60% to 5% depending on review size—a 12-fold difference nobody talks about.

December 24, 2025

AI, RAG, LLMs

Teaching an AI to Read 400 Papers

Building a RAG-powered Q&A system for fragile state research

Policymakers need answers from thousands of studies. Manual search is slow. Keyword search misses context. Free-form LLMs hallucinate. RAG gives you something in between: grounded synthesis with citations, if you build it right.

December 24, 2025

AI, LLMs, Evidence Synthesis

I Got Tired of Missing Papers

Three papers on LLMs for systematic reviews dropped in the same week. I found out about them a month later. So I built a pipeline that scans academic databases and writes practitioner-focused summaries while I sleep.

December 23, 2025

AI, LLMs, Systematic Reviews

Where AI Actually Helps in Systematic Reviews

A practical map of what works, what's risky, and what's still hype

After two years of experimenting with AI tools across a dozen evidence projects, I've learned what works at each stage of the systematic review pipeline. This is the guide I wish I'd had when I started.

December 22, 2025

RAG, LLMs, Production Systems

From Weekend Prototype to Research Workbench

DevChat began as 'can we chat with a few PDFs?' Six months later, it queries 3ie's entire evidence portal with hybrid retrieval, adaptive synthesis, and proper observability. Here's what I learned building a production RAG system.

December 10, 2025

Synthetic Data, Machine Learning, LLMs

Manufacturing Evidence (Responsibly)

When you need training data for an evidence synthesis classifier but only have 200 labeled examples, synthetic generation becomes attractive. Making it work without producing garbage took real engineering.

November 28, 2025

LLMs, Systematic Reviews, Data Extraction

The Spreadsheet That Filled Itself

Data extraction is the systematic review task nobody warns you about. After manually coding 300 PDFs once, I swore never again. GPT-4 can do it reliably—if you prompt it right.

November 20, 2025

Web Scraping, Python, Development Research

555 PDFs Without Crashing Their Server

FCDO publishes business cases for every development programme—gold for accountability research. The catch: they're scattered across 555 web pages. I wrote a scraper that took three days to run, because being fast would have been rude.

November 15, 2025

Research Mapping, FCAS, AI

3,000 Messy Rows, 800 Real Institutions

University of Khartoum, Univ. Khartoum, جامعة الخرطوم, Khartoum University—same institution, four names. For FCDO's Humanitarian R&D programme, we needed to map research capacity in fragile states. But first we had to clean the data.

November 5, 2025

RAG, FAISS, Sentence Transformers

400 PDFs, One Question

When keyword search fails and manual reading isn't feasible, semantic search changes how you interact with a research corpus. A practical RAG system for evidence synthesis.

November 5, 2025

R, Spatial Analysis, Conflict Research

Where the Evidence Isn't

We kept claiming to have 'research on conflict-affected areas.' But when I overlaid our study locations onto Uppsala's conflict event data, the map told a different story: most research happens in capitals, not combat zones.

October 25, 2025

R Shiny, Systematic Reviews, Google Drive API

A Review Interface for AI-Assisted Screening

When you're building an AI pipeline to screen thousands of studies, you need humans in the loop—but Google Sheets doesn't cut it. A weekend R Shiny app gave us locking, audit trails, and a clean interface for validating what the model finds.

October 25, 2025

React, Statistics Education, Data Visualization

Teaching Statistics Without Losing Students

Every time I explain hypothesis testing, eyes glaze over at the 2×2 matrix. So I built an interactive tool that reveals one quadrant at a time, letting concepts sink in before overwhelming anyone with Type I and Type II errors.

October 15, 2025

Energy, Cost-Benefit Analysis, Climate

The Last Diesel Dollar

A comprehensive cost-benefit analysis of seven energy futures for the Maldives

Every energy transition pathway saves the Maldives billions compared to continued diesel dependence. After a year of modelling, the numbers are in: $2.1–4.4 billion in present-value savings across all alternatives, with 70% renewable energy achievable, but only within a limited window.

June 24, 2025

Evidence-Informed Policy Making, EIPM, FCDO

Why Good Evidence Doesn't Automatically Become Good Policy

We produce more rigorous research than ever before. Yet evidence uptake into policy remains stubbornly low. The EIPM Framework tries to understand why—and what we can do about it.

May 15, 2025

Research Methods, Teaching, Spanish

Writing a Methods Book Nobody Asked For

A comprehensive Spanish-language research methods textbook covering the complete research process—from formulating questions to writing results—built around the mistakes students actually make, not the theory professors think they need.

January 15, 2024

QCA, Health Systems, Complexity

When Regression Isn't Enough

Performance-based financing works sometimes. Understanding why requires moving beyond 'what's the average effect' to 'what combinations of conditions produce success.' Qualitative Comparative Analysis offers a different logic.

October 20, 2020