Where AI Actually Helps in Systematic Reviews
AI · LLMs · Systematic Reviews · Evidence Synthesis · Research Methods


A practical map of what works, what's risky, and what's still hype

December 22, 2025

Code for screening and extraction pipelines is at github.com/lsempe77/paper-screening-pipeline. For a deep dive on data extraction specifically, see The Spreadsheet That Filled Itself.


The Honest Answer

“Can AI do systematic reviews now?”

I get this question constantly. The honest answer is: AI can accelerate parts of the pipeline dramatically, make other parts marginally easier, and will actively hurt you in a few places if you trust it blindly. The nuance matters, and most overviews I’ve seen either oversell (“AI will replace reviewers!”) or undersell (“AI is too error-prone for serious research”).

After using these tools on about a dozen evidence synthesis projects—some successfully, some not—I have a practical sense of where they fit. This is the map I wish someone had given me two years ago.


The systematic review pipeline has roughly twelve stages. At each stage, the question is: what can AI do, what’s the risk, and is it worth the overhead?

Defining the research question. This is fundamentally a human task, but LLMs make useful sparring partners. Describe your topic in plain language and ask Claude or GPT-4 to help structure it as PICO (Population, Intervention, Comparison, Outcome). The model will surface framings you hadn’t considered. It’s brainstorming, not automation. Risk is low because you’re not taking the output as final—you’re using it to think.

Writing the protocol. LLMs can draft protocol sections from templates, especially the boilerplate (data management, dissemination plans). They can check your draft against the PRISMA-P checklist and flag missing elements. I wouldn’t trust an LLM to write the methods unsupervised, but for generating a first draft that humans then refine, it saves time.

Developing the search strategy. This is where I’ve seen both successes and disasters. LLMs are good at generating synonyms and related terms—ask for “all the ways researchers might describe cash transfer programs” and you’ll get a useful list. They’re reasonably good at constructing Boolean syntax. But they hallucinate database-specific operators, invent field codes that don’t exist, and confidently produce searches that look correct but miss swathes of literature. Use them for ideation, then have a librarian or information specialist validate.


Retrieving references. AI doesn’t help here. You’re just running queries against databases. This is plumbing.

Screening titles and abstracts. This is the killer app. Screening 10,000 abstracts manually takes 200-400 person-hours. AI-assisted screening can reduce workload by 50-80% while maintaining high sensitivity, if you set it up correctly.

The key is using AI as a filter, not a decision-maker. Train a model (or use a tool like ASReview) on your first few hundred screened records, then let it prioritize the remaining records by predicted relevance. Screen the high-probability records first. Stop when you hit a threshold of consecutive irrelevant records—by then, you’ve likely seen all the relevant ones.

The risk is false negatives: the AI excludes a relevant study that you never see. This is why you need validation—sample from the AI’s exclusions and check. If you’re finding relevant studies in the excluded pile, your threshold was wrong.
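
To make the prioritize-and-stop loop concrete, here is a minimal sketch in Python. TF-IDF plus logistic regression stands in for a dedicated tool like ASReview, human_decision is a placeholder for your manual include/exclude call, and the threshold of 100 consecutive irrelevant records is illustrative, not a recommendation. It also returns the records the loop never surfaced, so you can sample them for validation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def prioritised_screening(abstracts, seed_labels, stop_after=100):
    # abstracts: list of title+abstract strings
    # seed_labels: dict {index: 0 or 1} from the records you screened by hand
    X = TfidfVectorizer(max_features=20_000, stop_words="english").fit_transform(abstracts)
    labels = dict(seed_labels)
    consecutive_irrelevant = 0

    while len(labels) < len(abstracts) and consecutive_irrelevant < stop_after:
        seen = sorted(labels)
        clf = LogisticRegression(max_iter=1000).fit(X[seen], [labels[i] for i in seen])
        unseen = [i for i in range(len(abstracts)) if i not in labels]
        scores = clf.predict_proba(X[unseen])[:, 1]       # predicted relevance
        nxt = unseen[int(scores.argmax())]                # screen the most promising record next
        labels[nxt] = human_decision(abstracts[nxt])      # placeholder: your manual include/exclude
        consecutive_irrelevant = 0 if labels[nxt] == 1 else consecutive_irrelevant + 1

    never_screened = [i for i in range(len(abstracts)) if i not in labels]
    return labels, never_screened    # keep the unscreened pile for validation sampling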

Retrieving full texts. AI doesn’t help much here either. You’re clicking through library systems and emailing authors. Some tools automate the clicking, but that’s not AI—it’s scraping.


Extracting data. This is my second favorite use case. Extracting structured fields from hundreds of papers is tedious and error-prone. LLMs can do it faster and more consistently, with the right prompting.

The trick is specificity. Don’t ask “extract the metadata”—ask “extract the sample size, specifying whether it’s the enrolled sample, analyzed sample, or intent-to-treat sample, and provide the page number where you found it.” JSON output format, explicit examples for edge cases, and a request for the model’s reasoning all improve accuracy.

I see about 85-90% accuracy on most fields, higher for bibliographic data, lower for interpretive judgments. Use it as a first pass that humans verify.

Critical appraisal. This is risky. Risk-of-bias assessment requires reading between the lines—understanding what “adequate” randomization means, detecting selective reporting, judging whether blinding was plausible. LLMs are inconsistent here. They’ll confidently mark a study as “low risk” when the allocation concealment is ambiguous.

Use LLMs to pre-fill the assessment form with citations to relevant text, but have humans make the final judgment. Don’t automate this stage.


Synthesis and meta-analysis. For narrative synthesis, LLMs can help structure the argument and ensure you’ve addressed all included studies. For the meta-analysis itself, a language model is no substitute for specialized statistical software, though it can help you write and debug the analysis code (more on that in Stage 10).

The risk in narrative synthesis is that the LLM smooths over contradictions. If your studies disagree, a good synthesis highlights and explains the disagreement. An LLM might produce a fluent paragraph that papers over the conflict. Always compare the LLM’s synthesis to your own reading of the primary studies.

Writing and editing. LLMs are excellent editors. Paste in your draft; ask for clarity improvements, jargon reduction, or structural suggestions. I use Claude for this constantly. The output isn’t final copy, but it surfaces issues I’d otherwise miss.

Dissemination. Creating plain-language summaries, policy briefs, social media threads—LLMs handle this well. They’re good at register shifts. A technical finding becomes an accessible explanation. Just verify that the simplification didn’t introduce inaccuracy.


The meta-point is that AI tools are amplifiers, not replacements. They make good reviewers faster. They don’t make bad reviewers good. If you don’t understand methodology, you can’t verify the LLM’s methodology extraction. If you don’t know what a rigorous search strategy looks like, you can’t catch the LLM’s hallucinated operators.

I’ve seen projects where AI tools saved 60% of time with no loss in quality. I’ve also seen projects where over-reliance on AI produced reviews with systematic errors that took months to fix. The difference wasn’t the tools—it was whether the humans using them understood the task well enough to verify the output.

Start with the stages where risk is low (protocol drafting, literature monitoring, writing assistance) before moving to higher-risk stages (screening, extraction). Build confidence in the tools’ failure modes. And never submit anything an LLM produced without human verification.


Stage 1: Define Research Question

Defining a precise research question is the hardest part of any review. AI tools can act as a sparring partner here, helping to translate vague natural language ideas into structured PICO (Population, Intervention, Comparison, Outcome) frameworks. Beyond just refinement, they can help identify existing systematic reviews on the same topic—saving you from duplicating work that’s already been done.

Tools & Resources

Tool                Use Case
ChatGPT/Claude      Brainstorm and refine PICO elements
Consensus.app       Search for existing reviews on your topic
PROSPERO            Check for registered protocols

Example Prompt

I want to study the effects of cash transfer programs on poverty 
reduction in Sub-Saharan Africa. Help me define:
- Population (P)
- Intervention (I)  
- Comparison (C)
- Outcomes (O)

Stage 2: Write Protocol & Register

Once the question is set, writing the protocol ensures transparency and rigor. LLMs are excellent at drafting standard protocol sections based on templates and checking your draft against the PRISMA-P checklist for completeness. However, never let an AI write the methodology unsupervised—use it to generate drafts that you refine and validate carefully.

Tools & Resources

Tool                  Use Case
ChatGPT/Claude        Draft protocol sections
PROSPERO              Protocol registration
PRISMA-P Checklist    Ensure completeness

Best Practice

Always have humans review AI-generated protocol text. Use AI for first drafts, not final versions.


Stage 3: Develop Search Strategy

What AI Can Do

  • Generate search terms and synonyms
  • Build Boolean logic strings
  • Translate searches across databases (PubMed → Scopus → Web of Science)

Tools & Resources

Tool                   Use Case
ChatGPT/Claude         Generate search terms, Boolean strings
Polyglot Search        Translate searches across databases
PubMed MeSH Browser    Identify controlled vocabulary

Example Prompt

Generate a comprehensive search strategy for PubMed to find 
studies on "school feeding programs" and "academic achievement" 
in low-income countries. Include:
- MeSH terms
- Free text synonyms
- Boolean operators
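
Example (Python)

Once you have a candidate string, it helps to sanity-check it against the database before a librarian reviews it. A minimal sketch using PubMed's E-utilities API; the query below is illustrative, not a validated strategy.

import requests

query = (
    '("school feeding"[Title/Abstract] OR "school meal*"[Title/Abstract]) '
    'AND ("academic achievement"[Title/Abstract] OR "educational outcome*"[Title/Abstract])'
)
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmode": "json"},
    timeout=30,
)
result = resp.json()["esearchresult"]
print("PubMed hits:", result["count"])
print("Query as PubMed ran it:", result.get("querytranslation", "n/a"))

A hit count of zero (or of millions) is a cheap early warning that an operator or field tag is wrong before anyone screens a single record.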

Stage 4: Retrieve References

What AI Can Do

  • Automate reference downloads from multiple databases
  • Deduplicate across sources
  • Enrich metadata (add abstracts, DOIs)

Tools & Resources

Tool              Use Case
OpenAlex API      Free, comprehensive reference retrieval
Europe PMC API    Open access full texts
Unpaywall         Find open access versions
ASReview          Import and manage references

Pro Tip

Use OpenAlex’s free API for comprehensive coverage—it includes Crossref, PubMed, and more.
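
Example (Python)

A minimal sketch of pulling candidate records from the OpenAlex works endpoint and deduplicating on DOI. The search string, page limit, and mailto address are placeholders.

import requests

def fetch_openalex(search, per_page=200, max_pages=5, mailto="you@example.org"):
    records, cursor = [], "*"
    for _ in range(max_pages):
        resp = requests.get(
            "https://api.openalex.org/works",
            params={"search": search, "per-page": per_page,
                    "cursor": cursor, "mailto": mailto},
            timeout=30,
        )
        data = resp.json()
        records.extend(data["results"])
        cursor = data["meta"].get("next_cursor")
        if not cursor:
            break
    return records

records = fetch_openalex("cash transfer poverty reduction")
seen, deduped = set(), []
for rec in records:
    doi = (rec.get("doi") or "").lower()
    key = doi or rec["id"]             # fall back to the OpenAlex ID when the DOI is missing
    if key not in seen:
        seen.add(key)
        deduped.append(rec)
print(len(records), "retrieved,", len(deduped), "after DOI dedup")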


Stage 5: Screen Titles & Abstracts ⭐

What AI Can Do

  • Priority screening: Rank references by relevance
  • Dual screening replacement: AI as second reviewer
  • Stopping rules: Determine when to stop screening

Tools & Resources

Tool                Use Case                         AI Model
ASReview            Active learning for screening    Multiple
Rayyan              Collaborative screening          Proprietary
Abstrackr           Machine learning screening       ML-based
Nested Knowledge    AI-assisted screening            Proprietary
Covidence           End-to-end review platform       ML-based

The State of the Art

Recent research shows:

  • AI can reduce screening workload by 36-72% while maintaining sensitivity
  • Dual screening with AI achieves comparable accuracy to human-human
  • Stopping rules help determine when you’ve found enough (see my other post on this!)

Validation is essential! Always validate AI screening on a random sample before full deployment. See my tutorial on Why Fixed-Sample AI Screening Validation Fails.
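
Example (Python)

A minimal sketch of that validation step, assuming you kept the IDs of the records the model excluded or never surfaced. human_screen is a placeholder for your manual check, the 300-record sample is illustrative, and the linked post discusses why a single fixed sample can be too optimistic.

import random

def validate_exclusions(excluded_ids, sample_size=300, seed=42):
    random.seed(seed)
    sample = random.sample(excluded_ids, min(sample_size, len(excluded_ids)))
    relevant = [rid for rid in sample if human_screen(rid)]    # placeholder: your manual check
    miss_rate = len(relevant) / len(sample)
    # "Rule of three": if 0 relevant records turn up in n sampled, a rough 95%
    # upper bound on the miss rate is 3/n.
    upper_bound = 3 / len(sample) if not relevant else None
    return {
        "sampled": len(sample),
        "relevant_found": len(relevant),
        "estimated_missed_in_pile": round(miss_rate * len(excluded_ids), 1),
        "approx_95pct_upper_miss_rate": upper_bound,
    }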


Stage 6: Retrieve Full Texts

What AI Can Do

  • Automatically find PDFs from DOIs and URLs
  • Extract text from PDFs for downstream processing
  • Identify supplementary materials

Tools & Resources

Tool                  Use Case
Unpaywall API         Find open access PDFs
CORE API              Repository full texts
Sci-Hub               Use with caution (ethical considerations)
PyMuPDF/pdfplumber    Extract text from PDFs
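
Example (Python)

A minimal sketch: look up an open-access PDF for a DOI via Unpaywall, download it, and pull plain text with PyMuPDF for downstream extraction. The email address and DOI are placeholders.

import requests
import fitz  # PyMuPDF

def fetch_oa_text(doi, email="you@example.org"):
    meta = requests.get(f"https://api.unpaywall.org/v2/{doi}",
                        params={"email": email}, timeout=30).json()
    loc = meta.get("best_oa_location") or {}
    pdf_url = loc.get("url_for_pdf")
    if not pdf_url:
        return None                       # no open-access PDF found
    pdf_bytes = requests.get(pdf_url, timeout=60).content
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    return "\n".join(page.get_text() for page in doc)

text = fetch_oa_text("10.1234/example.doi")   # placeholder DOI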

Stage 7: Extract Data ⭐

What AI Can Do

  • Extract structured data from full-text articles
  • Identify PICO elements, sample sizes, effect sizes
  • Handle tables and figures

Tools & Resources

Tool              Use Case
GROBID            Structure extraction from PDFs
RobotReviewer     RCT data extraction
MetaBeeAI         Full pipeline for bio-reviews
ChatGPT/Claude    Custom extraction prompts

Example Prompt

Extract the following from this study:
- Study design (RCT, quasi-experimental, observational)
- Sample size (total, treatment, control)
- Country/setting
- Intervention description
- Primary outcomes and effect sizes
- Follow-up duration

Format as JSON.
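
Example (Python)

A minimal sketch of running a prompt like the one above against a single full text with the Anthropic Python SDK and parsing the JSON reply. The model name is an assumption, and the output still needs human verification (see the benchmarks below).

import json
import anthropic

EXTRACTION_PROMPT = """Extract the following from this study:
- Study design (RCT, quasi-experimental, observational)
- Sample size (total, treatment, control)
- Country/setting
- Intervention description
- Primary outcomes and effect sizes
- Follow-up duration

Return only valid JSON. If a field is not reported, use null.

Study text:
{full_text}
"""

def extract_fields(full_text, model="claude-sonnet-4-20250514"):   # model name is an assumption
    client = anthropic.Anthropic()           # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model=model,
        max_tokens=1500,
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(full_text=full_text)}],
    )
    return json.loads(msg.content[0].text)   # fails loudly if the reply is not valid JSON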

Accuracy Benchmarks

Recent evaluations show:

  • Data extraction: >84% accuracy, F1 >90%
  • Study characteristics: Generally reliable
  • Effect sizes: Requires verification

Stage 8: Critical Appraisal

What AI Can Do

  • Assess risk of bias using standard tools (RoB 2, ROBINS-I)
  • Flag potential quality issues
  • Suggest domain-specific concerns

Tools & Resources

Tool              Use Case
RobotReviewer     Automated RoB assessment
ChatGPT/Claude    RoB domain evaluation
Custom prompts    Tool-specific assessment

Current Limitations

  • Interpretive tasks are challenging for AI
  • RoB assessment often requires judgment, not just extraction
  • Recommendation: Use AI as a first pass to locate relevant text (see the example prompt below); human verification is essential
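
Example Prompt

An illustrative prompt that keeps the model in a first-pass, evidence-locating role; adapt the domain wording to whichever RoB tool you use.

For the RoB 2 domain "bias arising from the randomization process":
- Quote the exact sentences from the study that describe sequence 
  generation and allocation concealment, with page numbers.
- Do NOT assign a judgment. List what is reported and what is 
  missing so a human reviewer can make the call.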

Stage 9: Synthesis of Results

What AI Can Do

  • Summarize findings across studies
  • Identify patterns and themes
  • Generate narrative synthesis drafts

Tools & Resources

Tool                  Use Case
ChatGPT/Claude        Narrative synthesis drafts
Custom RAG systems    Grounded synthesis from your evidence
Elicit                Research assistant for synthesis

Best Practice

Build a Retrieval-Augmented Generation (RAG) system for synthesis. This grounds the LLM in your actual evidence, reducing hallucinations. See my tutorial on building a RAG system for research.
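
Example (Python)

A minimal sketch of the retrieval half of a RAG setup: embed one summary per included study, pull the closest matches for a synthesis question, and pass only those to the LLM. The sentence-transformers library and the model name are assumptions; any embedding model works.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # model choice is an assumption

def build_index(study_summaries):
    # study_summaries: list of strings, one per included study
    vecs = embedder.encode(study_summaries, normalize_embeddings=True)
    return np.asarray(vecs)

def retrieve(question, study_summaries, index, k=5):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q                        # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [study_summaries[i] for i in top]

# The retrieved summaries then go into the synthesis prompt, e.g.:
# "Using ONLY the studies below, summarise the evidence on X, noting disagreements."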


Stage 10: Meta-Analysis (Optional)

What AI Can Do

  • Suggest appropriate statistical methods
  • Help with R/Python code for meta-analysis
  • Interpret heterogeneity and sensitivity analyses

Tools & Resources

Tool                        Use Case
R (meta, metafor)           Statistical analysis
Python (PythonMeta)         Alternative to R
ChatGPT Code Interpreter    Generate and debug analysis code

Example Prompt

I have effect sizes (standardized mean differences) from 15 RCTs 
on educational interventions. Help me:
1. Conduct a random-effects meta-analysis in R
2. Create a forest plot
3. Assess heterogeneity (I², Q-statistic)
4. Run publication bias tests (funnel plot, Egger's test)
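
Example (Python)

If you want to sanity-check whatever code the model produces, the DerSimonian-Laird random-effects estimate is short enough to compute by hand. A minimal numpy sketch; in practice, metafor in R or the packages above are the better choice. yi are effect sizes, vi their variances.

import numpy as np

def dersimonian_laird(yi, vi):
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    k = len(yi)
    w = 1 / vi                                         # fixed-effect weights
    mu_fe = np.sum(w * yi) / np.sum(w)
    Q = np.sum(w * (yi - mu_fe) ** 2)                  # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)                 # between-study variance
    w_re = 1 / (vi + tau2)                             # random-effects weights
    mu_re = np.sum(w_re * yi) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    i2 = max(0.0, (Q - (k - 1)) / Q) * 100 if Q > 0 else 0.0
    return {"estimate": mu_re,
            "ci95": (mu_re - 1.96 * se, mu_re + 1.96 * se),
            "tau2": tau2, "I2_percent": i2, "Q": Q}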

Stage 11: Write Results

What AI Can Do

  • Draft methods and results sections
  • Generate PRISMA flow diagrams
  • Format references consistently

Tools & Resources

Tool               Use Case
ChatGPT/Claude     Draft writing
PRISMA 2020        Flow diagram generator
Zotero/Mendeley    Reference management

Stage 12: Dissemination

What AI Can Do

  • Generate plain-language summaries
  • Create social media content
  • Translate findings for different audiences

Tools & Resources

Tool                  Use Case
ChatGPT/Claude        Plain-language summaries
Podcast generation    Audio summaries (experimental)
Infographic tools     Visual summaries

Summary: AI Readiness by Stage

Stage                  AI Readiness    Recommendation
Define Question        🟢 High         Use freely with human review
Write Protocol         🟡 Medium       Draft assistance only
Search Strategy        🟢 High         Excellent for term generation
Retrieve References    🟢 High         Fully automatable
Screen Abstracts       🟢 High         Major time-saver, validate!
Retrieve Full Texts    🟢 High         Fully automatable
Extract Data           🟡 Medium       Good accuracy, verify numbers
Critical Appraisal     🔴 Low          Human judgment essential
Synthesis              🟡 Medium       Use RAG for grounding
Meta-Analysis          🟡 Medium       Code assistance helpful
Write Results          🟡 Medium       Draft assistance only
Dissemination          🟢 High         Excellent for summaries

The Short Version

Screening is where AI actually saves time—50-70% workload reduction if you validate properly. Extraction works but verify the numbers. Critical appraisal still needs human judgment. For synthesis, use RAG or you’ll get hallucinations. And validate everything, because AI errors propagate.


Resources

The screening and extraction pipeline code is at github.com/lsempe77/paper-screening-pipeline. For a deep dive on the extraction stage, see The Spreadsheet That Filled Itself.

This guide is based on my experience leading systematic reviews at 3ie and developing AI tools for evidence synthesis. For questions or collaboration, please reach out.