How the Psychedelics Knowledge Graph is assembled: what literature is
searched, how papers are screened, how evidence is extracted, and how
the graph is maintained.
Why
We are entering the era of agentic science, where AI agents can support
many steps of knowledge generation. This creates an opportunity to address
long-standing inefficiencies in academic workflows: findings are
scattered across papers, evidence is split across disciplinary silos,
and reviews require substantial expert labor, become static snapshots
once published, and are difficult to query or reuse. As a result, human
researchers struggle to see relationships, trends, gaps, and the overall
shape of a field, while AI agents cannot easily use the evidence
directly.
Agentic tools make it possible to build a new type of living evidence
system, where literature discovery, screening, extraction,
visualization, and updating are part of a continuous workflow.
Findings can be converted into structured, provenance-rich records
that remain linked to their source papers and evidence locators, so
the evidence base can be searched, corrected, reused, and extended as
the literature changes.
The Psychedelics Knowledge Graph applies this model to psychedelic
research, where a fast-growing literature spans clinical, mechanistic,
and translational work that is often read separately. Screened papers
are converted into structured evidence records that power an
interactive graph and dashboard. Human researchers can move from
field-level patterns and gaps to source studies visually, while agents
and analytic tools can query the same provenance-rich evidence
directly.
Pipeline Overview
The workflow is designed to keep literature discovery broad while
making graph inclusion conservative. Searches cast a wide net across
clinical, biological, brain, behavioral, subjective, treatment-context,
and real-world evidence. Each paper is then interpreted according to
what it actually contains and how much source text is available, so
the graph shows evidence that can be traced back to specific papers.
Define the Evidence Scope
The project starts with explicit vocabularies for psychedelic
compounds and evidence domains: molecular targets, molecular
pathways and cellular readouts, brain systems, cognitive and
behavioral function, subjective experience, pharmacokinetics and
exposure, intervention context, real-world use, clinical
outcomes, functioning, and safety. These vocabularies define
what the graph can represent and what the search needs to cover.
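As a minimal sketch of how a vocabulary can double as a search checklist, the snippet below lists the evidence domains named above and reports any domain that no search module covers. The module dictionaries and the `uncovered_domains` helper are hypothetical illustrations, not the project's actual data model.

```python
# Evidence domains as listed in the scope description above.
EVIDENCE_DOMAINS = (
    "molecular targets",
    "molecular pathways and cellular readouts",
    "brain systems",
    "cognitive and behavioral function",
    "subjective experience",
    "pharmacokinetics and exposure",
    "intervention context",
    "real-world use",
    "clinical outcomes",
    "functioning",
    "safety",
)


def uncovered_domains(search_modules, domains=EVIDENCE_DOMAINS):
    """Return the domains that no search module claims to cover.

    Each module is assumed (hypothetically) to be a dict with a
    "domains" list naming the evidence domains it targets.
    """
    covered = {d for module in search_modules for d in module["domains"]}
    return [d for d in domains if d not in covered]


# Hypothetical example: a single module covering one domain.
modules = [{"name": "serotonin receptors", "domains": ["molecular targets"]}]
missing = uncovered_domains(modules)
```

A check like this makes gaps between what the graph can represent and what the search covers visible before any query is run.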
Discover Candidate Papers
PubMed and OpenAlex searches combine broad domain queries,
focused compound-topic queries, and supplementary direct-pair
checks. Results are matched by DOI and merged into a single
paper library. Metadata enrichment uses different sources
for different needs: PubMed, PMC, OpenAlex, Crossref, and
Semantic Scholar for bibliographic records and abstracts;
PubMed for publication-type labels; and Unpaywall, OpenAlex,
and PMC for open-access full-text or PDF links.
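The DOI matching step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the record fields (`doi`, `title`, `abstract`, `source`) and both function names are hypothetical. It relies on DOIs being case-insensitive and on some providers returning them as full `https://doi.org/` URLs.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip() or None


def merge_by_doi(*result_sets):
    """Merge provider result lists into one library keyed by normalized DOI.

    The first non-empty value seen for a field wins; the providers that
    returned the paper are collected under "sources".
    """
    library = {}
    for records in result_sets:
        for rec in records:
            doi = normalize_doi(rec.get("doi"))
            if doi is None:
                continue  # records without a DOI would need another matching rule
            merged = library.setdefault(doi, {"doi": doi, "sources": []})
            for key, value in rec.items():
                if key in ("doi", "source"):
                    continue
                if value and not merged.get(key):
                    merged[key] = value
            source = rec.get("source")
            if source and source not in merged["sources"]:
                merged["sources"].append(source)
    return library


# Hypothetical records for the same paper from two providers.
pubmed = [{"doi": "10.1000/ABC", "title": "Example title", "source": "pubmed"}]
openalex = [{"doi": "https://doi.org/10.1000/abc", "abstract": "Example abstract", "source": "openalex"}]
library = merge_by_doi(pubmed, openalex)
```

Because matching happens at merge time, duplicates never enter the library as separate records.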
Screen and Route
Candidate papers are screened for clear psychedelic relevance
using their titles and abstracts. Papers that remain in scope
are routed by evidence domain, publication type, and available
source text. This separates, for example, primary studies from
reviews and meta-analyses, and lets the extraction step use
different expectations for full-text and abstract-based
evidence.
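A minimal sketch of the routing idea, assuming a hypothetical `Paper` record and hypothetical route labels. The publication-type strings are examples only, and routing by evidence domain is left out to keep the sketch short.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed labels that mark a paper as secondary literature.
REVIEW_TYPES = {"review", "systematic review", "meta-analysis"}


@dataclass
class Paper:
    title: str
    publication_types: list[str]
    has_full_text: bool
    in_scope: bool  # outcome of title/abstract relevance screening


def route(paper: Paper) -> str | None:
    """Return a route label, or None if the paper was screened out."""
    if not paper.in_scope:
        return None
    kinds = {t.lower() for t in paper.publication_types}
    study = "review" if kinds & REVIEW_TYPES else "primary"
    text = "full_text" if paper.has_full_text else "abstract_only"
    return f"{study}/{text}"


trial = Paper("Example trial", ["Randomized Controlled Trial"], True, True)
review = Paper("Example review", ["Review"], False, True)
```

Each route label can then select its own extraction instructions, with stricter expectations for abstract-only routes.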
Extract Structured Evidence
Eligible papers are processed with LLM-based, route-specific
extraction instructions. The model identifies candidate
structured evidence: compounds, evidence domains, study type,
assay or outcome details, result direction, and source
locators. When PDFs are available, they are first converted
with GROBID into structured TEI full-text artifacts so the
extraction step can use article sections, tables, figures, and
references as auditable evidence anchors. Abstract-only records
are handled more conservatively because they expose less of the
underlying evidence.
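To illustrate how TEI output can supply evidence anchors, the sketch below walks the body of a TEI document and emits one locator per paragraph. The TEI namespace is the standard one; the `secN.pM` locator format, the `section_anchors` helper, and the sample document are assumptions for illustration. Real GROBID output also carries headers, tables, figures, and references that this sketch ignores.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def section_anchors(tei_xml):
    """Return one anchor per body paragraph: locator, section title, text."""
    root = ET.fromstring(tei_xml)
    anchors = []
    divs = root.findall(".//tei:body/tei:div", TEI_NS)
    for s_idx, div in enumerate(divs, start=1):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        for p_idx, para in enumerate(div.findall("tei:p", TEI_NS), start=1):
            # Collapse whitespace so quotes can be compared reliably later.
            text = " ".join("".join(para.itertext()).split())
            anchors.append(
                {"locator": f"sec{s_idx}.p{p_idx}", "section": title, "text": text}
            )
    return anchors


# Invented sample in TEI shape, not taken from any real paper.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Methods</head><p>Participants received a single dose.</p></div>
    <div><head>Results</head>
      <p>Scores decreased at week 3.</p>
      <p>No serious adverse events occurred.</p>
    </div>
  </body></text>
</TEI>"""

anchors = section_anchors(SAMPLE_TEI)
```

An extracted record can then cite a locator such as `sec2.p1`, which a reviewer can resolve back to the exact paragraph.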
Validate and Publish
Extracted evidence is checked for completeness, consistency,
and source support before it appears in the public graph.
Accepted records become graph relationships linking psychedelic
compounds to targets, pathways, brain systems, tasks, clinical
outcomes, safety outcomes, and study contexts. Records that are
ambiguous or insufficiently supported are held back for review.
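The three kinds of check named above (completeness, consistency, source support) might look like the sketch below. The field names, the allowed direction values, and the rule that a supporting quote must appear verbatim in the source text are all assumptions made for illustration; the project's actual evidence-policy rules are not described here.

```python
# Hypothetical required fields and allowed values.
REQUIRED_FIELDS = ("compound", "domain", "study_type", "direction", "locator", "quote")
DIRECTIONS = {"increase", "decrease", "no_change", "mixed"}


def _squash(text):
    """Normalize whitespace and case for a tolerant substring comparison."""
    return " ".join(text.split()).lower()


def validate(record, source_text):
    """Return (status, problems) for one extracted evidence record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing:{name}")
    # Consistency: direction must come from the controlled list.
    direction = record.get("direction")
    if direction and direction not in DIRECTIONS:
        problems.append("invalid:direction")
    # Source support: the quoted evidence must occur in the source text.
    quote = record.get("quote")
    if quote and _squash(quote) not in _squash(source_text):
        problems.append("unsupported:quote")
    status = "accepted" if not problems else "held_for_review"
    return status, problems


source = "Scores decreased at week 3.  No serious adverse events occurred."
good = {
    "compound": "psilocybin", "domain": "clinical outcomes",
    "study_type": "randomized trial", "direction": "decrease",
    "locator": "sec2.p1", "quote": "scores decreased at week 3",
}
bad = dict(good, direction="down", quote="scores increased")
```

Records that fail any check are held for review rather than dropped, matching the conservative inclusion policy described above.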
Maintain the Living Graph
The graph is designed to evolve as the literature grows. New
papers, corrected metadata, improved extraction, and community
feedback can all update the evidence base. Each public build is
versioned, and release notes summarize what changed so
readers can understand how the graph evolves over time.
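One simple way to summarize what changed between two builds is to compare their records by identifier. The sketch assumes each build is a dictionary mapping a stable record ID to its record; the IDs and the `release_notes` helper are hypothetical.

```python
def release_notes(previous, current):
    """Summarize added, removed, and changed record IDs between two builds."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical builds with three records in total.
build_1 = {"rec-001": {"direction": "decrease"}, "rec-002": {"direction": "mixed"}}
build_2 = {"rec-001": {"direction": "decrease"}, "rec-002": {"direction": "increase"},
           "rec-003": {"direction": "no_change"}}
notes = release_notes(build_1, build_2)
```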
Literature Search Strategy
The search is organized around evidence domains: molecular targets,
molecular pathways and cellular readouts, brain systems, cognitive
and behavioral functions, subjective experience, pharmacokinetics and
exposure, intervention delivery context, real-world use and public
health, clinical outcomes and safety, and clinical studies that
measure biological or behavioral endpoints.
PubMed is used for curated biomedical indexing, and OpenAlex is
used for broader scholarly coverage across journals, books, and
preprints. Searches in both sources use the same three-block
structure: compound terms, domain-specific entity or outcome terms,
and evidence-context terms. Terms inside each block are joined with
OR; the blocks are
joined with AND. Broad modules cover domain families, while focused
modules target well-studied compound-topic combinations so that
important papers are not captured only through broad queries.
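The three-block structure can be sketched as a small query builder: OR inside each block, AND between blocks. The function names and example terms are illustrative, and the output is a generic Boolean string; real PubMed or OpenAlex queries would need provider-specific syntax such as field tags or filter parameters.

```python
def or_block(terms):
    """Join terms with OR, quoting multi-word phrases, and wrap in parentheses."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(compound_terms, entity_terms, context_terms):
    """Combine the three blocks with AND."""
    blocks = (compound_terms, entity_terms, context_terms)
    return " AND ".join(or_block(block) for block in blocks)


query = build_query(
    ["psilocybin", "LSD"],
    ["5-HT2A", "serotonin receptor"],
    ["binding", "affinity"],
)
print(query)
```

A paper matches only if it mentions at least one term from every block, which keeps broad compound terms from pulling in unrelated literature.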
Molecular targets (10 grouped term combinations)
Broad target-family modules
Modules: serotonin receptors; monoamine transporters; glutamate/NMDA/AMPA/mGluR2 targets; opioid, sigma, and TAAR targets; plasticity, TrkB, and BDNF target evidence
Compound block: classic psychedelic, entactogen, dissociative, psychoplastogen, and compound-specific terms including psilocybin, psilocin, LSD, DMT, 5-MeO-DMT, mescaline, MDMA, ketamine, salvinorin A, ibogaine, and noribogaine
Entity block: 5-HT receptor families; SERT, DAT, NET, and VMAT2; NMDA, AMPA, and mGluR2 receptors; kappa and mu opioid receptors; sigma-1 receptor; TAAR1; TrkB, BDNF, and neuroplasticity targets
compound-specific network, circuit, imaging, receptor-occupancy, neural-dynamics, and electrophysiology terms
Cognitive and behavioral function (4 grouped term combinations)
Broad task and translational behavior modules
Modules: cognitive and affective task domains; translational behavioral assays
Entity block: cognitive flexibility; reversal learning; set shifting; fear conditioning; fear extinction; reward learning; social reward; social cognition; empathy; emotion recognition; attention; impulsivity; prepulse inhibition; working memory; forced swim; tail suspension; sucrose preference; social defeat; elevated plus maze; conditioned place preference; self-administration; relapse; head-twitch response
MDMA social reward and cognition; psychedelic fear extinction and flexibility
Query emphasis: compound-specific social reward, social cognition, empathy, emotion recognition, fear extinction, reversal learning, learning, conditioning, and performance terms
Subjective experience and pharmacokinetics (4 grouped term combinations)
Subjective experience and acute-effect modules
Modules: acute subjective effects and phenomenology; subjective-effect measures
epidemiology, surveys, naturalistic use, lifetime or past-year use, drug checking, poison-control and emergency records, toxicity, harm reduction; focused naturalistic, community, retreat, and microdosing use
observational; population; survey; cohort; registry; case series; risk; safety; mental health; wellbeing; adverse event; intoxication; exposure; public health
Clinical outcomes, symptoms, functioning, and safety (17 grouped term combinations)
Broad clinical outcome modules
Modules: clinical class core; depression spectrum; PTSD and trauma; substance use and addiction; anxiety, distress, and palliative care; pain, headache, and migraine; OCD, eating disorders, and autism
Clinical block: depression; major depressive disorder; treatment-resistant depression; PTSD; substance use disorder; alcohol, tobacco, opioid, cocaine, methamphetamine, stimulant, and cannabis use disorders; generalized and social anxiety; distress associated with life-threatening disease; OCD; eating disorders; autism spectrum disorder; headache disorders; migraine; chronic pain; fibromyalgia
compound-condition combinations paired with imaging, connectivity, BDNF, cortisol, inflammation, cytokines, EEG, cognitive function, social cognition, emotion recognition, and brain-region terms
Supplementary targeted direct-pair searches (very many pair combinations)
The grouped domain searches are the primary discovery
instrument. Direct-pair searches are used as a supplementary
layer for selected compound-entity and compound-outcome
combinations: larger generated grids were run for bounded target
and clinical vocabularies, while later domain additions used
targeted pair checks rather than an exhaustive cross-product of
every possible compound and concept.
Pair layer: Molecular target grid
Pair space: Canonical compounds paired with the molecular target vocabulary
How it was used: 1,840 compound-target pairs were run as 5,520 binding, receptor-pharmacology, and functional-assay searches.
Pair layer: Clinical evidence grid
Pair space: Canonical compounds paired with the clinical evidence vocabulary
How it was used: 1,240 compound-clinical evidence pairs were run as 3,717 clinical-trial, randomized/placebo, and treatment-outcome searches.
Pair layer: Brain, network, and task pair files
Pair space: Canonical compounds paired with brain-region/network, circuit, and cognitive-behavioral task concepts
How it was used: Generated as versioned search artifacts for 3,600 newly added brain/network/task pairs, but not run as a complete direct-pair discovery search in this build.
Pair layer: Additional direct-pair check
Pair space: Selected sparse molecular, brain/network, cognitive-behavioral, symptom, functioning, and safety combinations
How it was used: 62 selected pair searches were run after the additional domain searches: 41 molecular/brain/cognitive/pathway searches and 21 clinical symptom/function/safety searches.
{compound} {condition or outcome} randomized placebo; {compound} {condition or outcome} treatment outcome; {compound} {condition or outcome} clinical trial
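A direct-pair grid of this kind can be generated by crossing compounds with concepts and expanding each pair through a few query templates. The sketch below reuses the three clinical templates shown above, with `{concept}` standing in for "condition or outcome". The function name and the example terms are hypothetical. With three templates per pair, 1,840 pairs yield the 5,520 target searches reported above.

```python
from itertools import product

# The three clinical query templates, with {concept} for condition or outcome.
CLINICAL_TEMPLATES = (
    "{compound} {concept} randomized placebo",
    "{compound} {concept} treatment outcome",
    "{compound} {concept} clinical trial",
)


def pair_queries(compounds, concepts, templates=CLINICAL_TEMPLATES):
    """Expand every compound-concept pair through every template."""
    return [
        template.format(compound=compound, concept=concept)
        for compound, concept in product(compounds, concepts)
        for template in templates
    ]


# Hypothetical 2 x 2 grid: 4 pairs, 12 searches.
queries = pair_queries(["psilocybin", "MDMA"], ["depression", "PTSD"])
```

Because the number of searches grows with the product of both vocabularies, an exhaustive grid is practical only for bounded vocabularies, which is why later domains use selected pair checks instead.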
PRISMA Flow Diagram
Evidence synthesis needs a visible record of what entered the
review and what happened next. This PRISMA-style flow follows
candidate papers from discovery through screening,
full-text access, conversion, and inclusion. Side boxes show the
current reasons papers leave or pause the full-text path. Because
duplicate records are identified before papers are added, there
is no separate duplicate-removal step in this diagram.
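The counts behind such a flow can be derived from one status per paper. In this sketch the stage names follow the path described above, while the `last_stage` and `exit_reason` fields, the reason strings, and the `flow_counts` helper are hypothetical.

```python
from collections import Counter

# Ordered stages of the main path; names are illustrative.
STAGES = ("discovered", "screened_in", "full_text_available", "converted", "included")


def flow_counts(papers):
    """Count papers reaching each stage and tally why others stopped.

    Each paper is a dict with "last_stage" (the furthest stage reached)
    and, if it stopped early, an optional "exit_reason" for the side box.
    """
    reached = Counter()
    exits = Counter()
    for paper in papers:
        last = paper["last_stage"]
        # A paper that reached a stage also passed every earlier stage.
        for stage in STAGES[: STAGES.index(last) + 1]:
            reached[stage] += 1
        if last != STAGES[-1]:
            exits[(last, paper.get("exit_reason", "unspecified"))] += 1
    return reached, exits


papers = [
    {"last_stage": "included"},
    {"last_stage": "discovered", "exit_reason": "out of scope"},
    {"last_stage": "screened_in", "exit_reason": "no open-access full text"},
]
reached, exits = flow_counts(papers)
```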
Limits, Updates, and Reuse
Coverage depends on the search vocabulary, source indexing, DOI
metadata, and access to full text. LLMs help scale screening and
extraction, but their outputs stay tied to reviewable records.
Evidence labels and risk-of-bias notes are descriptive unless a
formal certainty or risk-of-bias workflow is added.
Validation
Evidence records are checked against the expected data format and
evidence-policy rules before they are included in the public
graph.
Known Limits
Search terms, provider indexing, missing abstracts, access
restrictions, and PDF conversion quality all shape coverage.
Living Evidence
New searches can be scheduled, new records can be screened, and
the public graph can be rebuilt when accepted evidence changes.
Data Boundary
The public graph publishes identifiers, metadata, structured
evidence, provenance, and project-written summaries. Source
texts remain governed by article licenses and copyright.