Civic Sample - Clinical Trials Demographics Dashboard

Total Studies

Report Race

Report Ethnicity

Report Both

Reporting Trends Over Time

Dashed 2017 marker: the FDAAA Final Rule (42 CFR Part 11) took effect January 18, 2017, clarifying and enforcing sponsors’ obligation to report trial results — see the FAQ entry on FDAAA for the policy history.

Total Participants with Reported Race Data

Shows the total number of participants with explicitly reported race data per year (excludes "Unknown" and studies without race data).

Full Distribution with Data Quality

Proportion of total enrollment by category, distinguishing between explicitly unknown data and implicit missing (not reported) data.

Race Distribution (NIH/OMB Categories)

Race Over Time

Proportion among studies that reported race (excludes missing and unknown).

Total Participants with Reported Ethnicity Data

Shows the total number of participants with explicitly reported ethnicity data per year (excludes "Unknown" and studies without ethnicity data).

Full Distribution with Data Quality

Proportion of total enrollment by category, distinguishing between explicitly unknown data and implicit missing (not reported) data.

Ethnicity Distribution

Note: The "Unknown or Not Reported" category is large because many studies do not collect ethnicity data or participants decline to report. This reflects limitations in data collection practices across clinical trials.

Ethnicity Over Time

Proportion among studies that reported ethnicity (excludes missing and unknown).

Disclaimer: This is currently an unstructured approach to wrangling this data, presented for demonstration purposes. We are refining this approach and expect to have a preprint detailing our methodology available by June 2026.

Total Participants with Reported Sex Data

Shows the total number of participants with explicitly reported sex data per year (excludes "Unknown" and studies without sex data).

Full Distribution with Data Quality

Proportion of total enrollment by category, distinguishing between explicitly unknown data and implicit missing (not reported) data.

Sex Distribution

Sex Ratio Over Time

Proportion among studies that reported sex (excludes missing and unknown).

Total Participants with Reported Gender Data

Shows the total number of participants with explicitly reported gender identity data per year (excludes "Unknown" and studies without gender data).

Full Distribution with Data Quality

Proportion of total enrollment by category, distinguishing between explicitly unknown data and implicit missing (not reported) data.

Gender Distribution

Note: Gender identity is rarely reported separately from biological sex in clinical trials.

Gender Over Time

Proportion of reported gender identities across years (excludes missing and unknown).


Map Layer:

Total Trials

Single-site

Multi-site

Location Not Reported

Trials by US State

Click a state to see city-level breakdown. Darker colors indicate higher values.


Regional Distribution

Trials by US Census Region

Site Distribution

Breakdown of trials by number of sites.

Geographic Reporting Over Time

Percentage of trials reporting location data by year.

Trials of FDA-Regulated Products

FDA-Regulated Drug Trials

FDA-Regulated Device Trials

No FDA-Regulated Product

A trial counts as FDA-regulated when its sponsor reports studying a drug or device product subject to U.S. FDA oversight — the product is on the regulatory pathway, typically under an IND (drugs) or IDE (devices). Trials with no FDA-regulated product are registered studies whose intervention sits outside FDA product jurisdiction: behavioral programs, procedures, products not bound for the U.S. market. Within device trials, the unapproved device flag marks pre-market evidence generation — the device has not yet been approved or cleared by the FDA. Comparing demographic reporting across these classes shows where the regulatory pathway is and isn't producing representative evidence.

Sex, Race & Ethnicity Reporting by Regulatory Status

Share of trials in each regulatory class that report each demographic. Classes are mutually exclusive; group sizes are in the axis labels.

FDA-Authorized AI/ML-Enabled Medical Devices

Data source: FDA Artificial Intelligence-Enabled Medical Devices · Last updated March 5, 2026

Everything on this page holds FDA marketing authorization — these are products that completed a premarket pathway, not registered trials. Devices arrive by three routes: 510(k) clearance (substantial equivalence to an already-marketed predicate device), De Novo (novel devices of low-to-moderate risk without a predicate), and PMA (full premarket approval on clinical evidence, the most stringent review). AI/ML is a young regulatory category — authorizations climbed from a handful per year before 2017 to hundreds per year now, and the overwhelming majority arrive through the 510(k) predicate route rather than PMA-level clinical review. The FDA does not systematically publish demographic data for these devices' validation studies; the (Beta) AI Demographic Extraction tab mines the public 510(k)/De Novo/PMA summary PDFs for what was reported, and demographic results for these devices will populate here in a future update.

Devices by Medical Panel

Marketing Authorizations per Year, by Pathway

Markers: FDA’s 2017 Digital Health Innovation Action Plan and the 2019 proposed regulatory framework for AI/ML-based software as a medical device (SaMD).

De Novo & PMA Authorizations per Year

The non-510(k) slice at its own scale — next to the 510(k) volume above, these pathways read as invisible slivers. This is the small set of AI/ML devices whose authorization carried novel-device (De Novo) or full clinical (PMA) review.

Markers: FDA’s 2017 Digital Health Innovation Action Plan and the 2019 proposed regulatory framework for AI/ML-based software as a medical device (SaMD).


Date of Final Decision	Submission #	Pathway	Device	Company	Panel	Product Code	Summary Document

(Beta) AI Demographic Extraction

Who was in the validation studies behind the products on the AI Devices tab? The FDA does not publish that systematically — so this pipeline has Claude read each device's public 510(k)/De Novo/PMA summary PDF and extract the demographics of its clinical validation, surfacing information about these authorized devices that is not available anywhere else on the dashboard. Where a matched manuscript exists, its numbers are shown alongside the FDA summary's for the same device.

Every document is run through three Claude models independently — agreement between them is a first signal of extraction reliability, and the pilot comparison below tracks what a full-scale run would cost. Columns for socioeconomic status, disability, and household measures are already reserved in the schema; the extraction prompts for those fields are being finalized and will populate in a future update. Everything on this tab is raw model output shown before curator review — confirmed values graduate through the Approval Queue.

Pilot Run — Three Models Compared

Each document is processed through all three models. Compare cost, speed, and extraction quality side by side.


Documents Processed — successfully extracted

Pages Processed — total pages across docs

Remaining Docs — to be processed

Demographic Reporting Frequency

Extracted FDA Demographics (Pilot Sample)

Fields marked Not Reported highlight the gap in FDA demographic disclosure. The Agree column indicates whether all 3 models returned the same value.

View model:

Each demographic cell is split FDA Summary (top) vs. Matched Manuscript (bottom). Manuscripts are linked by FDA Submission #.


Submission #	Date of Final Decision	Device Name	Details	Panel	Matched Manuscript	Total Participants	Age	Sex	Gender	Race (NIH/OMB)	Ethnicity	Geography	SES — Income / Education / Wealth	Disability	Household

(Beta) Paper Data Extraction

Published manuscripts often report more demographic detail than sponsors enter into the registry. This pipeline links each open-access trial publication (via Unpaywall) to its ClinicalTrials.gov record by NCT ID, has Claude extract the paper's demographic tables, and compares the two sources field by field — confirming what the registry already says, adding what the paper reports but the registry lacks, or flagging conflicts where they disagree.

Every manuscript is run through three Claude models independently — agreement between them is a first signal of extraction reliability, and the pilot comparison below tracks what a full-scale run would cost. Columns for socioeconomic status, disability, and household measures are already reserved in the schema; the extraction prompts for those fields are being finalized and will populate in a future update. Everything on this tab is raw model output shown before curator review — confirmed additions and resolved conflicts graduate through the Approval Queue.

Pilot Run — Three Models Compared

Each manuscript is processed through all three models. Compare cost, speed, and extraction quality side by side.


Documents Processed — successfully extracted

Pages Processed — total pages across docs

Remaining Docs — to be processed

Manuscript Discrepancy Engine (Pilot Sample)

For each manuscript, the LLM-extracted values are compared against the ClinicalTrials.gov record for the linked NCT ID. Each field is flagged as a Match, an Addition, or a Conflict. Curators confirm or deny additions and conflicts; the action is gated by a curator password.

View model:


Linked NCT ID	Date Results Reported	Trial Name	Details	Manuscript Title / DOI	Total Participants	Age	Sex	Gender	Race (NIH/OMB)	Ethnicity	Geography	SES — Income / Education / Wealth	Disability	Household

(Beta) Approval Queue

The last gate before AI-extracted values reach the public dashboard. Records produced by the two extraction pipelines — FDA summary-PDF demographics and manuscript comparisons — land here for a curator to approve or reject one by one; nothing extracted by a model is published without a human decision. Decisions are recorded in a local, in-session ledger — the source extraction files are never modified.

Industry Sponsor Representation

Female enrollment share across the top-10 industry sponsors, over the mixed-sex, sex-reporting interventional cohort (primary completion 2009 or later, not terminated).

Trials / cell –

Annual median within-trial percent female by primary completion year, per selected sponsor. Years with fewer than 5 trials for a sponsor are left blank; the dashed grey line is the pooled industry median and the dotted line marks 50% parity. Respects the global Year Range (results posted) and Condition filters.

Percent female is within-trial female / (female + male); explicitly reported "Unknown" counts stay out of the denominator and are never treated as missing. Sponsor assignment follows the AACT lead-sponsor-first contract with industry collaborators as fallback (the Role toggle restricts to lead-sponsored trials); companies are grouped with the analysis pipeline's canonical map. On the Race and Ethnicity tiers, a category's share is computed over the trial's explicitly reported categories on the same principle, and the Adjusted Differences view models the selected category's balance vs White (Hispanic vs Not Hispanic for ethnicity). Gated beta — descriptive views respond to the global Year Range and Condition filters above and to the Role and Scope toggles; the adjusted differences are model estimates on the full cohort for the selected Role.
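As a minimal sketch of the within-trial calculation described above (the function name, parameter names, and sample counts are hypothetical, not the dashboard's actual code):

```python
from statistics import median
from typing import Optional


def percent_female(female: int, male: int) -> Optional[float]:
    """Within-trial percent female: female / (female + male).

    Explicitly reported "Unknown" counts are not passed in,
    so they stay out of the denominator.
    """
    denominator = female + male
    if denominator == 0:
        return None  # no sex-specific counts to compute a share from
    return 100.0 * female / denominator


# Hypothetical trials as (female, male, unknown); unknown is ignored.
trials = [(60, 40, 5), (30, 70, 0), (45, 55, 10)]
shares = [percent_female(f, m) for f, m, _unknown in trials]
print(shares)          # [60.0, 30.0, 45.0]
print(median(shares))  # 45.0
```

The median is taken over per-trial shares rather than pooled counts, so a single very large trial cannot dominate a sponsor's figure.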

Methodology FAQ

How are industry sponsors parsed from the registry?

Sponsor fields follow the AACT database's extraction contract for ClinicalTrials.gov's sponsorCollaboratorsModule: the lead sponsor maps to role "lead" and each collaborator to role "collaborator", each carrying an agency class (Industry, NIH, Federal, Other, Network) and an organization name. A trial enters this view when its lead sponsor is Industry — or, failing that, when at least one collaborator is Industry, in which case the alphabetically first industry collaborator is assigned. The "Lead Sponsor Only" toggle drops the collaborator-fallback trials entirely.
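The lead-first, collaborator-fallback rule can be sketched roughly as follows. The record layout (dicts with role, agency_class, and name keys) and the case-insensitive alphabetical sort are assumptions made for illustration:

```python
from typing import Optional


def assign_industry_sponsor(sponsors: list[dict], lead_only: bool = False) -> Optional[str]:
    """Return the industry sponsor a trial is attributed to, or None."""
    lead = next((s for s in sponsors if s["role"] == "lead"), None)
    if lead is not None and lead["agency_class"] == "Industry":
        return lead["name"]
    if lead_only:
        return None  # "Lead Sponsor Only" drops collaborator-fallback trials
    industry_collaborators = sorted(
        (s["name"] for s in sponsors
         if s["role"] == "collaborator" and s["agency_class"] == "Industry"),
        key=str.casefold,  # assumed: alphabetical order ignores case
    )
    return industry_collaborators[0] if industry_collaborators else None


# Hypothetical trial with a non-industry lead and two industry collaborators.
trial = [
    {"role": "lead", "agency_class": "Other", "name": "Example University"},
    {"role": "collaborator", "agency_class": "Industry", "name": "Zeta Bio"},
    {"role": "collaborator", "agency_class": "Industry", "name": "Alpha Pharma"},
]
print(assign_industry_sponsor(trial))                  # Alpha Pharma
print(assign_industry_sponsor(trial, lead_only=True))  # None
```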

How are subsidiaries mapped to parent companies?

Raw sponsor names are grouped with the analysis pipeline's canonical company map: an ordered list of patterns where the first match wins (for example, Wyeth, Hospira, and Seagen fold into Pfizer; Celgene into Bristol-Myers Squibb; Genzyme into Sanofi; Janssen into Johnson & Johnson — while Genentech, Abbott, and Organon are deliberately kept separate from their affiliates). Names matching no pattern keep their own identity after legal suffixes (Inc, Ltd, GmbH, Pharmaceuticals, and similar) are stripped, so "Acme Pharmaceuticals, Inc." and "Acme Pharma Ltd" count as one sponsor.
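A first-match-wins pattern list with suffix stripping might look like the sketch below. The pattern list is a heavily abbreviated stand-in for the real canonical map, and the exact suffix set (including "Pharma") is an assumption, since the description above only says "and similar":

```python
import re

# Hypothetical, abbreviated pattern list; order matters (first match wins).
CANONICAL_PATTERNS = [
    (re.compile(r"\b(?:wyeth|hospira|seagen|pfizer)\b", re.I), "Pfizer"),
    (re.compile(r"\b(?:celgene|bristol[- ]myers squibb)\b", re.I), "Bristol-Myers Squibb"),
    (re.compile(r"\b(?:genzyme|sanofi)\b", re.I), "Sanofi"),
    (re.compile(r"\b(?:janssen|johnson & johnson)\b", re.I), "Johnson & Johnson"),
]

# Assumed suffix set; a trailing suffix is removed along with its separators.
SUFFIX = re.compile(r"[,.\s]+(?:inc|ltd|gmbh|pharmaceuticals|pharma)\.?$", re.I)


def canonical_company(raw_name: str) -> str:
    """Map a raw sponsor name to its canonical company name."""
    for pattern, parent in CANONICAL_PATTERNS:
        if pattern.search(raw_name):
            return parent
    # No pattern matched: strip legal suffixes repeatedly until none remain.
    name = raw_name.strip()
    while True:
        stripped = SUFFIX.sub("", name)
        if stripped == name:
            return name
        name = stripped


print(canonical_company("Janssen Research & Development, LLC"))  # Johnson & Johnson
print(canonical_company("Acme Pharmaceuticals, Inc."))           # Acme
print(canonical_company("Acme Pharma Ltd"))                      # Acme
```

Because Genentech has no pattern in the list, a name such as "Genentech, Inc." falls through to suffix stripping and keeps its own identity.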

How are non-participant counts (COUNT_OF_UNITS) handled?

Some baseline tables count units other than people — tests, eyes, lesions — tagged COUNT_OF_UNITS in the registry, and their totals can legitimately exceed enrollment. The percent-female analysis is restricted to participant counts: this dashboard's extractor only converts percentage measures using participant-unit denominators and rejects cluster denominators such as clinics or sites, the same guard the source analysis applies by excluding COUNT_OF_UNITS measures from percent female.

What do the Role and Scope toggles change?

Role controls how a trial is attributed to a company: "Lead & Collaborator" (default) uses the lead-first-with-collaborator-fallback assignment above, while "Lead Sponsor Only" keeps only trials whose lead sponsor is the industry company. Scope controls which sponsor set the views display: the top-10 named sponsors, or all industry — which adds the pooled "Other Industry" bucket (every industry trial outside the top 10) as its own row and trend line. The Adjusted Differences view always contrasts each named sponsor against Other Industry, re-fit for the selected Role.

Which condition categories are treated as sex-specific?

Five secondary condition categories are classed as sex-specific: Breast Cancer, Prostate Cancer, Infertility, Pregnancy Complications, and Menopause and Hormonal. On the Sex tier they are excluded from the condition axis by default — their heavily single-sex case mix would otherwise masquerade as sponsor enrollment behavior — and a toggle above the views adds them back. The Race and Ethnicity tiers always include them, since racial and ethnic representation is measured within those diseases the same as any other.

What do the benchmark options compare against?

Cohort Baseline (default) measures each sponsor against the condition's pooled median across all industry trials in the current filter — "does this sponsor differ from industry practice?". 50% Parity (Sex) and Census Share (Race/Ethnicity, 2020 U.S. Census population shares) measure against the general population — "does enrollment mirror the population?". Disease Prevalence — "does enrollment mirror the patients?" — is held as a placeholder until per-condition prevalence breakdowns by sex, race, and ethnicity are integrated; population benchmarks can mislead where disease burden is skewed, which is exactly what the prevalence benchmark will correct.

Why can't the heatmap and trend show every filter combination?

Cell and yearly medians are only displayed when enough trials remain: heatmap cells need at least 10 trials (below that the cell shows its n uncolored) and trend points need at least 5 per sponsor-year. Medians on fewer trials are noisier than the differences being read, so small cells are shown as thin data rather than colored as signal.
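The display thresholds amount to a simple guard around the median. A minimal sketch, with hypothetical names and sample values:

```python
from statistics import median
from typing import Optional

MIN_TRIALS_HEATMAP = 10  # minimum trials per heatmap cell
MIN_TRIALS_TREND = 5     # minimum trials per sponsor-year


def displayed_median(values: list[float], min_trials: int) -> Optional[float]:
    """Return the median only when enough trials back it; otherwise None."""
    if len(values) < min_trials:
        return None  # too few trials: shown as thin data, not as signal
    return median(values)


# Hypothetical percent-female values for five trials.
cell = [40.0, 50.0, 60.0, 55.0, 45.0]
print(displayed_median(cell, MIN_TRIALS_TREND))    # 50.0
print(displayed_median(cell, MIN_TRIALS_HEATMAP))  # None
```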

Frequently Asked Questions

What led you to do this?

Many people claim that trials are not diverse. There are also many ongoing initiatives to increase diversity in clinical trials. There are not many publicly available tools to assess the progress of those initiatives, or get a holistic view of trial diversity. We thought this would be a great start.

Where do you get this data from?

We use the ClinicalTrials.gov API — it is a great resource and should be more widely used, as should programs like the Aggregate Analysis of ClinicalTrials.gov (AACT) database from the Clinical Trials Transformation Initiative. The demographic variables we display here are more difficult to parse than the more standardized variables (e.g., trial phase, total participants). We designed this dashboard based on our experience parsing these sociodemographic characteristics — namely race and ethnicity (published here) — and ongoing projects examining sex, gender, and geography.

Which studies are included in the dashboard?

The dashboard covers studies registered on ClinicalTrials.gov that have posted results, from the first results postings in 2009 through the present. The Year Range filter is based on the date results were first posted, not on when the study was conducted. By default the dashboard shows interventional studies; the Study Type filter can switch to observational studies or all studies.

The four summary cards on the Overview tab describe the set of studies matching your current filters: the total number of studies, and the share that explicitly reported race data, ethnicity data, or both.

How often is the data updated, and what are snapshots?

An automated pipeline re-extracts the full dataset from the ClinicalTrials.gov API weekly (Sundays at 6 AM UTC). The "Last updated" date in the header reflects the most recent extraction.

Each extraction is also archived as a dated snapshot. The "View snapshot" selector in the header loads the dashboard exactly as it stood on a past extraction date, which is useful for tracking how reporting evolves over time.

How far back does the historical snapshot log go?

Snapshots follow a tiered retention schedule: the four most recent snapshots at roughly bi-weekly spacing (covering about the last two months) are kept as complete datasets, and beyond that one snapshot per calendar month is kept as an aggregate archive — every chart renders exactly as extracted on that date, while the interactive filters and the full study-level table require a complete (bi-weekly or latest) dataset. The live dashboard always reflects the newest weekly extraction regardless of retention.

The pruning runs automatically alongside the weekly extraction, and the "View snapshot" selector lists exactly the retained dates. We keep the log tiered rather than unlimited because each snapshot is roughly 140 MB and the dashboard is served as a static site: an unbounded archive makes deployments slow and unreliable long before it adds analytical value. If you need a specific historical extraction that has aged out of the log, email us — full weekly extractions are archived off-site.

Is this work currently funded?

We received an API credit grant through Anthropic PBC in May 2026. The elements of this initiative that the API credits will enhance have not yet been implemented; we expect to begin rolling them out in late July 2026. We are open to conversations about supporting this work in other ways, and to any suggestions you have for our approach. Shoot us an email at [email protected]

How is Funding Source derived?

Funding source is categorized based on sponsor information:

Industry: Lead Sponsor is Industry
NIH: Lead Sponsor is NIH, OR (Lead Sponsor is Other/Network AND any Collaborator is NIH)
Other U.S. Federal: Lead Sponsor is Federal, OR (Lead Sponsor is Other/Network AND any Collaborator is Federal)
Other: All other cases
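The rules above can be expressed as a short function. This is a sketch: the function name is hypothetical, and checking NIH before Federal when an Other/Network lead has both kinds of collaborator is an assumption based on the order in which the rules are listed:

```python
def funding_source(lead_class: str, collaborator_classes: list[str]) -> str:
    """Derive the funding source from the lead sponsor and collaborator classes."""
    if lead_class == "Industry":
        return "Industry"
    lead_is_other = lead_class in ("Other", "Network")
    if lead_class == "NIH" or (lead_is_other and "NIH" in collaborator_classes):
        return "NIH"
    if lead_class == "Federal" or (lead_is_other and "Federal" in collaborator_classes):
        return "Other U.S. Federal"
    return "Other"


print(funding_source("Other", ["NIH", "Industry"]))  # NIH
print(funding_source("Network", ["Federal"]))        # Other U.S. Federal
print(funding_source("Other", ["Industry"]))         # Other
```

Note that an industry collaborator alone does not make a trial "Industry" here; only the lead sponsor's class does.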

How are conditions categorized?

Conditions are categorized using a standardized medical hierarchy. We group specific conditions (e.g., "Congenital Heart Disease") into broader Primary Categories (e.g., "Cardiovascular"), with more granular Secondary Categories underneath. This reduces redundancy from synonyms (e.g., "congenital heart defect" and "congenital heart disease" map to the same secondary category) and allows for both broad and granular filtering.

The classification uses a two-step process:

Exact/Substring Match: Each condition is checked against a curated list of keywords and synonyms, matched longest-first so specific terms (e.g., "heart failure") take priority over general ones (e.g., "heart").
Fuzzy Match: If no exact match is found, lightweight fuzzy string matching (via rapidfuzz) catches typos and minor variations (e.g., "Type II Diabetes" vs "Type 2 Diabetes").
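The two-step process can be sketched as follows. To keep the example dependency-free, the standard library's difflib stands in for rapidfuzz; the keyword map, the placeholder secondary category for "heart", and the 0.85 similarity cutoff are all assumptions for illustration:

```python
from difflib import get_close_matches

# Hypothetical keyword map: keyword -> (primary category, secondary category).
KEYWORDS = {
    "heart failure": ("Cardiovascular", "Heart Failure"),
    "heart": ("Cardiovascular", "Unspecified"),  # placeholder secondary category
    "type 2 diabetes": ("Endocrine and Metabolic", "Type 2 Diabetes"),
}


def classify(condition: str, cutoff: float = 0.85) -> tuple[str, str]:
    """Classify a condition string into (primary, secondary) categories."""
    text = condition.lower().strip()
    # Step 1: substring match, longest keyword first, so that
    # "heart failure" takes priority over "heart".
    for keyword in sorted(KEYWORDS, key=len, reverse=True):
        if keyword in text:
            return KEYWORDS[keyword]
    # Step 2: fuzzy match to catch typos and minor variations.
    close = get_close_matches(text, list(KEYWORDS), n=1, cutoff=cutoff)
    if close:
        return KEYWORDS[close[0]]
    return ("Other", "Other")


print(classify("Chronic Heart Failure"))  # ('Cardiovascular', 'Heart Failure')
print(classify("Type II Diabetes"))       # ('Endocrine and Metabolic', 'Type 2 Diabetes')
print(classify("Ingrown toenail"))        # ('Other', 'Other')
```

The cutoff trades recall for precision: a lower value catches more spelling variants but risks mapping unrelated conditions to a category.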

Primary Category	Example Secondary Categories
Cardiovascular	Heart Failure, Coronary Artery Disease, Arrhythmia, Hypertension, Congenital Heart Disease, Valvular Heart Disease, Cardiomyopathy, Peripheral Vascular Disease
Oncology	Breast Cancer, Lung Cancer, Colorectal Cancer, Prostate Cancer, Hematologic Malignancy, Brain and CNS Tumors, Skin Cancer, Sarcoma
Neurology	Alzheimer's Disease and Dementia, Parkinson's Disease, Epilepsy and Seizure Disorders, Multiple Sclerosis, Stroke and Cerebrovascular, Headache and Migraine
Respiratory	COPD, Asthma, Pulmonary Fibrosis, Pneumonia, Pulmonary Hypertension, Sleep Apnea
Mental Health	Depression, Anxiety Disorders, Bipolar Disorder, Schizophrenia and Psychotic Disorders, PTSD and Trauma, ADHD, Autism Spectrum, Eating Disorders
Endocrine and Metabolic	Type 1 Diabetes, Type 2 Diabetes, Obesity, Thyroid Disorders, Lipid Disorders
Infectious Disease	HIV/AIDS, Hepatitis, COVID-19, Tuberculosis, Influenza, Bacterial Infections, Parasitic Diseases
Autoimmune and Inflammatory	Rheumatoid Arthritis, Systemic Lupus Erythematosus, Inflammatory Bowel Disease, Psoriasis and Psoriatic Arthritis, Vasculitis
Gastrointestinal	GERD and Esophageal, Liver Disease, Irritable Bowel Syndrome, Pancreatic Disorders
Kidney and Urological	Chronic Kidney Disease, End-Stage Renal Disease, Glomerular Diseases, Kidney Transplant, Urological Disorders
Musculoskeletal	Osteoarthritis, Osteoporosis, Back and Spine, Fibromyalgia, Gout, Fractures and Trauma
Dermatology	Eczema and Dermatitis, Psoriasis, Acne and Rosacea, Wound and Ulcer, Hair and Nail Disorders
Substance Use Disorders	Alcohol Use Disorder, Opioid Use Disorder, Tobacco and Nicotine
Hematology	Anemia, Coagulation Disorders, Thrombosis
Ophthalmology	Macular Degeneration, Glaucoma, Diabetic Eye Disease
Reproductive and Sexual Health	Infertility, Pregnancy Complications, Menopause and Hormonal
Transplant and Immunology	Solid Organ Transplant, Bone Marrow Transplant, Allergy
Rare Diseases	Cystic Fibrosis, Amyloidosis, Lysosomal Storage Disorders
Pain	Chronic Pain, Acute Pain, Cancer Pain
Other	Any condition not matching the above categories

Note: A study may have conditions spanning multiple categories. The dashboard filters show studies that match ANY of the keywords for the selected primary and/or secondary category. You can filter by primary category alone for broad analysis, or drill down to a specific secondary category for more targeted results.

How is the "Not Reported (Missing)" category calculated?

In the "Full Distribution with Data Quality" charts, the light "Not Reported (Missing)" layer is calculated as:

Not Reported = Total Enrollment - Sum(All Reported Categories)
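As a worked sketch of this calculation (names and counts are hypothetical): the explicitly unknown count is treated as a reported category, and when reported counts exceed registered enrollment the larger of the two serves as the denominator so that no share exceeds 100%.

```python
def data_quality_shares(enrollment: int, reported: dict[str, int], unknown: int = 0) -> dict[str, float]:
    """Percent of enrollment per layer, including the implicit missing layer."""
    reported_total = sum(reported.values()) + unknown
    # Guard against registry quirks where reported counts exceed enrollment.
    denominator = max(enrollment, reported_total)
    if denominator == 0:
        return {}
    not_reported = denominator - reported_total
    shares = {name: 100.0 * count / denominator for name, count in reported.items()}
    shares["Explicitly Unknown"] = 100.0 * unknown / denominator
    shares["Not Reported (Missing)"] = 100.0 * not_reported / denominator
    return shares


print(data_quality_shares(200, {"Female": 90, "Male": 80}, unknown=10))
# {'Female': 45.0, 'Male': 40.0, 'Explicitly Unknown': 5.0, 'Not Reported (Missing)': 10.0}
```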

"Explicitly Unknown" is different: it is a category sponsors actively reported (see "What is the difference between 'Not Reported' and 'Explicit Unknown'?" below). Together the layers sum to 100% of enrollment, providing a complete picture of data completeness. In rare years where reported counts exceed registered enrollment — a registry data-quality quirk — the larger of the two is used as the denominator so percentages never exceed 100%.

How is Geography determined?

Geography comes from each trial's registered study sites (the facility city, state, and country listed on ClinicalTrials.gov). Trials are classified as single-site or multi-site, and as "Location Not Reported" when no sites are listed. The U.S. view aggregates sites by state and by U.S. Census region.

Note that a multi-site trial is counted in every state or country where it has a site, so map and regional totals can exceed the total number of trials.

What about searching for specific trials and summarizing the information in other ways?

This project is currently focused on demographics surrounding clinical trials. There are other tools that do a great job at searching unstructured data from ClinicalTrials.gov. There is a great connector for Claude Code built by the company deepsense.ai. More information on that connector is here.

How do you count trial sponsors?

Sponsor information appears in three places on the dashboard, and each works slightly differently:

The Sponsor filter (filter bar) matches the lead sponsor's class as registered on ClinicalTrials.gov: NIH, Industry, Federal (non-NIH), Other, or Network.
Funding Source (study detail view) is derived from the lead sponsor and collaborators together — see "How is Funding Source derived?" above.
The sponsor dropdown on the Geography tab takes a broad approach: an organization is counted if it is listed as either the Lead Sponsor or a Collaborator, to capture the full ecosystem of organizations supporting a trial rather than just the primary administrative entity.

What race and ethnicity categories are used?

We standardize to the NIH/OMB reporting categories that sponsors use when posting results to ClinicalTrials.gov. Race: American Indian/Alaska Native, Asian, Black/African American, Native Hawaiian/Pacific Islander, White, More than one race, Other, and Unknown/Not Reported. Ethnicity: Hispanic/Latino, Not Hispanic/Latino, and Unknown/Not Reported.

Sponsors often report more granular labels (e.g., specific Asian subgroups); where available these are preserved and shown in the subcategory charts on the Race and Ethnicity tabs. All values reflect what sponsors reported to the registry — we do not infer or impute demographic categories.

How are Sex and Gender defined on this dashboard?

They are extracted separately and never cross-mapped. Sex refers to biological sex as reported in a trial's baseline characteristics, standardized to Female, Male, and Unknown or Not Reported. Gender refers to gender identity, standardized to Woman, Man, Non-binary, Transgender, Other, and Unknown or Not Reported — and is only parsed from baseline tables explicitly labeled as gender.

Because the pipeline keeps these strictly decoupled (a "Female" label is never counted as "Woman", and vice versa), the Gender tab reflects only trials that genuinely collected gender identity — which remains rare in the registry.

How is time to report calculated?

Time to report is calculated as the difference in days between when the study results were first posted and the study's primary completion date (the date the final participant was examined for the primary outcome). If a primary completion date is not available, the overall study completion date is used instead.

Time to Report = Results First Posted Date - Primary Completion Date

A positive number (red/orange bar) indicates the results were posted after the study's primary completion. A negative number (blue bar) indicates results were posted early, before the primary completion date.
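In code, the calculation is a date subtraction with a fallback. A minimal sketch with hypothetical function and parameter names:

```python
from datetime import date
from typing import Optional


def time_to_report(results_first_posted: date,
                   primary_completion: Optional[date],
                   study_completion: Optional[date] = None) -> Optional[int]:
    """Days from primary completion (or study completion) to results posting."""
    reference = primary_completion or study_completion
    if reference is None:
        return None  # neither completion date is available
    return (results_first_posted - reference).days


print(time_to_report(date(2021, 3, 15), date(2020, 3, 15)))       # 365
print(time_to_report(date(2021, 1, 10), None, date(2021, 1, 1)))  # 9
print(time_to_report(date(2020, 1, 1), date(2020, 1, 31)))        # -30
```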

What do the FDA regulatory status fields mean?

These fields indicate the level of U.S. Food and Drug Administration (FDA) oversight for a given clinical trial, as reported by the trial sponsor to ClinicalTrials.gov.

FDA Regulated Drug: Indicates the trial is studying a drug or biological product subject to FDA regulations.
FDA Regulated Device: Indicates the trial is studying a medical device subject to FDA regulations.
Unapproved Device: Indicates that at least one device being studied has not been previously approved or cleared by the FDA for any use.
Non-Regulated / Unflagged: The trial does not have these specific FDA flags set to true. This bucket contains the majority of studies, including behavioral interventions, surgical technique comparisons, international trials without a U.S. IND, and older legacy trials conducted before these fields were mandated.

Note: These fields indicate regulatory jurisdiction over the trial itself. They do not mean the FDA has ultimately approved the drug or device for public market availability.

Does the dashboard contain any individual participant data?

No. Everything shown here is built from aggregate counts that sponsors post publicly to ClinicalTrials.gov — for example, the number of participants in each race, ethnicity, sex, or gender category from a trial's baseline characteristics table. The dashboard contains no individual-level or identifiable participant data, and we collect none ourselves.

Why does the mobile version show fewer features?

The full per-study dataset is roughly 136 MB compressed — too heavy for phones. Mobile browsers instead load a small pre-computed summary: the charts show aggregates over the whole dataset (the filters don't apply), the Studies table is limited to the 500 most recently posted studies, and the filter bar and Geography tab (which need the full per-study data) are desktop-only.

Beta Data Extraction & Curation

How does the pipeline select which manuscript to download?

For every trial, the pipeline queries manuscript APIs (EuropePMC, Unpaywall) using the NCT ID and receives a list of candidate papers. Rather than taking the first hit, each candidate is passed through an adjusted scoring function:

Title-Based Penalty: Candidates whose titles contain tokens like protocol, secondary analysis, retrospective, observational, survey, post-hoc, design, or rationale receive a heavy score deduction — these are almost never the primary-results paper we want.
Temporal Bonus: Candidates published 0–24 months after the trial's primary_completion_date (or completion_date as a fallback) receive a score boost, since primary-results manuscripts typically appear within that window.
Selection: All candidates are sorted by adjusted score (ties broken by newer publication date) and the highest-scoring paper is chosen as the PDF to download and extract.

This replaces the naive "first result wins" heuristic and materially reduces the number of design/protocol papers that leak into the extraction set.
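A rough sketch of such a scoring function follows. The penalty and bonus weights, the base_score field, and the approximation of 24 months as 730 days are assumptions; only the penalty tokens, the 0 to 24 month window, and the tie-break come from the description above:

```python
from dataclasses import dataclass
from datetime import date

PENALTY_TOKENS = ("protocol", "secondary analysis", "retrospective", "observational",
                  "survey", "post-hoc", "design", "rationale")
TITLE_PENALTY = 50.0   # assumed weight
TEMPORAL_BONUS = 20.0  # assumed weight
WINDOW_DAYS = 730      # roughly 24 months


@dataclass
class Candidate:
    title: str
    published: date
    base_score: float  # hypothetical relevance score from the search API


def adjusted_score(c: Candidate, completion: date) -> float:
    score = c.base_score
    title = c.title.lower()
    if any(token in title for token in PENALTY_TOKENS):
        score -= TITLE_PENALTY
    if 0 <= (c.published - completion).days <= WINDOW_DAYS:
        score += TEMPORAL_BONUS
    return score


def select_manuscript(candidates: list[Candidate], completion: date) -> Candidate:
    # Highest adjusted score wins; ties go to the newer publication.
    return max(candidates, key=lambda c: (adjusted_score(c, completion), c.published))


completion = date(2020, 6, 1)
candidates = [
    Candidate("Study protocol for the EXAMPLE trial", date(2019, 1, 1), 10.0),
    Candidate("Efficacy of drug X: a randomized trial", date(2021, 2, 1), 8.0),
]
print(select_manuscript(candidates, completion).title)
# Efficacy of drug X: a randomized trial
```

In this example the protocol paper has the higher base score, but the title penalty and the missing temporal bonus push it below the results paper.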

What are the Manuscript Match Tiers?

Because ClinicalTrials.gov does not always explicitly link to published results, our pipeline uses a 3-tier system to find manuscripts:

Tier 1 (Explicit Match): The manuscript explicitly contains the NCT ID. However, this still requires curation! Researchers often link secondary analyses, protocols, or observational spin-offs to their primary NCT ID. While the link is explicit, the paper might not contain the primary trial demographics.
Tier 2 (High-Confidence Fuzzy Match): The pipeline found a paper where the authors, publication year, and keywords strongly align with the trial's metadata.
Tier 3 (Low-Confidence Fuzzy Match): A broad keyword search match. These require strict manual curation, as the search may pull in loosely related papers.

How does the Discrepancy Engine work?

AI models can hallucinate or extract data from improperly matched papers (e.g., a Tier 1 protocol paper instead of the results paper). To ensure data integrity, our engine compares every LLM-extracted demographic value against the original ClinicalTrials.gov API baseline. It flags fields as a Match, an Addition (finding data the registry missed), or a Conflict (contradicting the registry). Curators must manually verify Additions and Conflicts.
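The per-field comparison can be sketched as below. The field names and values are hypothetical, and returning no flag when the model extracted nothing for a field is an assumption the description does not cover:

```python
from typing import Optional


def flag_field(registry_value: Optional[object], extracted_value: Optional[object]) -> Optional[str]:
    """Compare one extracted value against the registry baseline."""
    if extracted_value is None:
        return None  # assumed: nothing extracted means nothing to flag
    if registry_value is None:
        return "Addition"  # the paper reports data the registry lacks
    if registry_value == extracted_value:
        return "Match"
    return "Conflict"


# Hypothetical registry record and model extraction for one trial.
registry = {"female": 120, "asian": None, "hispanic": 30}
extracted = {"female": 120, "asian": 14, "hispanic": 35}
flags = {field: flag_field(registry[field], extracted.get(field)) for field in registry}
print(flags)
# {'female': 'Match', 'asian': 'Addition', 'hispanic': 'Conflict'}
```

Only the Addition and Conflict flags would need curator review; a Match simply confirms the registry value.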

What is the difference between "Not Reported" and "Explicit Unknown"?

In clinical trial demographics, missing data and "unknown" classifications are fundamentally different categories. "Not Reported" means the demographic variable does not appear in the trial's reporting at all. "Explicit Unknown" means the researchers actively collected the data, but the demographic identity of certain participants could not be determined or the participants declined to provide it. The dashboard displays these two states as separate layers.

What AI Models are being compared in the Beta?

We are running a 3-way model comparison to evaluate cost, speed, and extraction quality using Anthropic's Claude family:

Haiku 4.5: The fastest and most cost-effective model.
Sonnet 4.6: The balanced baseline for speed and intelligence.
Opus 4.8: The highest quality model for complex reasoning and navigating dense clinical methodologies.

How are AI studies identified?

A study is flagged as AI-related if any of its text fields contain one or more keywords associated with artificial intelligence or machine learning. The fields searched are: brief title, primary endpoint, conditions, primary condition category, and secondary condition category.

The keyword list includes:

artificial intelligence, machine learning, deep learning
neural network, large language model, LLM
natural language processing, computer vision
reinforcement learning, generative AI, chatbot
predictive algorithm, clinical decision support algorithm
algorithm-based, algorithm-driven, AI-based, AI-driven, AI-powered, ML-based, ML-driven

Matching is case-insensitive and uses whole-word boundary matching to avoid false positives (e.g., "ML-based" is matched as a standalone term, but not where it appears inside "HTML-based"). This approach is intentionally broad to capture the evolving vocabulary around AI in clinical research, while still being precise enough to avoid spurious matches.
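A minimal sketch of this kind of matching with Python's standard `re` module is shown below. The keyword list is copied from above; the field names in `SEARCH_FIELDS` and the `is_ai_study` function are hypothetical, and the lookaround-based boundary is one reasonable way to get whole-word behavior, not necessarily the one the dashboard uses.

```python
import re

AI_KEYWORDS = [
    "artificial intelligence", "machine learning", "deep learning",
    "neural network", "large language model", "LLM",
    "natural language processing", "computer vision",
    "reinforcement learning", "generative AI", "chatbot",
    "predictive algorithm", "clinical decision support algorithm",
    "algorithm-based", "algorithm-driven", "AI-based", "AI-driven",
    "AI-powered", "ML-based", "ML-driven",
]

# A keyword matches only when it is not glued to a word character on
# either side, so "ML-based" inside "HTML-based" is rejected.
AI_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in AI_KEYWORDS) + r")(?!\w)",
    re.IGNORECASE,
)

# Hypothetical field names standing in for the five searched fields.
SEARCH_FIELDS = (
    "brief_title", "primary_endpoint", "conditions",
    "primary_condition_category", "secondary_condition_category",
)


def is_ai_study(study: dict) -> bool:
    """Return True if any searched field contains an AI/ML keyword."""
    for field in SEARCH_FIELDS:
        value = study.get(field)
        if not value:
            continue
        # Some fields may hold a list of strings (e.g. several conditions).
        text = " ".join(value) if isinstance(value, (list, tuple)) else str(value)
        if AI_PATTERN.search(text):
            return True
    return False
```

With this pattern, a title such as "Deep Learning for Retinal Screening" is flagged, while "Order fulfillment workflow" is not, even though "fulfillment" contains the letters "llm".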

What is FDAAA, and what changed in 2017?

The Food and Drug Administration Amendments Act of 2007 (FDAAA) is the law behind most of the data on this dashboard. Its Section 801 created the legal requirement that "applicable clinical trials" of drugs, biologics, and devices be registered on ClinicalTrials.gov and — for the first time — that sponsors submit basic results, including participant demographics, generally within 12 months of the trial's primary completion date.

For its first decade the statute left key terms ambiguous and results reporting lagged. The Final Rule (42 CFR Part 11), published September 2016 and effective January 18, 2017, resolved that: it spelled out exactly which trials are covered (including trials of unapproved products), what must be submitted (baseline demographics such as age, sex, and — when collected — race and ethnicity; outcomes; adverse events), the deadlines, and the penalties for non-compliance (civil monetary fines and, for federally funded trials, grant suspension). A companion NIH policy applied the same expectations to all NIH-funded trials on the same date. ClinicalTrials.gov maintains a summary at FDAAA 801 and the Final Rule.

This is why several time-series charts on this dashboard carry a dashed marker at 2017: reporting behavior before and after that line reflects two different enforcement regimes, and comparisons that span it should keep the rule change in mind.

Civic Sample is a tool created to investigate the demographic characteristics of participants in clinical trials. It is centered on the ideal that diversity and representation in trials are important for preventing sample bias and improving study generalizability. This mission starts with understanding who is involved in studies, so that we can move toward strategies for improving representation.

Who built this

Maryam Aziz

Maryam Aziz is a Ph.D. candidate in Population Health Sciences at Duke University School of Medicine and holds an M.S. in Computer Science from Columbia University. Her research focuses on human-centered development and evaluation of AI in healthcare, particularly for women's health, with an emphasis on equity, transparency, and clinical impact.

Michael D. Green, Ph.D.

Michael is a Postdoctoral Researcher in the Department of Health, Behavior, and Society at the Johns Hopkins School of Public Health. Michael earned a Ph.D. in Population Health Sciences from the Duke University School of Medicine and a B.A. in Anthropology with honors from Dartmouth College. Michael's research focuses on unequal treatment in healthcare.

Both hope to advance this work in two ways: first, by establishing a clear platform for accountability and transparency around the state of diversity in clinical trials; and second, by helping trial sponsors, investigators, and companies find approaches to diversify their trial populations so that trials are more representative.
