The Model Card Template That Passes FDA Pre-Cert Review
FDA's Software Pre-Certification program requires AI transparency. Here's the model card template that gets medical device AI approved faster.
The FDA Submission That Got Rejected
Startup: "We're submitting our AI diagnostic tool for FDA Pre-Cert."
FDA Reviewer: "Provide documentation: training data, model architecture, evaluation metrics, clinical validation."
Startup: "We have a white paper..."
FDA: "We need structured documentation. Model card, data card, and clinical evaluation report. Resubmit in 6 months."
The Delay: 6 months of scrambling to create documentation that should've existed from day one.
What FDA Pre-Cert Requires (The Checklist)
Three Documents:
- Model Card: What the AI does, how it was trained, limitations
- Data Card: Where training data came from, bias testing, quality control
- Clinical Evaluation Report: Real-world validation, safety monitoring
Timeline:
- Without documentation: 12-18 months to approval
- With documentation: 6-9 months
Cost Savings: 6 months of engineering time, plus faster time to market
The FDA-Ready Model Card Template
Section 1: Intended Use
What FDA Wants:
- Medical condition/disease targeted
- Patient population (age, sex, comorbidities)
- Clinical setting (hospital, clinic, home use)
- User (physician, nurse, patient)
Example:
INTENDED USE
Medical Condition: Type 2 Diabetes screening
Patient Population: Adults 18-75, no prior diabetes diagnosis
Clinical Setting: Primary care clinic
Primary User: Primary care physician
Decision Support: AI flags high-risk patients for lab testing (HbA1c)
What NOT to Say: "General health screening" (too vague—FDA will reject)
Section 2: Model Architecture
What FDA Wants:
- Algorithm type (e.g., "Gradient boosting classifier")
- Input features (e.g., "Age, BMI, blood pressure, family history")
- Output (e.g., "Risk score 0-100, with threshold at 70 for high-risk")
Example:
MODEL ARCHITECTURE
Algorithm: XGBoost (gradient boosting decision trees)
Version: XGBoost 1.7.0
Inputs: 12 clinical features (age, BMI, systolic BP, fasting glucose, etc.)
Output: Diabetes risk score (0-100)
Threshold: Score ≥70 = High Risk (recommend HbA1c lab test)
Why This Matters: FDA needs to understand how the AI makes decisions (interpretability requirement).
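The documented decision rule above is simple enough to sketch directly. This is an illustrative snippet, not a real device codebase: the function name is hypothetical, but the 0-100 score range and the ≥70 threshold come straight from the example model card.

```python
# Minimal sketch of the documented decision rule (illustrative only;
# the score range and threshold mirror the example model card above).
RISK_THRESHOLD = 70  # Score >= 70 -> High Risk, recommend HbA1c lab test

def classify_risk(risk_score: float) -> str:
    """Map a 0-100 diabetes risk score to the documented output label."""
    if not 0 <= risk_score <= 100:
        raise ValueError("Risk score must be in the documented 0-100 range")
    return "High Risk" if risk_score >= RISK_THRESHOLD else "Low Risk"

print(classify_risk(72.5))  # High Risk
print(classify_risk(41.0))  # Low Risk
```

Documenting the rule at this level of precision is exactly what lets an FDA reviewer trace an output back to a decision.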
Section 3: Training Data
What FDA Wants:
- Source (where data came from)
- Volume (how many patients)
- Demographics (age, sex, race, ethnicity)
- Date range (when data was collected)
- Quality control (how you ensured data accuracy)
Example:
TRAINING DATA
Source: Electronic Health Records from [Hospital System], IRB-approved (Protocol #12345)
Volume: 50,000 patients (2018-2023)
Demographics:
- Age: Mean 52 (range 18-75), SD 14
- Sex: 52% female, 48% male
- Race: 60% White, 20% Black, 8% Asian, 12% other/multiracial
- Ethnicity: 85% Non-Hispanic, 15% Hispanic (reported separately from race, per OMB standards)
Data Quality:
- Missing data: <5% per feature (imputed using median)
- Outliers: Values >99th percentile reviewed by clinician, corrected or removed
De-Identification: HIPAA-compliant (dates shifted, names removed, rare diagnoses aggregated)
Red Flag: If training demographics don't match the intended-use population (here, US adults), FDA will ask how you tested for and mitigated bias.
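The two quality-control steps in the example (median imputation for the <5% missing values, 99th-percentile outlier flagging for clinician review) can be sketched in a few lines. This is a hedged illustration using only the standard library; function names are hypothetical, and a real pipeline would likely use pandas.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the column median, per the
    documented <5% missing-data rule."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def flag_outliers(values, pct=0.99):
    """Flag values above the 99th percentile for clinician review (the
    model card says they are corrected or removed, not auto-dropped)."""
    cutoff = sorted(values)[int(pct * (len(values) - 1))]
    return [v for v in values if v > cutoff]

bmi = [22.0, None, 30.5, 27.1]
print(impute_median(bmi))  # [22.0, 27.1, 30.5, 27.1]
```

Note the design choice: outliers are flagged for human review, not silently dropped. FDA reviewers look for exactly that kind of human-in-the-loop quality control.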
Section 4: Evaluation Metrics
What FDA Wants:
- Accuracy, sensitivity, specificity (clinical gold standards)
- Performance by demographic subgroup (fairness testing)
- Comparison to human clinicians (is AI better?)
- Clinical impact (does AI improve patient outcomes?)
Example:
EVALUATION METRICS
Test Set: 10,000 patients (held out, not used in training)
Overall Performance:
- Sensitivity (Recall): 87% (95% CI: 85-89%)
- Specificity: 82% (95% CI: 80-84%)
- AUC: 0.91
Subgroup Performance (Fairness Testing):
- Female: Sensitivity 88%, Specificity 83%
- Male: Sensitivity 86%, Specificity 81%
- White: Sensitivity 89%, Specificity 84%
- Black: Sensitivity 84%, Specificity 79% (within 5pp, acceptable)
Comparison to Physician:
- Physician sensitivity: 78% (AI +9pp improvement)
- Physician specificity: 85% (AI -3pp, acceptable trade-off)
Clinical Impact:
- Early detection: AI flags 12% more high-risk patients than physician alone
- Estimated prevented complications: 200 cases/year per 10,000 patients screened
Why This Matters: FDA cares about patient outcomes, not just model accuracy.
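Sensitivity, specificity, and the subgroup breakdown above all derive from the same confusion-matrix counts. A minimal sketch (hypothetical function names, assuming binary labels where 1 = high-risk):

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute the two FDA-expected metrics from binary labels
    (1 = condition present / high-risk flag, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def subgroup_report(y_true, y_pred, groups):
    """Fairness testing: the same metrics per demographic subgroup.
    Assumes every subgroup contains both positive and negative cases."""
    report = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        report[g] = sensitivity_specificity(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return report
```

In practice you'd use scikit-learn and add confidence intervals, but the point stands: the subgroup table in your model card should be reproducible from held-out test data by code this simple.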
Section 5: Limitations and Warnings
What FDA Wants:
- Known failure modes (when AI is unreliable)
- Contraindications (when NOT to use AI)
- Required human oversight (physician must review)
Example:
LIMITATIONS
Known Failure Modes:
- Lower accuracy for patients with rare comorbidities (<1% of population)
- Not validated for patients under 18 or over 75
- Not validated for Type 1 Diabetes (only Type 2)
Contraindications:
- Do NOT use for patients with pre-existing diabetes diagnosis
- Do NOT use as sole diagnostic tool (lab confirmation required)
Required Human Oversight:
- Physician must review all high-risk flags before ordering lab tests
- AI is decision support, not autonomous diagnosis
- Physician retains final clinical decision authority
Why This Matters: FDA wants proof you're not overselling the AI's capabilities.
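Contraindications shouldn't live only in the documentation: a pre-screening guard can enforce them in code so the AI never emits a score the model card says is unsupported. A hedged sketch (hypothetical function, using the validated population from the example card):

```python
# Illustrative eligibility guard enforcing the documented contraindications
# (names are hypothetical, not from a real device codebase).
def eligible_for_screening(age: int, prior_diabetes_dx: bool) -> bool:
    """Return False for patients outside the validated population, so the
    model is never invoked where the model card says it is unvalidated."""
    if prior_diabetes_dx:      # contraindication: existing diabetes diagnosis
        return False
    if not 18 <= age <= 75:    # not validated under 18 or over 75
        return False
    return True
```

Showing FDA that limitations are enforced programmatically, not just disclosed, is strong evidence you aren't overselling the AI.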
Section 6: Post-Market Surveillance
What FDA Wants:
- How you'll monitor AI performance in production
- What triggers a safety alert (accuracy drop, adverse events)
- How often you'll retrain/update the model
Example:
POST-MARKET SURVEILLANCE
Monitoring Plan:
- Monthly accuracy tracking on production data (random sample of 500 patients)
- Alert trigger: Sensitivity drops below 80% OR specificity drops below 75%
- Physician feedback: Track overrides, false positives, false negatives
Safety Reporting:
- Adverse events (patient harm) reported to FDA within 30 days
- Quarterly summary report to FDA (performance metrics, user feedback)
Model Updates:
- Annual retraining with new data (subject to FDA review)
- Version control: All model versions documented, old versions archived
Why This Matters: FDA Pre-Cert assumes continuous improvement (not "set it and forget it").
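The alert triggers in the monitoring plan above translate directly into a check you can run against each monthly sample. A minimal sketch (hypothetical function name; the 80%/75% floors come from the example plan):

```python
# Floors from the example monitoring plan above.
SENSITIVITY_FLOOR = 0.80
SPECIFICITY_FLOOR = 0.75

def check_monthly_metrics(sensitivity: float, specificity: float) -> list:
    """Return alert messages when production metrics cross the documented
    floors; an empty list means the model stays in service."""
    alerts = []
    if sensitivity < SENSITIVITY_FLOOR:
        alerts.append(f"ALERT: sensitivity {sensitivity:.0%} below 80% floor")
    if specificity < SPECIFICITY_FLOOR:
        alerts.append(f"ALERT: specificity {specificity:.0%} below 75% floor")
    return alerts

print(check_monthly_metrics(0.87, 0.82))  # [] -- within spec
```

In a real deployment this would feed a pager or auto-disable flow, but the documentation point is the same: the triggers are explicit numbers, not "we'll keep an eye on it."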
Real Example: Diabetic Retinopathy Detection AI
Product: AI analyzes retinal images, flags diabetic retinopathy.
FDA Submission:
Intended Use: Screen diabetic patients for retinopathy in primary care settings (not ophthalmology clinics).
Model: Convolutional neural network (ResNet-50 architecture)
Training Data: 120,000 retinal images from 5 hospital systems (2015-2020)
Evaluation:
- Sensitivity: 92% (FDA target: >85%)
- Specificity: 88%
- Comparison: Ophthalmologist sensitivity 95% (AI -3pp, acceptable for screening)
Limitations:
- Not for patients with cataracts (image quality too poor)
- Requires human ophthalmologist to confirm positive findings
Post-Market:
- Monthly monitoring: Random sample of 1,000 images re-reviewed by ophthalmologist
- Alert: If AI sensitivity drops below 88%, auto-disable pending investigation
FDA Decision: Approved (6 months from submission to clearance).
Why It Worked: Documentation was complete upfront. No back-and-forth with FDA.
The Data Card (Companion to Model Card)
What FDA Wants (separate document):
- Data provenance: IRB approval, patient consent, HIPAA compliance
- Bias testing: Performance by race, sex, age, socioeconomic status
- Data retention: How long you keep training data, why
- Data security: Encryption, access controls, audit logs
Example Snippet:
DATA CARD
Provenance:
- Source: [Hospital System] EHR database
- IRB: Approved under Protocol #12345, waiver of consent (de-identified data)
- HIPAA: Compliant (Business Associate Agreement signed)
Bias Testing:
- Racial parity: Sensitivity within 5pp across racial groups
- Sex parity: Sensitivity within 3pp (female 88%, male 86%)
- Age: Lower sensitivity for patients >70 (79% vs. 87% for 40-60 age group)
→ Mitigation: Added warning for physicians treating elderly patients
Data Retention:
- Training data: Retained for 10 years (FDA device record requirement)
- Production data: De-identified logs retained for 3 years (monitoring)
Data Security:
- Encryption: AES-256 at rest, TLS 1.3 in transit
- Access: Role-based (PM, ML engineer, clinical validator—7 people total)
- Audit logs: Reviewed quarterly by compliance team
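The parity bounds in the bias-testing section reduce to one number: the largest gap in a metric across subgroups, in percentage points. A sketch of that check (hypothetical function name; the female/male values come from the example data card):

```python
def parity_gap(metric_by_group: dict) -> float:
    """Largest gap (in percentage points) for one metric across subgroups,
    e.g. sensitivity keyed by sex or race."""
    vals = metric_by_group.values()
    return round((max(vals) - min(vals)) * 100, 1)

sens_by_sex = {"female": 0.88, "male": 0.86}
assert parity_gap(sens_by_sex) == 2.0  # within the documented 3pp bound
```

When the gap exceeds your documented bound, the data card should record a mitigation (as the age example above does with its physician warning), not just the number.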
Checklist: Is Your Model Card FDA-Ready?
- Intended use (specific medical condition, patient population, clinical setting)
- Model architecture (algorithm, inputs, outputs, threshold)
- Training data (source, volume, demographics, quality control)
- Evaluation metrics (sensitivity, specificity, AUC, subgroup performance)
- Comparison to human clinician (is AI better/worse?)
- Clinical impact (does AI improve patient outcomes?)
- Limitations (failure modes, contraindications, required oversight)
- Post-market surveillance (monitoring plan, safety reporting, update schedule)
If any box is unchecked, FDA will request more documentation.
Common PM Mistakes
Mistake 1: Claiming "General Purpose" AI
- Reality: FDA requires narrow, well-defined medical use cases
- Fix: Specify exact condition, population, setting (not "health screening")
Mistake 2: No Bias Testing
- Reality: FDA will reject if you haven't tested performance across demographics
- Fix: Report sensitivity/specificity by race, sex, age (minimum)
Mistake 3: No Post-Market Plan
- Reality: FDA Pre-Cert assumes you'll monitor and update the AI
- Fix: Document monitoring frequency, alert triggers, update process
Alex Welcing is a Senior AI Product Manager in New York who writes FDA-ready model cards before submitting medical device AI. His regulatory approvals take 6 months, not 18, because documentation is a product requirement from day one.