Data Quality Checking in OMOP: A Practical Guide

Robert Anderson, PhD
April 7, 2026
20 min read

Clinical teams usually notice a data quality problem late. The dashboard looks plausible. The cohort counts are close to expectation. The phenotype logic passes review. Then someone drills into the records and finds that local diagnosis codes were mapped inconsistently, visit dates crossed impossible timelines, or a supposedly standard concept was not standard at all.

That kind of failure is common in OMOP work because the pipeline can run cleanly while the semantics drift underneath it. Tables load. dbt models finish. Analysts keep moving. Bad mappings and weak validation rules do not always crash the ETL. They distort it.

In regulated healthcare, that is not a minor inconvenience. It affects research reproducibility, operational reporting, compliance reviews, and trust in every downstream model. Good data quality checking is not a polishing step after ETL. It is part of the ETL design.

The Silent Failure of Clinical Data Pipelines

A familiar pattern shows up in clinical data programs. A team finishes a major OMOP load, hands the dataset to researchers, and everyone focuses on analysis. Weeks later, a discrepancy appears. A condition prevalence looks off. A medication exposure trend does not match what clinicians expect. The SQL is fine. The study logic is fine. The issue sits upstream in vocabulary mapping, duplicate ingestion, or a missing field that got defaulted.


The worst pipeline failures are the quiet ones. A hard failure is painful, but at least it stops the run. A silent failure produces data that looks complete enough to use. That is what makes it dangerous in OMOP environments, especially when multiple source systems feed the same vocabulary layer.

What silent failure looks like in practice

In healthcare ETL, these are the issues that repeatedly slip through when teams only validate schema and row counts:

  • Mapped but wrong: A source code resolves to a concept, but it is not the right standard concept for the intended domain.
  • Complete but implausible: Required fields are present, yet the values break clinical logic, such as a death date before a birth date.
  • Loaded twice: EHR integration jobs create duplicate events that inflate counts and distort incidence analysis.
  • Technically valid, operationally stale: Local vocabulary tables fall behind official releases, so mappings age out of sync with the current standard.
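
The first failure mode, mapped but wrong, can be screened with a single join against the concept table. A minimal pandas sketch, using illustrative concept IDs and a hypothetical `flag_bad_condition_mappings` helper (the column names follow the OMOP CDM):

```python
import pandas as pd

# Illustrative concept reference: one standard Condition concept,
# one non-standard concept, one standard concept in the wrong domain.
concept = pd.DataFrame({
    "concept_id": [101, 102, 103],
    "standard_concept": ["S", None, "S"],
    "domain_id": ["Condition", "Condition", "Procedure"],
})

conditions = pd.DataFrame({
    "condition_occurrence_id": [1, 2, 3],
    "condition_concept_id": [101, 102, 103],
})

def flag_bad_condition_mappings(conditions, concept):
    """Return condition rows whose concept is non-standard or in the wrong domain."""
    merged = conditions.merge(
        concept, left_on="condition_concept_id", right_on="concept_id", how="left"
    )
    bad = merged[
        (merged["standard_concept"] != "S") | (merged["domain_id"] != "Condition")
    ]
    return bad[["condition_occurrence_id", "condition_concept_id"]]

print(flag_bad_condition_mappings(conditions, concept))
```

Rows 2 and 3 are flagged even though both resolved to a concept: resolution alone is not correctness.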

Poor data quality costs the US economy an estimated $3.1 trillion annually, and in healthcare the risk is amplified by regulatory demands such as HIPAA. The same source notes that systematic checks at ingestion, transformation, and post-load stages can reduce transformation error rates below 0.5% (Precisely).

Why OMOP makes quality checking stricter

OMOP gives you consistency only if you enforce it. The model is structured, but the source data arriving at that structure is not. Local codes, custom billing values, free-text fragments, and site-specific conventions all collide in the ETL.

That is why mature data quality checking in OMOP has to cover both structure and meaning. It is not enough to ask whether a value fits a column type. You also need to ask whether the concept is current, standard, domain-appropriate, and clinically plausible in context.

Tip: If a check only proves that a row loaded, it is not a strong healthcare quality check. The useful checks are the ones that can stop bad clinical meaning from entering production.

A Framework for Data Quality in OMOP

A workable OMOP quality program needs a clear model. I use five pillars: conformance, completeness, plausibility, uniqueness, and consistency. They map cleanly to the kinds of failures that break clinical pipelines, and they are specific enough to turn into automated checks.


These pillars are practical in OMOP because they line up with how the model is built and how ETL failures usually surface. Teams new to the model should review the OMOP Common Data Model structure and table design before formalizing checks, especially if source systems contain a mix of local codes, claims data, and EHR extracts.

The five pillars in OMOP terms

| Pillar | Key Question | OMOP Example |
| --- | --- | --- |
| Conformance | Does the data fit the model and rule set? | person_id uses the expected type, required date fields are valid, concept fields contain valid identifiers |
| Completeness | Is required data present for the intended analysis? | Mandatory columns are populated and source records that should map to standard concepts do |
| Plausibility | Do the values make clinical and temporal sense? | Birth occurs before visits, observation periods are ordered correctly, lab values stay within realistic ranges |
| Uniqueness | Are duplicate records inflating counts or timelines? | Replayed source feeds create duplicate visit_occurrence or observation rows |
| Consistency | Do related values agree across tables, domains, and vocabulary versions? | Person, visit, and concept relationships remain valid across the full dataset |

Conformance defines the minimum bar

Conformance answers the first operational question. Can this dataset enter the OMOP environment without breaking the model?

That includes schema checks, data types, required fields, foreign key expectations, and accepted value patterns. In healthcare, I also treat vocabulary fit as part of conformance. A source code can pass a string-format check and still be wrong if it maps to a non-standard concept, an invalid concept, or the wrong domain. That trade-off matters in production because basic SQL validation is cheap to run, but it does not protect clinical meaning.
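
One way to express that minimum bar is a small declarative spec checked against each incoming table. A sketch assuming pandas dataframes; the `PERSON_SPEC` dtypes here are an assumption for illustration, not the official DDL:

```python
import pandas as pd

# Minimal conformance spec for person. Column names follow the OMOP CDM;
# the dtype expectations are assumptions for this sketch.
PERSON_SPEC = {
    "person_id": {"dtype": "int64", "required": True},
    "gender_concept_id": {"dtype": "int64", "required": True},
    "year_of_birth": {"dtype": "int64", "required": True},
}

def conformance_report(df, spec):
    """Compare a dataframe against a column/dtype/required spec."""
    rows = []
    for col, rules in spec.items():
        present = col in df.columns
        rows.append({
            "column": col,
            "present": present,
            "dtype_ok": present and str(df[col].dtype) == rules["dtype"],
            "nulls": int(df[col].isna().sum()) if present else None,
        })
    return pd.DataFrame(rows)

person = pd.DataFrame({
    "person_id": [1, 2],
    "gender_concept_id": [8507, 8532],
    "year_of_birth": [1980, 1975],
})
print(conformance_report(person, PERSON_SPEC))
```

Keeping the spec in version control alongside the ETL makes conformance reviewable in the same way as transformation logic.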

Completeness has to be use-case aware

A table can look full and still be unusable.

In OMOP, completeness is partly about missing values, but the harder problem is missing meaning. An encounter that lands in visit_occurrence without the expected provider context, a procedure that never receives a standard concept, or a drug exposure stripped of date detail may still load cleanly while weakening downstream studies. Good completeness checks are tied to the analytical purpose of the data, not just to column population.
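
A completeness check tied to analytical purpose can be as simple as a mapping-rate threshold per table. A sketch using the OMOP convention that concept_id 0 means unmapped; the `mapping_rate` helper and sample values are illustrative:

```python
import pandas as pd

def mapping_rate(df, concept_col):
    """Share of rows carrying a real standard concept.
    In OMOP convention, concept_id 0 means the source value failed to map."""
    mapped = (df[concept_col].fillna(0) != 0).sum()
    return mapped / len(df)

# procedure_concept_id values are illustrative
procedures = pd.DataFrame({
    "procedure_occurrence_id": [1, 2, 3, 4],
    "procedure_concept_id": [4021107, 0, 4021107, 0],
})

rate = mapping_rate(procedures, "procedure_concept_id")
print(f"mapped: {rate:.0%}")  # 50% mapped in this sample
```

The threshold that fails the load should come from the study's needs: a cohort built on procedures cannot tolerate a mapping rate that a descriptive dashboard might.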

Plausibility catches ETL defects that look legitimate

Plausibility checks screen for records that fit the schema but fail clinical review. Those are common in mature pipelines because many defects come from transformation logic, source system quirks, or stale assumptions rather than from obvious load failures.

Examples are straightforward. Pediatric records should not carry geriatric diagnoses without review. Observation periods should not end before they start. Units and values should agree. These checks take more effort than null counts, but they prevent analysts from working with data that is technically present and clinically wrong.
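
The observation-period rule, for instance, reduces to a one-line comparison. A pandas sketch with a two-row sample; the `misordered_periods` helper is illustrative:

```python
import pandas as pd

def misordered_periods(obs):
    """Flag observation periods that end before they start."""
    start = pd.to_datetime(obs["observation_period_start_date"], errors="coerce")
    end = pd.to_datetime(obs["observation_period_end_date"], errors="coerce")
    return obs[end < start]

obs = pd.DataFrame({
    "observation_period_id": [1, 2],
    "observation_period_start_date": ["2020-01-01", "2021-06-01"],
    "observation_period_end_date": ["2020-12-31", "2021-05-01"],
})
print(misordered_periods(obs))  # period 2 ends before it starts
```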

Uniqueness protects study results

Duplicate data is rarely a one-table problem. It usually starts with replayed interface messages, weak merge keys, or ETL reruns that are not idempotent.

In OMOP, uniqueness checks should focus on business meaning. Two rows can be distinct at the database level and still represent the same clinical event. I usually test for repeated combinations such as person, source event identifier, date, concept, and visit context. That approach catches the duplicates that distort incidence, utilization, and longitudinal analysis, rather than just the ones that violate a primary key.
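
That combination test is easy to express as a business-key duplicate check. A pandas sketch, assuming visit rows with distinct surrogate keys that share a source encounter; the `BUSINESS_KEY` column choice is one reasonable option, not a standard:

```python
import pandas as pd

# One candidate business key for visits; adjust to your source semantics.
BUSINESS_KEY = ["person_id", "visit_source_value",
                "visit_start_date", "visit_concept_id"]

def business_duplicates(df, key_cols=BUSINESS_KEY):
    """Rows sharing a business key even though their surrogate keys differ."""
    return df[df.duplicated(subset=key_cols, keep=False)].sort_values(key_cols)

visits = pd.DataFrame({
    "visit_occurrence_id": [10, 11, 12],  # distinct surrogate keys
    "person_id": [1, 1, 2],
    "visit_source_value": ["ENC-001", "ENC-001", "ENC-002"],
    "visit_start_date": ["2023-03-01", "2023-03-01", "2023-03-02"],
    "visit_concept_id": [9202, 9202, 9201],
})
print(business_duplicates(visits))  # rows 10 and 11 are the same encounter
```

A primary-key check would pass this table; the business-key check is what protects the incidence numbers.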

Consistency is where API-first validation starts to matter

Consistency is the pillar that exposes whether the full pipeline still holds together after transformation, mapping, and vocabulary refreshes. It includes cross-table agreement, temporal logic, and vocabulary alignment.

This is also where static local vocabulary tables become a liability. A mapping that was acceptable last quarter may now point to an invalid or non-standard concept, and many teams do not notice until study review. An API-first approach helps here. Programmatic checks against live standardized vocabularies, including SNOMED and LOINC lookups through OMOPHub, let teams verify that codes and mappings are still current, standard, and domain-appropriate as part of routine validation rather than a separate manual audit.

Key takeaway: A useful OMOP quality framework checks whether data loaded correctly, whether it still means the right thing, and whether that meaning survives vocabulary change.

Implementing Foundational Validation Rules

A pipeline can load millions of rows and still fail the first serious chart review because nobody tested the basics at the point where defects were introduced. The fix is usually less complicated than the postmortem. Put a small set of high-signal rules at each stage of the flow, fail the build on hard defects, and keep the outputs readable enough that an analyst or ETL engineer can trace the problem back to source.

For OMOP, I use three execution points. Profile raw extracts before transformation. Validate business rules during transformation, while source fields and mapping decisions are still visible. Recheck after load in the target tables, where you confirm the data survived ETL in the form analysts will query.

Start with nulls, dates, and duplicates

These checks catch production defects early and with very little setup.

SQL for critical nulls in person

SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN person_id IS NULL THEN 1 ELSE 0 END) AS null_person_id,
  SUM(CASE WHEN gender_concept_id IS NULL THEN 1 ELSE 0 END) AS null_gender_concept_id,
  SUM(CASE WHEN year_of_birth IS NULL THEN 1 ELSE 0 END) AS null_year_of_birth
FROM person;

This query is basic on purpose. It answers a question every OMOP team should ask after load. Are required fields populated in the table that downstream studies depend on? If any of these counts are non-zero, stop and inspect the extract, join logic, and concept mapping before the data reaches a dashboard or model.

SQL for impossible person timelines

SELECT
  p.person_id,
  p.year_of_birth,
  d.death_date
FROM person p
JOIN death d
  ON p.person_id = d.person_id
WHERE d.death_date IS NOT NULL
  AND EXTRACT(YEAR FROM d.death_date) < p.year_of_birth;

This is a minimum plausibility rule. It will not prove the record is correct, but it will catch timelines that should never survive validation in a regulated setting.

SQL for duplicate primary identifiers

SELECT
  person_id,
  COUNT(*) AS row_count
FROM person
GROUP BY person_id
HAVING COUNT(*) > 1;

Post-load duplicate checks on OMOP identifiers are still necessary, even if the upstream pipeline already checks natural keys. In healthcare feeds, duplicated extracts, replayed messages, and retry logic can create clean-looking rows that represent the same event twice. Keep both checks. One protects data integrity in the warehouse. The other protects study results.

Add range and boundary checks where clinical logic matters

Type checks are not enough. A string can parse as a date and still be wrong. A numeric lab value can fit the column definition and still be clinically impossible.

SQL for age outliers

SELECT
  person_id,
  year_of_birth
FROM person
WHERE year_of_birth IS NOT NULL
  AND EXTRACT(YEAR FROM CURRENT_DATE) - year_of_birth > 120;

Use this as a triage rule, not a blanket deletion rule. Extreme ages often come from default values, century shifts, manual entry mistakes, or source system conversions. In some datasets, a few records will be legitimate. The right action is usually quarantine plus review, not automatic removal.

SQL for lab values outside physiological boundaries

SELECT
  measurement_id,
  person_id,
  measurement_concept_id,
  value_as_number,
  unit_concept_id
FROM measurement
-- Scope this to one analyte and unit in practice; the 25 here is an
-- illustrative threshold, not a universal physiological limit.
WHERE value_as_number IS NOT NULL
  AND value_as_number > 25;

Rules like this work best when they are tied to a specific measurement and unit expectation. A threshold that makes sense for one analyte can be nonsense for another. The common failure mode is copying a generic numeric range rule across the whole measurement table and generating noise nobody trusts. Set boundaries where the clinical meaning is clear, and document the rationale.
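
One way to keep boundaries scoped and documented is a rule table keyed by measurement concept and unit. A Python sketch in which both the concept/unit IDs and the boundaries are hypothetical placeholders, not clinical reference ranges:

```python
import pandas as pd

# Keys are (measurement_concept_id, unit_concept_id). Both the IDs and
# the boundaries below are hypothetical placeholders for illustration.
RANGE_RULES = {
    (1001, 2001): (2.0, 7.0),
    (1002, 2002): (40.0, 300.0),
}

def out_of_range(measurements, rules):
    """Return measurement_ids whose value breaks the rule for their analyte+unit."""
    flagged = []
    for _, row in measurements.iterrows():
        key = (row["measurement_concept_id"], row["unit_concept_id"])
        if key in rules and pd.notna(row["value_as_number"]):
            low, high = rules[key]
            if not (low <= row["value_as_number"] <= high):
                flagged.append(row["measurement_id"])
    return flagged

measurements = pd.DataFrame({
    "measurement_id": [1, 2, 3],
    "measurement_concept_id": [1001, 1001, 1002],
    "unit_concept_id": [2001, 2001, 2002],
    "value_as_number": [4.1, 9.8, 120.0],
})
print(out_of_range(measurements, RANGE_RULES))  # [2]
```

Because the rule table is data, each boundary can carry a rationale and an owner alongside it, which keeps the alert volume trustworthy.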

That preparation also makes later vocabulary validation easier. If a measurement is out of range because the source code was mapped to the wrong LOINC concept, you want the foundational rule to catch the symptom and the API-based vocabulary check to confirm the mapping problem against current standards such as SNOMED and LOINC through OMOPHub.

Use Python for repeatable rule packaging

SQL is still the fastest way to inspect a defect in place. Python helps when the same rules need to run in Airflow, GitHub Actions, Azure DevOps, or another orchestration layer with consistent outputs.

import pandas as pd

def check_required_columns(df: pd.DataFrame, required_cols):
    results = []
    for col in required_cols:
        null_count = int(df[col].isna().sum()) if col in df.columns else None
        results.append({
            "column": col,
            "exists": col in df.columns,
            "null_count": null_count
        })
    return pd.DataFrame(results)

def check_duplicate_keys(df: pd.DataFrame, key_cols):
    dupes = df[df.duplicated(subset=key_cols, keep=False)].copy()
    return dupes.sort_values(key_cols)

def check_birth_before_visit(person_df: pd.DataFrame, visit_df: pd.DataFrame):
    merged = visit_df.merge(
        person_df[["person_id", "year_of_birth"]],
        on="person_id",
        how="left"
    )
    merged["visit_year"] = pd.to_datetime(merged["visit_start_date"], errors="coerce").dt.year
    failures = merged[
        merged["year_of_birth"].notna() &
        merged["visit_year"].notna() &
        (merged["year_of_birth"] > merged["visit_year"])
    ]
    return failures[["person_id", "year_of_birth", "visit_start_date"]]

Keep these rules simple. Readable validation code is easier to review with compliance teams, easier to version, and easier to debug at 2 a.m. after a vocabulary refresh or source schema change. Dense test logic tends to rot because only the original author is willing to touch it.

Practical tips that save time later

  • Check raw payload shape early: If you ingest JSON from APIs or partner systems, validate structure before transformation starts. A quick tool to check JSON syntax helps during incident response when malformed payloads cause downstream failures.
  • Separate blocking rules from warning rules: Duplicate identifiers, missing required columns, and invalid foreign key relationships should fail the pipeline. Age outliers or rare unit anomalies may belong in quarantine with review workflow.
  • Store failed rows, not only counts: Metrics support trend reporting. Failed-record extracts are what data stewards and interface teams need to fix the issue.
  • Version the rule set: Teams working under audit or validation requirements should be able to show which rules ran for a given ETL release, vocabulary version, and source delivery.
  • Tune thresholds with actual source behavior: A strict rule that fires on thousands of known edge cases will be ignored. A narrower rule with clear remediation ownership will be maintained.

Tip: Start with the defects your team has already seen in production. Those rules earn trust quickly, and they create the base layer for the semantic checks that come next.

Validating Vocabulary Mappings with the OMOPHub API

Static vocabulary extracts are still common in OMOP pipelines. They are also one of the easiest places for quality drift to hide.

The usual pattern is familiar. A team downloads ATHENA files, loads them into a local database, maps source codes, and revisits the vocabulary only when something breaks or a new release is scheduled. That works for a while. Then the local tables fall behind, a concept changes status, and the ETL keeps producing records that are structurally valid but semantically weak.


A 2025 PMC study highlighted that many data quality methods do not programmatically validate against standardized vocabularies, which is a serious gap for OMOP users. The same source notes vocabulary drift when local mappings desynchronize from official releases, and reports that ATHENA version counts increased by 15% in recent years (PMC study).

Why static tables fail in day-to-day operations

Local vocabulary tables are not wrong. They are just easy to neglect.

The trade-off is operational:

  • Local tables give you direct control and familiar SQL access.
  • API-first validation gives you a current validation surface, easier version tracking, and a cleaner audit trail for mapping checks.

For teams maintaining multiple pipelines, API checks are often easier to standardize than custom vocabulary databases in each environment. They also work better when you need mapping validation inside Python and R jobs, not only inside warehouse SQL.

If you are working through standard-to-source and source-to-standard logic, the vocabulary concept maps guide is a practical companion to this workflow.

What to validate programmatically

For every code or concept entering an OMOP target table, I want to know at least four things:

  1. Does the concept exist?
  2. Is it standard?
  3. Does it belong to the expected domain?
  4. Does the mapping still align with the vocabulary version we intend to use?

That is where an API-first workflow is useful. OMOPHub is one option for this pattern. It provides API access to ATHENA-standardized vocabularies, along with SDKs for Python and R, so validation can run inside ETL code rather than as a separate manual review step. The documentation lives at docs.omophub.com, and the web-based Concept Lookup tool is handy for quick spot checks during mapping review.

Python example for concept validation

The examples below follow the patterns in the published SDKs and documentation. The safest practitioner approach is to keep the validation wrapper thin and explicit.

from omophub import OMOPHub

client = OMOPHub(api_key="YOUR_API_KEY")

code = "E11.9"

results = client.concepts.search(
    query=code,
    vocabulary=["ICD10CM"]
)

for concept in results:
    print({
        "concept_id": concept.get("concept_id"),
        "concept_name": concept.get("concept_name"),
        "standard_concept": concept.get("standard_concept"),
        "domain_id": concept.get("domain_id"),
        "vocabulary_id": concept.get("vocabulary_id"),
        "invalid_reason": concept.get("invalid_reason")
    })

The point of a check like this is not just lookup. It is enforcement. In production, the ETL should reject or quarantine codes that return no valid match, map to a non-standard concept where a standard one is required, or land in the wrong domain for the target OMOP table.

A simple wrapper might look like this:

def validate_condition_code(client, code):
    matches = client.concepts.search(query=code, vocabulary=["ICD10CM"])
    failures = []

    if not matches:
        failures.append((code, "no match"))

    for concept in matches:
        if concept.get("invalid_reason") is not None:
            failures.append((code, "invalid concept"))
        if concept.get("domain_id") != "Condition":
            failures.append((code, "wrong domain"))
    return failures

R example for pipeline-side review

R remains common in OMOP analytics and DQ review, so the same idea should be available there.

library(omophub)

client <- OMOPHub$new(api_key = "YOUR_API_KEY")

results <- client$concepts$search(
  query = "4548-4",
  vocabulary = c("LOINC")
)

print(results)

This pattern is useful for validating laboratory mappings before values land in measurement. If the code exists but maps outside the expected clinical domain, that should not be treated as a successful transformation.

What works better than ad hoc spot checks

Manual spot checks are fine for concept set exploration. They are weak as a production control.

What works better is:

  • Validate during ingestion: Catch codes that cannot map cleanly before expensive transforms.
  • Validate at transformation time: Confirm the chosen concept is standard and domain-appropriate.
  • Log the vocabulary version context: Keep an auditable record of what was validated and when.
  • Re-run targeted checks after vocabulary updates: Focus on previously problematic mappings and high-risk source systems.
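
Logging the vocabulary version context can be as lightweight as one structured record per validated code. A sketch; the field names and the release label are assumptions, not a fixed schema:

```python
import json
from datetime import datetime, timezone

def validation_record(code, concept_id, status, vocab_release):
    """One auditable vocabulary-check result; field names are a sketch."""
    return {
        "source_code": code,
        "concept_id": concept_id,
        "status": status,  # e.g. "valid", "non_standard", "no_match"
        "vocabulary_release": vocab_release,  # label format is an assumption
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

record = validation_record("E11.9", 201826, "valid", "v20240801")
print(json.dumps(record, indent=2))
```

Appending these records to a table gives compliance reviewers exactly what they ask for: which codes were validated, against which release, and when.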

Tip: Treat vocabulary validation as a first-class data quality check, not as terminology administration. In OMOP, bad mappings are data quality defects.

Automating Data Quality in a CI/CD Pipeline

Data quality checking breaks down when it depends on a person remembering to run a script before release. That model does not survive team growth, multiple source feeds, or frequent vocabulary changes.

The practical fix is to treat data quality like application quality. Put tests in version control. Run them on every relevant change. Fail builds when high-risk rules break. Record results in a place the team watches.

A hand reaching towards a digital flow of data passing through quality check portals and gears.

A workable CI pattern for OMOP teams

A clean pipeline usually has four validation points:

  • Pre-merge checks: SQL linting, unit tests, schema rules, and lightweight mapping validation.
  • Build-time checks: dbt tests, custom SQL assertions, and synthetic sample validations.
  • Post-load checks: OMOP table validation, referential integrity, and plausibility rules.
  • Monitoring after deployment: trend tracking, alerting, and issue routing.
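
Build-time and post-load checks do not need a heavy framework to start; plain pytest-discoverable functions are enough for CI to fail the build. A sketch in which load_person() is a stand-in for however your build reads the loaded table:

```python
import pandas as pd

# load_person() is a placeholder for your build's table access;
# the test functions are plain pytest-discoverable checks.
def load_person():
    return pd.DataFrame({
        "person_id": [1, 2, 3],
        "year_of_birth": [1980, 1975, 1990],
    })

def test_person_ids_unique():
    person = load_person()
    assert not person["person_id"].duplicated().any()

def test_required_fields_populated():
    person = load_person()
    assert person["person_id"].notna().all()
    assert person["year_of_birth"].notna().all()
```

Running these under pytest in the pipeline means a duplicate identifier blocks the merge the same way a failing unit test would.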

dbt test and Great Expectations both fit well here. dbt is strong for warehouse-native assertions close to transformation logic. Great Expectations is useful when teams want richer validation definitions, data docs, and Python-based orchestration around expectations.

The workflow should also include a quality dashboard. The OMOPHub data quality dashboard article is a useful reference for thinking about how to surface conformance, completeness, and plausibility signals in an OMOP setting.

What to automate first

Do not automate every possible test on day one. Automate the checks that are painful to review manually and expensive to miss.

A strong first batch usually includes:

  • Critical null checks: Required identifiers, dates, and standard concept fields.
  • Duplicate detection: Primary keys, source event keys, and high-risk merge points.
  • Rule failures: Temporal logic, impossible values, and domain mismatches.
  • Vocabulary validation failures: Codes that no longer resolve or no longer qualify for the target table.

Teams coming from software engineering often find it useful to frame this alongside broader DevOps Quality Assurance practices. The mindset is the same. Quality is enforced through repeatable controls, not heroic manual review.

After your rules are stable, add alert routing and remediation states. A failed data quality check should create an actionable event with an owner, not just a red mark in a build log.

Monitoring matters after the deploy

Release-time checks are necessary. They are not enough.

Source systems change without notice. Documentation style in clinical notes evolves. A billing workflow gets updated. A partner feed starts sending a field as a different type. Some failures appear only in production patterns over time.

That is especially true if AI or NLP is part of the quality layer. A critical challenge in AI-driven data quality is model drift. In clinical NLP, drift can cause 12-18% accuracy drops in six months, and continuous monitoring plus biphasic human-AI validation are essential even though practical frameworks remain under-discussed (IMO Health).


The trade-off many teams learn the hard way

Fully blocking pipelines on every anomaly creates alert fatigue. Letting everything through creates distrust.

The better model is tiered enforcement:

| Severity | Example | Action |
| --- | --- | --- |
| High | Duplicate primary keys, invalid required concept mappings | Fail the pipeline |
| Medium | Plausibility outliers, unusual null spikes | Quarantine and alert |
| Low | Non-critical distribution shifts | Monitor and review |

That structure keeps the pipeline strict where it must be strict, and flexible where investigation is more appropriate than immediate failure.
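
The tiering above can be encoded directly, so routing is code rather than convention. A minimal sketch with hypothetical rule results:

```python
from enum import Enum

class Severity(Enum):
    HIGH = "fail_pipeline"
    MEDIUM = "quarantine_and_alert"
    LOW = "monitor"

def dispatch(results):
    """Route every rule that found failures to the action for its tier."""
    return [(name, sev.value) for name, sev, failures in results if failures > 0]

# Hypothetical rule results: (rule name, severity, failure count)
results = [
    ("duplicate_person_id", Severity.HIGH, 0),
    ("age_outliers", Severity.MEDIUM, 12),
    ("unit_distribution_shift", Severity.LOW, 3),
]
print(dispatch(results))
```

Here nothing fails the build: the only findings are medium and low severity, so they route to quarantine and monitoring instead of blocking the release.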

Key takeaway: CI/CD for healthcare data is not complete until data quality checking is part of the release contract.

Building a Culture of Data Quality

Tools help. Rules help. Automation helps more. None of it lasts if the organization treats data quality as a side task owned only by the ETL team.

In OMOP environments, durable quality comes from shared responsibility. Source system owners need to understand which fields break mappings. Data engineers need to encode checks where failures occur. Researchers need to report suspicious outputs as quality signals, not just analysis oddities. Compliance and platform teams need auditability and release discipline built into the process.

The most effective teams start smaller than they expected. They do not launch with a giant framework and hundreds of brittle checks. They start with a short list of failure modes that already caused pain. Then they automate those rules, assign owners, and review the failures in the same way they review pipeline incidents.

A few habits consistently work well:

  • Tie checks to business use: Validate the fields and mappings that affect cohort definition, reporting, and regulated outputs first.
  • Keep remediation visible: Failed records should move into a known queue with ownership, notes, and resolution history.
  • Review rule drift: The test suite itself needs maintenance as source systems and vocabularies evolve.
  • Reward prevention: Fix recurring defects at the source instead of cleaning them repeatedly downstream.

Data quality checking in OMOP is not a compliance ornament. It is how you preserve semantic integrity from source extraction through analytics. If the data team can prove that codes are current, mappings are appropriate, duplicates are controlled, and plausibility rules are enforced, the rest of the organization moves faster because it trusts the dataset.

The practical shift is simple. Stop treating data quality as a report about what already went wrong. Treat it as part of how the pipeline is built, reviewed, deployed, and maintained.


If your team needs a developer-friendly way to validate standardized vocabularies during OMOP ETL, OMOPHub provides API access to ATHENA vocabularies, SDKs for common languages, and workflow support for programmatic concept validation without standing up and maintaining a local vocabulary database.
