The OHDSI Data Model: A Practical Explainer for 2026

Dr. Rachel Green
June 23, 2026

A new team member usually meets OMOP at the worst possible moment. The study question is already defined, the analysts want cohorts next week, and the source data is scattered across an EHR warehouse, claims extracts, and a registry feed that all describe the same clinical events differently.

One system stores diagnoses in ICD-10-CM. Another carries local billing codes and free-text descriptions. Medications arrive as NDC in one feed and local formulary identifiers in another. Lab results are structured, but the units drift and the code systems don't line up cleanly. Everyone agrees the data is “there,” but nobody can answer a simple cross-source question without arguing about code mappings first.

That's the problem the OHDSI data model was built to solve. Not just a schema problem. A semantics problem, an ETL problem, and a governance problem. When people talk about OMOP as if it's only a set of tables, they miss the hard part. The hard part is making different systems mean the same thing in a way that survives audits, reruns, and multi-site analysis.

From Data Chaos to Clinical Insight

A common scene in observational research starts with a reasonable request: identify patients with diabetes, look at medication exposure, then compare outcomes across sites. The request sounds straightforward until the team opens the source systems.

The EHR has diagnosis records tied to encounters. Claims has billing-oriented codes with different timing and granularity. A registry has curated disease information but only for a subset of patients. The same person may appear differently in each feed, and the same condition may be represented with multiple code systems and local conventions.

Where teams usually get stuck

The first bottleneck isn't SQL. It's agreement.

Analysts ask which code set defines the phenotype. ETL developers ask which source is authoritative when dates disagree. Clinical reviewers ask whether a mapped concept is broad enough or too broad. If the answer to “what counts as this condition?” changes by source, every downstream number shifts.

A few patterns show up again and again:

  • Code fragmentation: One concept is spread across ICD, SNOMED CT, local terms, and legacy values.
  • Event ambiguity: A diagnosis can represent a rule-out, a billing artifact, or a clinically established condition.
  • Context loss: Source data often omits the standardized context needed for comparable analysis.
  • Rework loops: Teams remap the same concepts for each project because the mapping logic was never turned into a reusable asset.

The data problem in observational research usually isn't lack of records. It's lack of consistency.

Why a common model changes the work

A common data model gives the team a shared structure for people, visits, conditions, drugs, procedures, measurements, and observations. Beyond this, it gives those records a standardized vocabulary layer so that “diabetes” doesn't have ten competing meanings depending on which feed you queried.

That's why OMOP matters in practice. It lets data engineers build once, quality teams validate once, and analysts reuse patterns across studies. You still have to make hard decisions, but you make them explicitly and capture them in ETL and vocabulary logic instead of hiding them in one-off scripts.

History and Purpose of the OMOP CDM

A team usually starts caring about OMOP after the first failed cross-source analysis. Claims says one thing, the EHR says another, and every site has coded the same clinical idea differently. OMOP was created to make that kind of observational data usable for repeatable evidence generation, then matured through OHDSI into a shared standard with common methods, open tooling, and working governance.

That history matters because it explains the model's priorities. OMOP was built for distributed research across institutions that do not hand raw patient-level data to a central party. The design supports local data control, shared study logic, and results that can be compared across sites without rewriting every analysis from scratch.

Why the model looks the way it does

The schema reflects research operations, not just storage design.

Clinical events are separated from the vocabulary layer so teams can preserve source detail while mapping to standard concepts used in analysis. Domain-specific tables create stable targets for cohort definitions, characterization, safety studies, and treatment pathway work. In practice, that means the ETL has to do more than load rows. It has to make explicit decisions about what an event means, which standard concept represents it, and how much source nuance must be retained for auditability.

That design works well for several common use cases:

  • Comparative effectiveness research: Teams can define exposures, outcomes, and covariates in a form that other sites can run with limited local rewrites.
  • Safety surveillance: Repeated signal detection depends on consistent event representation across data partners.
  • Phenotyping: Concept sets can be reviewed, versioned, and reused instead of rebuilt for each study.
  • Real-world evidence programs: The same model can support exploratory analysis and more controlled, method-driven work, if the ETL and vocabulary practices are disciplined.

Community governance is part of the model

New implementers often focus on the tables and miss the harder part. OMOP is maintained as a public standard by a community that revises conventions, vocabulary practices, and implementation guidance under real production pressure. That is one reason the model has held up across claims, EHR, registry, and mixed-source environments.

The version history also matters operationally. Different ETL assumptions, field availability, and vocabulary dependencies can change with CDM releases. A team that says it is "on OMOP" still has to answer basic questions: which CDM version, which vocabulary release, what local conventions, and how historical mappings were carried forward. Those answers affect reproducibility more than the schema diagram does.

Practical rule: Treat OMOP as a research operating model, not just a database schema.

I tell new team members to expect the core work after the first load succeeds. The hard part is keeping mappings explainable, vocabulary updates controlled, and ETL logic versioned so another analyst, or your future self, can understand why a concept landed where it did. Teams that handle that discipline well get reusable analytics. Teams that skip it end up with a technically valid OMOP instance that nobody fully trusts.

Anatomy of the OHDSI Data Model Tables

When people first inspect OMOP, they often fixate on the table count. That's the wrong way to learn it. Start with the patient journey, then map tables to that journey.

As of version 5.3, the CDM includes 32 core tables covering demographics, visits, conditions, procedures, drugs, measurements, and observations, as described in OHDSI's data standardization documentation. Think of those tables as coordinated layers rather than isolated objects.

[Figure: Anatomy of the OHDSI OMOP Common Data Model, showing its main table categories.]

Domain tables hold the clinical story

The center of OMOP is the event model. These are the tables analysts reach for first because they capture what happened to a person over time.

Some of the most important ones are:

  • PERSON: The patient anchor. Demographics live here, and nearly every clinical table joins back to it.
  • CONDITION_OCCURRENCE: Recorded diseases, diagnoses, and clinical conditions.
  • DRUG_EXPOSURE: Medication dispensing, prescribing, or administration events.
  • PROCEDURE_OCCURRENCE: Procedures, interventions, and operationally coded acts.
  • MEASUREMENT: Labs, quantitative findings, and test results.
  • OBSERVATION: Clinical facts that don't fit neatly into the other domains.
  • DEATH: Mortality information when available.

A new implementer should notice one thing right away. OMOP doesn't try to force every source field into one “catch-all” event table. It separates domains because different analyses care about different semantics.

Health system tables add operational context

Clinical events without context become misleading fast. OMOP uses supporting tables to show where and how the data was produced.

The most important contextual tables are:

Table | Why it matters
VISIT_OCCURRENCE | Ties events to encounters and care episodes
PROVIDER | Supports clinician-level attribution where available
CARE_SITE | Distinguishes facilities, departments, or organizational units
LOCATION | Adds geographic or administrative context

Without these, event-level facts become detached from the care setting. That's when inpatient, outpatient, emergency, and specialty workflows start to blur in ways that break study logic.

Vocabulary and metadata tables make the model usable

The event tables get most of the attention, but they aren't the reason OMOP works across institutions. The vocabulary tables are.

Tables such as CONCEPT, CONCEPT_RELATIONSHIP, and related vocabulary assets tell you what a code means, whether it's standard, how it maps, and how it sits in a hierarchy. Metadata and administrative tables support provenance, eras, observation windows, and implementation conventions.

If you load the domain tables without a disciplined vocabulary layer, you've built an OMOP-shaped database, not an analytically reliable OMOP instance.

A practical way to teach the model is to walk one patient through it. PERSON identifies who. VISIT_OCCURRENCE identifies the encounter. CONDITION_OCCURRENCE captures the diagnosis. DRUG_EXPOSURE records treatment. MEASUREMENT stores labs. OBSERVATION fills the semantic gaps. The vocabulary system standardizes the meaning of every coded element across that chain.
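
The walk-through above can be sketched with an in-memory SQLite database. This is a minimal illustration, not a conformant CDM build: the tables carry only a handful of the real columns, the single patient is invented, and the concept IDs are used purely as sample values. `patient_timeline` is a hypothetical helper name.

```python
import sqlite3

# Tiny OMOP-shaped schema: a subset of real column names, nothing more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY, gender_concept_id INTEGER, year_of_birth INTEGER);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER,
    visit_concept_id INTEGER, visit_start_date TEXT);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER,
    visit_occurrence_id INTEGER, condition_concept_id INTEGER,
    condition_start_date TEXT, condition_source_value TEXT);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER PRIMARY KEY, person_id INTEGER,
    visit_occurrence_id INTEGER, drug_concept_id INTEGER,
    drug_exposure_start_date TEXT);

-- One invented patient; concept IDs are sample values for illustration.
INSERT INTO person VALUES (1, 8532, 1961);
INSERT INTO visit_occurrence VALUES (10, 1, 9202, '2025-03-04');
INSERT INTO condition_occurrence VALUES (100, 1, 10, 201826, '2025-03-04', 'E11.9');
INSERT INTO drug_exposure VALUES (1000, 1, 10, 1503297, '2025-03-04');
""")


def patient_timeline(conn, person_id):
    """Return (domain, event_date, concept_id, visit_occurrence_id) rows for one person."""
    return conn.execute(
        """
        SELECT 'condition' AS domain, condition_start_date AS event_date,
               condition_concept_id AS concept_id, visit_occurrence_id
        FROM condition_occurrence WHERE person_id = ?
        UNION ALL
        SELECT 'drug', drug_exposure_start_date, drug_concept_id, visit_occurrence_id
        FROM drug_exposure WHERE person_id = ?
        ORDER BY event_date, domain
        """,
        (person_id, person_id),
    ).fetchall()


timeline = patient_timeline(conn, 1)
```

Every event row carries both a person key and a visit key, which is what lets an analyst move from "what happened" to "in which encounter" without touching source systems.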

The Critical Role of Standardized Vocabularies

A team can finish the table build, load millions of rows, and still end up with an OMOP instance that no analyst trusts. The failure point is usually vocabulary management. Diagnoses arrive in ICD-10-CM, medications in NDC, labs in local codes, procedures in a mix of billing and internal terminologies. If those codes are not resolved consistently to OMOP standard concepts, every downstream artifact is unstable, including cohorts, phenotypes, incidence estimates, and site-to-site comparisons.

The OMOP vocabulary layer gives the model its shared meaning. The CONCEPT table defines the codes and their status. CONCEPT_RELATIONSHIP and the related hierarchy tables connect source codes to standard concepts, domains, ancestors, and valid mappings. That structure is what lets an implementation translate a source diagnosis into a standard SNOMED concept, expand a concept set correctly, or determine whether a code is valid for a target table.
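
To make the source-to-standard translation concrete, here is a small sketch of a `Maps to` lookup against CONCEPT and CONCEPT_RELATIONSHIP. The table and column names follow the OMOP vocabulary tables, but the two rows and their concept IDs (900001, 900002) are placeholders invented for the example, and `to_standard` is a hypothetical helper.

```python
import sqlite3

vocab = sqlite3.connect(":memory:")
vocab.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY, concept_name TEXT, domain_id TEXT,
    vocabulary_id TEXT, concept_code TEXT, standard_concept TEXT, invalid_reason TEXT);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT, invalid_reason TEXT);

-- Placeholder IDs: one non-standard source concept, one standard target.
INSERT INTO concept VALUES
    (900001, 'Type 2 diabetes mellitus without complications',
     'Condition', 'ICD10CM', 'E11.9', NULL, NULL),
    (900002, 'Type 2 diabetes mellitus',
     'Condition', 'SNOMED', '44054006', 'S', NULL);
INSERT INTO concept_relationship VALUES (900001, 900002, 'Maps to', NULL);
""")


def to_standard(conn, vocabulary_id, concept_code):
    """Follow valid 'Maps to' links from a source code to standard concepts."""
    return conn.execute(
        """
        SELECT target.concept_id, target.concept_name, target.domain_id
        FROM concept AS source
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = source.concept_id
         AND cr.relationship_id = 'Maps to'
         AND cr.invalid_reason IS NULL
        JOIN concept AS target
          ON target.concept_id = cr.concept_id_2
         AND target.standard_concept = 'S'
         AND target.invalid_reason IS NULL
        WHERE source.vocabulary_id = ? AND source.concept_code = ?
        """,
        (vocabulary_id, concept_code),
    ).fetchall()


mapped = to_standard(vocab, "ICD10CM", "E11.9")
unmapped = to_standard(vocab, "ICD10CM", "NOT-A-CODE")
```

The function returns a list because a source code can map to more than one standard concept. An empty list is the signal that the ETL needs an explicit fallback.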

Standard concepts are the contract

A source code can be correct in the local system and still be the wrong analytic target.

In OMOP, source values and standard values serve different purposes. The source code preserves provenance. The standard concept carries the meaning used for analysis. That split is easy to explain and hard to enforce, especially when teams are under delivery pressure and decide to load raw billing codes directly into domain tables. The short-term gain is speed. The long-term cost is rework.

The common failure modes are predictable:

  • Phenotype drift: A cohort definition returns different patients at different sites because each site interpreted the same source vocabulary differently.
  • Mapping debt: Analysts and ETL developers keep solving the same code translation problems project by project.
  • Opaque results: Reviewers cannot tell whether variation reflects clinical reality, source system behavior, or a bad mapping choice.

A practical starting point is to document mapping intent explicitly: source code system, target standard concept, domain, valid date range, and fallback behavior when no standard mapping exists. Teams that need a concrete reference can use this guide to vocabulary concept maps.
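
One way to capture that mapping intent is a small, immutable record per source code. The sketch below is an assumption about shape, not a standard OHDSI structure: `MappingIntent`, its field names, and the `fallback` label are all hypothetical. The one convention it leans on is that OMOP uses concept ID 0 for events with no standard mapping.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

UNMAPPED_CONCEPT_ID = 0  # OMOP convention for "no matching standard concept"


@dataclass(frozen=True)
class MappingIntent:
    """Documents one source-to-standard mapping decision (hypothetical structure)."""
    source_vocabulary: str
    source_code: str
    target_concept_id: Optional[int]   # None when no standard mapping exists
    target_domain: Optional[str]
    valid_from: date
    valid_to: date
    fallback: str = "load_with_concept_id_0"

    def resolve(self, event_date: date) -> int:
        """Return the target concept if the mapping applies on event_date, else 0."""
        in_range = self.valid_from <= event_date <= self.valid_to
        if self.target_concept_id is not None and in_range:
            return self.target_concept_id
        return UNMAPPED_CONCEPT_ID


# Sample values for illustration only.
diabetes = MappingIntent("ICD10CM", "E11.9", 201826, "Condition",
                         date(1970, 1, 1), date(2099, 12, 31))
local_code = MappingIntent("LOCAL_BILLING", "XYZ-17", None, None,
                           date(2015, 1, 1), date(2099, 12, 31))
```

Keeping the fallback on the record itself means a reviewer can see what happens to unmapped codes without reading ETL SQL.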

The self-hosted ATHENA pattern is valid, but it creates ongoing work

Many organizations load ATHENA vocabularies into a local PostgreSQL database and point ETL jobs at those tables. That model works. In regulated or air-gapped environments, it may be the only acceptable option.

It also creates a maintenance stream that often gets underestimated. Someone has to load each vocabulary release, validate what changed, rebuild indexes, test impacted mappings, and make sure every ETL job is resolving concepts against the same snapshot. Search and terminology services add another layer. If users want fuzzy search, autocomplete, hierarchy traversal, or a programmatic API for concept resolution, the team has to build and support those capabilities.

That leads to architecture decisions that should be made early, not discovered halfway through ETL.

A practical comparison

The table below reflects the trade-offs teams usually face when they operationalize OMOP vocabularies.

Capability | Self-hosted ATHENA | OMOPHub
Setup time | 1–2 days | 5 minutes (get an API key)
Vocabulary updates | Manual re-download & re-load every ~6 months | Automatic, synced with ATHENA
Full-text / semantic / autocomplete search | Build your own | Built-in
REST API, Python SDK, R SDK, MCP server | Build your own | Included
FHIR Terminology Service | Build your own / deploy Snowstorm | Built-in
FHIR Concept Resolver (Coding → OMOP + CDM table) | Not a standard OHDSI tool | Built-in (POST /v1/fhir/resolve)
Infrastructure cost | $150–400/month (DB + compute) | Free tier; paid tiers for volume
Maintenance burden | Ongoing | Zero

The right choice depends on constraints. Self-hosting is often appropriate for local extensions, strict outbound network controls, or environments where terminology assets must stay inside a controlled boundary. API-first tooling is often better when the bottleneck is ETL throughput, shared concept resolution, or keeping multiple teams on the same vocabulary version.

OMOPHub is one option for teams that want direct, programmatic access to OHDSI vocabulary content through REST and FHIR interfaces instead of operating a local terminology stack. It supports search, hierarchy traversal, code translation, and FHIR terminology functions. The public concept browser at OMOPHub Concept Lookup is also useful during mapping design reviews.

Field note: Vocabulary problems rarely stay isolated in the vocabulary layer. A bad concept choice in ETL becomes a cohort bug, a QA exception, or a study reproducibility problem later. The teams that succeed treat vocabulary operations as production infrastructure, not reference data sitting off to the side.

Practical ETL and Data Mapping Workflows

The ETL phase is where OMOP projects either become repeatable or become fragile. Most implementation pain comes from treating mapping logic as a one-time conversion task instead of a maintained software asset.

A recurring gap in public guidance is how to keep mappings current and operationalized over time. The challenge is described clearly in Data4Life's discussion of OMOP implementation gaps: organizations often script ad hoc mappings that become brittle as source vocabularies change, which leaves teams without versioned, queryable, automated mapping pipelines.

[Figure: An ETL process transforming various data types into the OMOP Common Data Model.]

A workflow that holds up in production

A practical OMOP ETL usually has four parts.

  1. Characterize the source first. Profile cardinality, null patterns, code-system spread, and date behavior before you write target logic.
  2. Write mapping specifications by target table. Define table, column, source field, transformation rule, vocabulary rule, and fallback behavior.
  3. Separate structural transforms from semantic mapping. Table loading logic and concept resolution shouldn't be tangled together in one opaque SQL script.
  4. Version everything. That includes the ETL code, the mapping tables, the vocabulary release, and the data-quality results.
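
Step 3 is the one teams most often skip, so here is a minimal sketch of what the separation can look like. The structural function knows only about column shape, and the semantic decision is injected as a resolver. The source field names (`patient_id`, `dx_date`, `dx_system`, `dx_code`), both function names, and the sample concept ID are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# A resolver takes (code system, code) and returns a standard concept ID.
ConceptResolver = Callable[[str, str], int]


def make_table_resolver(mapping: Dict[Tuple[str, str], int]) -> ConceptResolver:
    """Semantic layer: resolve codes from a versioned mapping table, 0 if unmapped."""
    def resolve(system: str, code: str) -> int:
        return mapping.get((system, code), 0)
    return resolve


def build_condition_row(source_row: dict, resolve: ConceptResolver) -> dict:
    """Structural layer: shape a source record into CONDITION_OCCURRENCE columns."""
    return {
        "person_id": source_row["patient_id"],
        "condition_start_date": source_row["dx_date"],
        "condition_source_value": source_row["dx_code"],  # provenance is preserved
        "condition_concept_id": resolve(source_row["dx_system"], source_row["dx_code"]),
    }


# Sample mapping table; in practice this would be loaded from a versioned asset.
resolver = make_table_resolver({("ICD10CM", "E11.9"): 201826})
row = build_condition_row(
    {"patient_id": 1, "dx_date": "2025-03-04", "dx_system": "ICD10CM", "dx_code": "E11.9"},
    resolver,
)
```

Because the resolver is just a callable, the same structural code can run against a local mapping table in tests and a terminology service in production.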

Different tables require different mapping discipline. CONDITION_OCCURRENCE often needs careful provenance handling. DRUG_EXPOSURE usually needs stronger attention to ingredient versus product semantics. MEASUREMENT needs unit handling and value-as-concept patterns that analysts can trust later.
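
For MEASUREMENT, unit handling is the part worth prototyping early. The sketch below assumes a per-analyte conversion table and keeps the source unit alongside the normalized value. `normalize_measurement`, `normalized_unit`, and `needs_review` are hypothetical names, while `value_as_number` and `unit_source_value` mirror real MEASUREMENT columns. The glucose factor comes from its molar mass of roughly 180.16 g/mol.

```python
# Conversion factors are analyte-specific, so the analyte is part of the key.
CONVERSIONS = {
    ("glucose", "mg/dL", "mmol/L"): 1 / 18.0156,
}


def normalize_measurement(analyte, value, unit, target_unit):
    """Express value in target_unit when a known conversion exists."""
    if unit == target_unit:
        factor = 1.0
    else:
        factor = CONVERSIONS.get((analyte, unit, target_unit))

    if factor is None:
        # No known conversion: keep the source value untouched and flag it.
        return {
            "value_as_number": value,
            "normalized_unit": unit,
            "unit_source_value": unit,
            "needs_review": True,
        }
    return {
        "value_as_number": round(value * factor, 2),
        "normalized_unit": target_unit,
        "unit_source_value": unit,
        "needs_review": False,
    }


converted = normalize_measurement("glucose", 126.0, "mg/dL", "mmol/L")
flagged = normalize_measurement("glucose", 126.0, "g/L", "mmol/L")
```

The important design choice is the refusal to guess: an unrecognized unit passes through unchanged with a review flag rather than being silently converted.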

Resolve codes as a service, not a spreadsheet

A lot of teams still keep source-to-standard mappings in manually curated files with little lineage. That's tolerable for a pilot. It doesn't scale.

If your source arrives as FHIR coding, or you can normalize source values into FHIR-like system and code pairs, you can turn concept resolution into a service call rather than a custom join sequence. That reduces local vocabulary plumbing and makes mapping behavior easier to test.

This is the kind of example worth validating directly against the OMOPHub API and LLM documentation:

curl example for resolving a SNOMED code to an OMOP standard concept and CDM target table:

curl -X POST "https://api.omophub.com/v1/fhir/resolve" \
  -H "Authorization: Bearer oh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"system": "http://snomed.info/sct", "code": "44054006", "resource_type": "Condition"}'

That pattern is useful because the response can drive both the concept mapping and the target-domain decision in one step, with Maps to traversal handled server-side.
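
For ETL code that lives in Python, the same request can be built with the standard library. This sketch mirrors the curl call above field for field and deliberately makes no assumptions about the response schema, which should be checked against the OMOPHub documentation. `build_resolve_request` and `send` are hypothetical helper names, and nothing is sent over the network unless `send` is called.

```python
import json
import urllib.request

RESOLVE_URL = "https://api.omophub.com/v1/fhir/resolve"


def build_resolve_request(system, code, resource_type, api_key):
    """Build the POST request with the same body and headers as the curl example."""
    payload = {"system": system, "code": code, "resource_type": resource_type}
    return urllib.request.Request(
        RESOLVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and return the parsed JSON body as-is."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_resolve_request(
    "http://snomed.info/sct", "44054006", "Condition", "oh_your_api_key"
)
```

Separating request construction from sending makes the call easy to unit test without network access or a live key.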

For teams working with older billing systems or migration projects, this ICD conversion article is a useful example of where cross-vocabulary translation needs to be handled carefully instead of guessed.

Tips that save time later

  • Keep a mapping registry: Store source code, source description, selected standard concept, rationale, reviewer, and effective vocabulary version.
  • Test edge concepts early: Rare oncology drugs, deprecated procedure codes, and local panels will expose weaknesses faster than common diagnoses.
  • Use SDKs where they reduce glue code: The Python SDK, R SDK, and MCP server are practical when your ETL or analyst tooling already lives in those environments.
  • Check implementation examples before coding: The OMOPHub docs are useful for concrete request and response patterns.

QA, Versioning, and Common Pitfalls

A team usually feels confident right after the first full load. The tables exist, row counts look close, and the validation tools return more green than red. Then the first serious study starts, and someone asks which vocabulary release was used for the drug mappings, whether a deprecated concept slipped into production, or why the same source code landed in different domains across two ETL runs. That is when QA and version control stop being administrative work and become part of the data architecture.

An OMOP instance is trustworthy only when the team can explain what loaded, why it loaded that way, and which CDM and vocabulary versions were in effect. Schema conformance is only the starting point. Reproducibility depends on recording the model version, the vocabulary snapshot, the ETL code release, and any local mapping overrides that changed behavior.

[Figure: A data workflow with Input Data, Validation, Versioning, and Accuracy steps.]

What to validate every time

Good OMOP QA checks structure and meaning. Achilles and the Data Quality Dashboard help, but they do not know your local source quirks, your custom mapping decisions, or which compromises were made during ETL. Teams need that context documented before they trust automated checks.

A practical QA pass should include:

  • Row-level completeness: Did each expected source event land in the right target table, or at least in a reviewed fallback path?
  • Concept validity: Are standard concept IDs active, domain-appropriate, and still valid for the vocabulary version in production?
  • Date logic: Do event dates fit visit timing, person timelines, and observation periods?
  • Distribution checks: Do counts, units, ranges, and null patterns look plausible by source system and site?
  • Version traceability: Can the team reproduce the exact result with the same ETL code, vocabulary release, and mapping registry?
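
Two of the checks above, concept validity and date logic, are easy to express as queries. The following is a small sketch on invented data, intended to complement tools like Achilles and the Data Quality Dashboard rather than replace them. The concept IDs are placeholders and the function names are hypothetical.

```python
import sqlite3

qa = sqlite3.connect(":memory:")
qa.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY, domain_id TEXT,
    standard_concept TEXT, invalid_reason TEXT);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER,
    condition_concept_id INTEGER, condition_start_date TEXT);
CREATE TABLE observation_period (
    person_id INTEGER, observation_period_start_date TEXT,
    observation_period_end_date TEXT);

-- Placeholder concepts: 1 is fine, 2 is in the wrong domain, 3 is deprecated.
INSERT INTO concept VALUES
    (1, 'Condition', 'S', NULL), (2, 'Measurement', 'S', NULL), (3, 'Condition', NULL, 'D');
INSERT INTO condition_occurrence VALUES
    (100, 1, 1, '2025-01-10'), (101, 1, 2, '2025-01-11'),
    (102, 1, 3, '2025-01-12'), (103, 1, 1, '2031-01-01');
INSERT INTO observation_period VALUES (1, '2020-01-01', '2026-01-01');
""")


def concept_problems(conn):
    """Flag condition rows whose concept is unknown, invalid, non-standard, or off-domain."""
    return conn.execute("""
        SELECT co.condition_occurrence_id,
               CASE WHEN c.concept_id IS NULL THEN 'unknown concept'
                    WHEN c.invalid_reason IS NOT NULL THEN 'invalid concept'
                    WHEN COALESCE(c.standard_concept, '') <> 'S' THEN 'non-standard concept'
                    ELSE 'wrong domain' END AS problem
        FROM condition_occurrence co
        LEFT JOIN concept c ON c.concept_id = co.condition_concept_id
        WHERE c.concept_id IS NULL
           OR c.invalid_reason IS NOT NULL
           OR COALESCE(c.standard_concept, '') <> 'S'
           OR c.domain_id <> 'Condition'
        ORDER BY co.condition_occurrence_id
    """).fetchall()


def events_outside_observation(conn):
    """Flag condition rows that fall outside every observation period for the person."""
    return conn.execute("""
        SELECT co.condition_occurrence_id
        FROM condition_occurrence co
        WHERE NOT EXISTS (
            SELECT 1 FROM observation_period op
            WHERE op.person_id = co.person_id
              AND co.condition_start_date BETWEEN op.observation_period_start_date
                                              AND op.observation_period_end_date)
        ORDER BY co.condition_occurrence_id
    """).fetchall()


problems = concept_problems(qa)
out_of_window = events_outside_observation(qa)
```

Checks like these are cheap enough to run on every load, which is the point: a check that only runs before a study is a check that finds problems too late.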

Teams building that discipline usually benefit from a practical guide to OMOP data quality checking.

Common failure modes

The failures that hurt most are usually quiet.

Pitfall | What it breaks
Wrong domain assignment | Events land in tables that change cohort logic or downstream counts
Lossy source mapping | Clinical detail disappears, often permanently, during standardization
Unsynchronized vocabulary releases | Sites produce different results because concept relationships changed at different times
Performance-blind ETL SQL | Reloads, regression testing, and QA runs become slow enough that teams stop running them often

Wrong domain assignment is a classic example. A source code may look like a diagnosis to the ETL developer, but vocabulary relationships may point to Observation or Measurement depending on context. If that decision is hard-coded too early, analysts inherit a semantic error that no structural validator will catch.
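
A common defense is to let the standard concept's domain pick the target table instead of the developer's intuition about the source field. The sketch below assumes a simple lookup covering a few domains. The dictionary is deliberately incomplete, and `target_table`, `route_event`, and the `needs_review` label are hypothetical.

```python
from typing import Optional

# Partial, illustrative routing from a standard concept's domain_id to a CDM table.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Measurement": "measurement",
    "Observation": "observation",
    "Device": "device_exposure",
}


def target_table(standard_domain_id: str) -> Optional[str]:
    """Return the CDM table for a domain, or None when the domain is not routed."""
    return DOMAIN_TO_TABLE.get(standard_domain_id)


def route_event(source_field: str, standard_domain_id: str) -> str:
    """Route by vocabulary domain; unknown domains go to review instead of a guess."""
    table = target_table(standard_domain_id)
    return table if table is not None else "needs_review"


# A code that arrived in a diagnosis field but resolves to a Measurement concept.
routed = route_event("diagnosis_code", "Measurement")
```

The `source_field` argument is intentionally unused in the decision. It is kept in the signature only so the mismatch between where a code came from and where it lands can be logged.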

Vocabulary drift causes a different class of problem. One team refreshes vocabularies quarterly, another updates once a year, and both believe they are using OMOP correctly. Then concept replacements, hierarchy changes, or validity date shifts alter rollups and concept set behavior. Cross-site disagreement follows, even when the SQL is identical. The fix is simple in principle and tedious in practice. Treat vocabulary releases like application dependencies, pin versions, test upgrades in a staging environment, and record the effective version in every ETL run.
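
Pinning can be enforced with a guard at the start of every ETL run. This sketch relies on the ATHENA convention that the VOCABULARY table holds the release label on the row with `vocabulary_id = 'None'`, which is worth confirming against your own load. The version string is an invented sample and the function names are hypothetical.

```python
import sqlite3

PINNED_VERSION = "v5.0 30-AUG-24"  # sample label; pin whatever your team has approved


def loaded_vocabulary_version(conn):
    """Read the release label from the VOCABULARY table, or None if absent."""
    row = conn.execute(
        "SELECT vocabulary_version FROM vocabulary WHERE vocabulary_id = 'None'"
    ).fetchone()
    return row[0] if row else None


def require_pinned_vocabulary(conn, expected):
    """Stop the run when the loaded vocabulary differs from the pinned release."""
    actual = loaded_vocabulary_version(conn)
    if actual != expected:
        raise RuntimeError(f"vocabulary mismatch: expected {expected!r}, found {actual!r}")
    return actual


# Stand-in for a real vocabulary database.
vocab_db = sqlite3.connect(":memory:")
vocab_db.executescript("""
CREATE TABLE vocabulary (vocabulary_id TEXT, vocabulary_name TEXT, vocabulary_version TEXT);
INSERT INTO vocabulary VALUES ('None', 'OMOP Standardized Vocabularies', 'v5.0 30-AUG-24');
""")

effective_version = require_pinned_vocabulary(vocab_db, PINNED_VERSION)
```

Writing `effective_version` into the run log is what makes a later "which release produced this number?" question answerable.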

Passing DQD does not prove that the semantic choices were sound.

Performance matters too. Large vocabulary joins, concept ancestor expansion, and full-table scans can make validation and reload cycles painfully slow. Analysts often describe this as "OMOP is slow." In practice, the root cause is usually implementation design: missing indexes, poor partition strategy, repeated expansion logic, or ETL code that recomputes the same mappings every run. Mature teams cache expensive lookups, materialize helper tables where appropriate, and treat rerun time as a QA requirement, not an infrastructure detail.
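
Caching is the cheapest of those fixes to demonstrate. In the sketch below, `slow_lookup` stands in for an expensive vocabulary join or service call, and the mapping and concept ID are sample values. Source data typically repeats a small set of codes many times, so the cache collapses thousands of lookups into a handful.

```python
from functools import lru_cache

lookup_calls = 0
_SAMPLE_MAPPING = {("ICD10CM", "E11.9"): 201826}  # sample values for illustration


def slow_lookup(vocabulary_id, code):
    """Stand-in for an expensive vocabulary join or remote call."""
    global lookup_calls
    lookup_calls += 1
    return _SAMPLE_MAPPING.get((vocabulary_id, code), 0)


@lru_cache(maxsize=100_000)
def cached_lookup(vocabulary_id, code):
    """Resolve each distinct (vocabulary, code) pair only once per process."""
    return slow_lookup(vocabulary_id, code)


# 1,500 source rows, but only two distinct codes.
source_codes = [("ICD10CM", "E11.9")] * 1000 + [("ICD10CM", "Z99.99")] * 500
resolved = [cached_lookup(v, c) for v, c in source_codes]
```

The cache has to be cleared with `cached_lookup.cache_clear()` whenever the vocabulary snapshot changes. Otherwise the speedup turns into a source of stale mappings.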

Leveraging the OMOP CDM for Analysis

Once the ETL and vocabulary work is stable, the payoff is real. Analysts can define concept sets once, build cohorts with less site-specific rewriting, and move from exploratory counting to defensible observational research.

The value isn't only technical uniformity. It's shared analytical behavior. A condition definition, a drug exposure logic, or a measurement-based phenotype can be reviewed once and executed repeatedly with much less ambiguity than in raw source systems.

What becomes easier

A well-maintained OMOP environment supports work that is otherwise expensive to repeat:

  • Cohort construction: Standard concepts and hierarchies make inclusion logic more portable.
  • Phenotype development: Researchers can inspect related concepts systematically instead of chasing local code lists.
  • Multi-site studies: Federated methods become feasible because local data conforms to common rules.
  • Real-world evidence pipelines: Teams spend more time on study design and less on recoding source quirks.
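
Cohort portability comes largely from expanding a concept through CONCEPT_ANCESTOR instead of enumerating codes by hand. Here is a minimal sketch on invented data: concept IDs 1, 2, 3, and 9 are placeholders, and `cohort_entry` is a hypothetical helper. It relies on the OMOP convention that every concept appears in CONCEPT_ANCESTOR as its own ancestor.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);

-- Placeholder hierarchy: 1 is a parent, 2 and 3 are descendants, 9 is unrelated.
INSERT INTO concept_ancestor VALUES (1, 1), (1, 2), (1, 3), (9, 9);
INSERT INTO condition_occurrence VALUES
    (10, 2, '2024-05-01'), (10, 3, '2023-02-01'),
    (11, 9, '2024-01-01'), (12, 1, '2025-06-30');
""")


def cohort_entry(conn, ancestor_concept_id):
    """Return (person_id, first qualifying date) for a concept and all its descendants."""
    return conn.execute(
        """
        SELECT co.person_id, MIN(co.condition_start_date) AS cohort_start_date
        FROM condition_occurrence co
        JOIN concept_ancestor ca
          ON ca.descendant_concept_id = co.condition_concept_id
        WHERE ca.ancestor_concept_id = ?
        GROUP BY co.person_id
        ORDER BY co.person_id
        """,
        (ancestor_concept_id,),
    ).fetchall()


cohort = cohort_entry(db, 1)
```

The query contains no site-specific code list, which is why the same definition can run at another site whose source data used entirely different codes.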

That's the practical lesson I'd want a new team member to absorb early. The OHDSI data model is not a paperwork exercise for interoperability committees. It's a disciplined way to turn heterogeneous clinical data into a research platform that can survive scrutiny, reruns, and collaboration.


If your team is struggling more with vocabulary operations than with the OMOP schema itself, OMOPHub is worth evaluating as a practical way to query ATHENA-aligned vocabularies, resolve FHIR codes to OMOP concepts, and reduce the local infrastructure burden around terminology-heavy ETL.
