A Practical Guide to Healthcare Common Data Models

Dr. Rachel Green
March 5, 2026
23 min read

If you’ve ever tried to combine patient data from two different hospitals, you know the feeling. It’s like trying to assemble a puzzle where the pieces come from different boxes—they just don’t fit. One hospital calls a heart attack a "myocardial infarction" with code X, while another uses "acute MI" with code Y. This is the core challenge of healthcare data: it’s siloed, inconsistent, and incredibly difficult to analyze at scale.

This is where a common data model (CDM) comes in. Think of it as the shared language that finally allows all those different puzzle pieces to connect. It standardizes the chaos into a single, coherent format built for analysis.

Why Common Data Models Are Central to Modern Healthcare Analytics

Two people connect EHR and Research puzzle pieces over a globe, symbolizing data integration.

Healthcare organizations are swimming in data from electronic health records (EHRs), billing systems, pharmacy databases, and lab reports. The problem is, each system speaks its own language, using proprietary structures and terminologies. This creates massive roadblocks for any kind of large-scale research. In fact, a successful healthcare analytics implementation almost always begins with establishing a CDM.

A common data model is more than just a database schema. It's a shared philosophy—a collective agreement on how to structure, define, and connect information. It’s what turns messy, raw data into a reliable, research-ready asset.

From Disparate Data to Actionable Discovery

You can think of a CDM as the architectural blueprint for your data warehouse. It lays out a standard set of tables (like person, condition_occurrence, drug_exposure), fields, and the relationships between them. Leading CDMs like OMOP, PCORnet, and i2b2 each offer a proven framework for mapping your source data into a single, consistent format.

Once your data conforms to this model, you can do some truly powerful things:

  • Enable Federated Research: Researchers can execute the same analysis script across datasets from dozens of institutions without ever moving or seeing the raw patient data. This protects privacy while dramatically increasing study power.
  • Speed Up Analysis: With data already harmonized, your team spends far less time on tedious data cleaning and preparation. You can get straight to asking the important questions and finding answers.
  • Power AI and Machine Learning: Reliable AI models demand clean, standardized data. A CDM provides the structured foundation needed to train algorithms for everything from disease prediction to identifying treatment patterns.

Getting Started with Your CDM Implementation

This guide is for the people in the trenches: the data engineers, researchers, and IT leaders tasked with making this happen. We'll walk through the practical decisions you'll face, from choosing the right model to navigating the tough but critical ETL and vocabulary mapping work.

A huge piece of the puzzle is vocabulary mapping—the process of translating all your local, source-specific codes into the CDM’s standard terminologies. Trying to manage this manually in spreadsheets is a recipe for error and burnout. Thankfully, modern tools can help.

Tip: Before you write a single line of ETL code, get a feel for the vocabularies. Use an interactive tool like the OMOPHub Concept Lookup to search for your local codes and see how they map to the OMOP standards. This gives you a realistic preview of the mapping work ahead and helps you scope the project accurately.

Choosing Your Model: A Landscape of Healthcare CDMs

Picking a common data model is one of the most foundational decisions your organization will make. This isn't just a technical detail; it's a strategic move that will define your research capabilities, shape your data workflows, and dictate your ability to collaborate with others for years. At its heart, this choice is about designing effective software architecture that can grow with your long-term goals.

When you survey the landscape, you'll find a handful of major models, each with its own philosophy, structure, and sweet spot. To make the right call, you have to look past the marketing and get to the core design principles of each one. Let's break down the big four—OMOP, PCORnet, i2b2, and Sentinel—to see how they stack up.

OMOP: The Patient-Centric Standard

The Observational Medical Outcomes Partnership (OMOP) CDM, championed by the OHDSI community, has become the dominant force in observational research. Its entire philosophy is built on being patient-centric and longitudinal. Every single data point, whether it's a diagnosis, a lab result, or a prescription, is tied directly to a patient and placed on their unique timeline.

This design is what makes OMOP so incredibly powerful for large-scale studies where researchers need to track patient journeys over extended periods, often across different hospitals or even countries.

  • Structure: It’s a relational database with a person table at its core. From there, everything connects to event tables like condition_occurrence, drug_exposure, and visit_occurrence.
  • Vocabulary: OMOP's biggest strength is its strict use of standardized vocabularies. It forces you to map all your local data to robust terminologies like SNOMED CT, RxNorm, and LOINC, which are all managed through the ATHENA repository.
  • Use Case: It's built for population-level analytics, safety surveillance, and comparative effectiveness research, especially within global networks.
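To make the patient-centric layout concrete, here is a minimal sketch using an in-memory SQLite database via Python's standard library. The two tables are drastically simplified versions of the real OMOP definitions, and the rows are invented sample data; the concept IDs shown (201826 for type 2 diabetes mellitus, 4329847 for myocardial infarction) are standard OMOP concepts commonly used in OHDSI examples.

```python
import sqlite3

# Minimal, illustrative subset of the OMOP CDM: a person table
# plus one event table, condition_occurrence, keyed by person_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER,
    gender_concept_id INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER,    -- standard (e.g. SNOMED) concept
    condition_start_date TEXT
);
""")
conn.execute("INSERT INTO person VALUES (1, 1958, 8507)")
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
    [
        (101, 1, 201826, "2019-03-02"),   # type 2 diabetes mellitus
        (102, 1, 4329847, "2021-07-15"),  # myocardial infarction
    ],
)

# Because every event ties back to a person_id, the patient's
# longitudinal timeline is a simple ordered join.
timeline = conn.execute("""
    SELECT co.condition_start_date, co.condition_concept_id
    FROM person p
    JOIN condition_occurrence co ON co.person_id = p.person_id
    ORDER BY co.condition_start_date
""").fetchall()
print(timeline)
```

This is the design choice that makes OMOP so well suited to longitudinal studies: since every event table carries a person_id, a patient's full record is just an ordered join, which is exactly what large-scale observational queries rely on.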

The rigid vocabulary standard is precisely why OMOP excels in federated research. For a complete breakdown, check out our deep dive on the OMOP data model.

PCORnet: For Patient-Centered Outcomes

The PCORnet (Patient-Centered Outcomes Research Network) CDM was born from a clear mission: to make it easier to conduct research that directly answers questions that matter to patients. It shares some DNA with OMOP but puts a special emphasis on data elements crucial for comparing treatments and including patient-reported outcomes.

PCORnet’s design is pragmatic, aiming for a balance that encourages quick adoption and participation. It allows more flexibility with source vocabularies, which can make the initial ETL work simpler, though it might mean more cleanup during the analysis phase.

The core idea behind PCORnet is to create a "network of networks" that can rapidly answer questions important to patients and clinicians. Its structure is optimized for this purpose, balancing standardization with practicality.

For instance, PCORnet has dedicated tables for patient-reported outcomes and provider details, showing its commitment to capturing a fuller picture of the patient's experience.

i2b2: For Rapid Cohort Discovery

Informatics for Integrating Biology and the Bedside (i2b2) comes at the problem from a totally different angle. Its architecture is a star schema, a design that’s lightning-fast for running queries. This makes i2b2 an incredible tool for cohort discovery—the process of quickly finding groups of patients that fit very specific criteria.

You can think of i2b2 as a high-performance search engine for your clinical data. A researcher can literally drag and drop concepts like "patients with type 2 diabetes" and "taking metformin" to build and refine a patient list in real-time.

  • Structure: A central "fact" table holds the patient events, which is surrounded by "dimension" tables that add context (like patient demographics, concepts, and providers).
  • Vocabulary: While i2b2 can use standard vocabularies, it’s often set up with local, institution-specific ontologies. This offers a lot of flexibility.
  • Use Case: It's perfect for clinical trial recruitment, running feasibility studies, and giving clinicians a way to explore their own data without needing to be programmers.
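As a rough sketch of why the star schema makes cohort discovery fast, the toy example below builds a tiny fact table plus dimension tables in SQLite. The concept codes are invented for illustration and are not real i2b2 ontology paths.

```python
import sqlite3

# Toy star schema in the spirit of i2b2: one central observation_fact
# table surrounded by dimension tables that add context.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name_char TEXT);
CREATE TABLE observation_fact (
    patient_num INTEGER,
    concept_cd TEXT,
    start_date TEXT
);
""")
conn.executemany("INSERT INTO patient_dimension VALUES (?, ?)",
                 [(1, 1950), (2, 1980), (3, 1975)])
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)",
                 [("DX:T2DM", "Type 2 diabetes"), ("RX:METFORMIN", "Metformin")])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)", [
    (1, "DX:T2DM", "2022-01-10"),
    (1, "RX:METFORMIN", "2022-02-01"),
    (2, "DX:T2DM", "2023-05-05"),       # diabetic, not on metformin
    (3, "RX:METFORMIN", "2023-06-06"),  # metformin, no diabetes code
])

# Cohort discovery: patients with a T2DM diagnosis AND a metformin
# exposure -- the intersection of two fact sets.
cohort = [row[0] for row in conn.execute("""
    SELECT patient_num FROM observation_fact WHERE concept_cd = 'DX:T2DM'
    INTERSECT
    SELECT patient_num FROM observation_fact WHERE concept_cd = 'RX:METFORMIN'
""")]
print(cohort)
```

The drag-and-drop query a researcher builds in the i2b2 client conceptually compiles down to set operations like this INTERSECT over the fact table, which is why the answers come back in real time.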

This user-friendly, query-focused design is exactly why i2b2 has been a fixture in academic medical centers for years, where quickly finding patients for studies is a daily necessity.

Sentinel: A Specialized Model for Drug Safety

The Sentinel CDM was created by the U.S. Food and Drug Administration (FDA) with one job in mind: post-market drug and medical product safety surveillance. Its design is lean, mean, and laser-focused on running efficient and repeatable safety analyses across a huge network of data partners.

This is not a general-purpose research tool. Sentinel is stripped down to the essential data elements needed to investigate potential safety signals and nothing more.

Its streamlined nature allows the FDA to quickly execute queries to check if a new drug is linked to a higher risk of a specific side effect. This focus ensures that when public health is on the line, the analyses are consistent, transparent, and fast.

Comparative Analysis of Major Healthcare Common Data Models

Choosing a CDM is a significant commitment. The table below offers a side-by-side comparison to help you align your organizational goals with the model that best supports them, whether your priority is global collaboration, rapid cohort building, or regulatory reporting.

| Feature | OMOP CDM | PCORnet CDM | i2b2 | Sentinel CDM |
| --- | --- | --- | --- | --- |
| Primary Goal | Large-scale observational research and evidence generation | Patient-centered comparative effectiveness research | Cohort discovery, hypothesis generation, and clinical trial recruitment | Post-market drug and medical product safety surveillance |
| Core Structure | Relational, patient-centric model with event tables | Patient-centric model, includes patient-reported outcomes | Star schema optimized for fast queries and cohort building | Highly curated, streamlined model focused on essential safety data |
| Vocabulary | Mandatory mapping to standardized vocabularies (SNOMED, RxNorm, LOINC) | Flexible; supports source codes but encourages mapping to standards | Flexible; often uses local or institution-specific ontologies, but can support standards | Highly standardized and curated vocabularies specific to safety analysis |
| Community/Network | OHDSI (Observational Health Data Sciences and Informatics), a large, active global community | PCORnet, a national network of health systems and patient groups in the U.S. | i2b2 tranSMART Foundation, widely adopted in academic medical centers | Sentinel Initiative, a national network coordinated by the U.S. FDA |
| Best For... | Federated network studies, population health analytics, developing and validating predictive models | Rapid-cycle research, studies requiring patient-reported outcomes, and answering questions relevant to clinical practice | Empowering clinicians and researchers to explore data, quickly check study feasibility, and identify patient cohorts for trials | Regulatory agencies and partners conducting safety analyses, signal detection, and evaluating medical product risks |

Each model has a distinct purpose and community. Your choice will depend on whether you need the analytical depth of OMOP, the patient-centered focus of PCORnet, the query speed of i2b2, or the regulatory precision of Sentinel. Understanding these core differences is the first step toward building a successful and sustainable data ecosystem.

Getting Your Data into a Common Data Model: The ETL Challenge

Picking the right common data model is a crucial first move, but it's just the start. The real heavy lifting happens during the ETL (Extract, Transform, Load) process, and frankly, this is where most projects get bogged down. Your raw source data is never as clean or organized as you hope; it’s a messy tangle of different formats, local coding systems, and structural quirks.

Turning that raw data into a pristine, research-ready state isn’t just about converting files. It’s a deep, deliberate transformation that requires a solid plan and a whole lot of attention to detail.

The Key Steps in a CDM Transformation

Getting data into any of the major common data models follows a pretty standard, if challenging, path. It all starts with figuring out exactly what you're working with.

  • Data Profiling: Think of this as your initial scouting mission. You need to analyze your source data to understand its structure, quality, and completeness. This is where you'll find all the inconsistencies, missing values, and weird formatting choices that need to be fixed.
  • Structural Mapping: Next, you draw up the blueprint. This is where you decide which columns in your source tables map to which columns in the target CDM. For example, you'd map your EHR’s patient_demographics table to the PERSON table in OMOP.
  • Vocabulary Mapping: Here’s the big one. This is the notoriously difficult step of translating all your local, proprietary codes (like internal lab codes) into the standard terminologies the CDM requires, such as SNOMED CT, LOINC, or RxNorm.
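The structural-mapping step above can be sketched as a small transform function. Everything here is hypothetical: the source column names, the value map, and the simplified PERSON shape are invented for illustration; only the OMOP gender concept IDs (8507 male, 8532 female) are real standard concepts.

```python
# Hypothetical source row from an EHR's patient_demographics table.
source_row = {"pat_id": "A-1001", "dob": "1964-11-02", "sex": "F"}

# Value-level transform: local sex codes -> OMOP standard gender concepts.
GENDER_MAP = {"F": 8532, "M": 8507}

def to_person(row: dict) -> dict:
    """Map one demographics row to a (simplified) OMOP PERSON record."""
    return {
        "person_source_value": row["pat_id"],
        "year_of_birth": int(row["dob"][:4]),
        "gender_concept_id": GENDER_MAP[row["sex"]],
    }

person = to_person(source_row)
print(person)
```

Real ETL frameworks express the same idea at scale, but the core of structural mapping is exactly this: a declared correspondence between source columns and CDM columns, plus small value-level transforms where the representations differ.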

The sheer scale of this challenge becomes clear when you consider HL7 version 2. This data standard is the lifeblood of hospital operations, used in over 35 countries and by an incredible 95% of US healthcare organizations. What this means for data engineers is that the road to a CDM almost always starts with parsing complex HL7v2 messages. This makes having a robust vocabulary mapping strategy an absolute necessity. You can explore the full research on HL7v2's dominance to see just how widespread it is.
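To give a feel for what "parsing HL7v2 messages" means in practice, here is a minimal sketch using only stdlib string splitting on a fabricated two-segment message. A production pipeline would use a proper HL7 library and handle escaping, repetitions, and optional fields, which this toy parser ignores.

```python
# A fabricated, minimal HL7v2 message. Segments are separated by \r,
# fields by |, and components by ^.
msg = (
    "MSH|^~\\&|LAB|HOSP|APP|FAC|202401020930||ORU^R01|1|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19641102|F\r"
)

def parse_pid(message: str) -> dict:
    """Pull a few demographic fields out of the PID segment."""
    for segment in message.strip().split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID":
            family, given = fields[5].split("^")[:2]   # PID-5: patient name
            return {
                "mrn": fields[3].split("^")[0],        # PID-3: patient ID
                "name": f"{given} {family}",
                "dob": fields[7],                      # PID-7: date of birth
                "sex": fields[8],                      # PID-8
            }
    raise ValueError("no PID segment")

print(parse_pid(msg))
```

Even this tiny example shows why the step is unavoidable: demographics, identifiers, and codes arrive packed into positional, delimiter-separated fields that must be unpacked before any structural or vocabulary mapping can begin.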

Moving Beyond Manual Spreadsheets

For years, vocabulary mapping was a manual nightmare. Teams would spend countless hours hunched over massive spreadsheets, painstakingly looking up thousands of local codes and trying to find the right standard concept. This approach wasn't just slow—it was a recipe for errors and inconsistencies that could undermine the integrity of the entire dataset.

The flowchart below shows the high-level decision-making process that comes before this intensive ETL work begins. It reinforces that a clear goal is the foundation for the entire project.

Flowchart outlining the CDM selection process with steps: Define Goal, Compare Options, and Select & Implement.

Choosing a CDM is a strategic process in itself, but once you’ve made your pick, the focus shifts entirely to the technical execution of the ETL.

The move from manual mapping to programmatic, API-driven workflows is the single biggest leap forward for modern CDM implementation. It replaces manual labor and guesswork with speed, accuracy, and reproducibility.

Instead of wrestling with messy spreadsheets, developers can now write scripts that call an API to perform these lookups in an instant. This simple change turns a major bottleneck into a fast, automated, and scalable part of the data pipeline.

Solving Real-World ETL Problems

Even with better tools, some stubborn challenges remain. You have to track data provenance—knowing where every data point came from and how it was changed—to ensure your research is valid. Maintaining mapping consistency across different data sources and over time is another huge hurdle.

This is where dedicated developer tools and clear documentation really prove their worth. For instance, a service like OMOPHub is built specifically to make vocabulary mapping less painful by providing API-based tools. Having clear guides and code examples allows developers to plug powerful vocabulary services directly into their ETL workflows without reinventing the wheel.

Pro Tip: Use SDKs to speed up your ETL development. Rather than building API clients from scratch, you can use pre-built libraries like the omophub-python SDK or the omophub-R SDK to handle complex vocabulary lookups with just a few lines of code.

By embracing these modern, API-first workflows, data teams can finally get past the traditional roadblocks of ETL. This not only gets the data ready for analysis but also paves the way for a new generation of tools that can manage the entire journey, from raw file to research-ready database.

Speeding Up Vocabulary Mapping with Modern Developer Tools

Anyone who has wrestled with transforming raw source files into a research-ready common data model knows the real bottleneck: vocabulary mapping. The traditional approach is an operational headache. You end up managing local databases for massive terminologies like SNOMED CT, LOINC, and RxNorm, fighting with manual version control, and sinking engineering hours just into keeping the lights on.

Frankly, this old-school method is a dead end. It creates a slow, brittle system that holds back your entire ETL development process and pulls focus from the real goal—actually using the data to generate insights.

It's Time to Move from Local Databases to APIs

The answer is to get out of the database management business and switch to a simple REST API and developer-friendly SDKs. Instead of maintaining your own vocabulary database, your ETL script makes a direct call to a service that manages all that complexity behind the scenes. This is a fundamental shift that pays huge dividends for any data team.

An API-first approach means your team is always working with the most current vocabularies, and development cycles get a whole lot faster. The OMOP Common Data Model, a true workhorse in observational health, really highlights why this efficiency matters: because the model standardizes structure and vocabulary up front, redundant ETL work is slashed, cutting data prep time from months down to weeks.

Actionable Examples for Your ETL Pipeline

With the right tools, tasks that once took days of manual lookups can be done in just a few lines of code. By bringing the work directly into your script, this programmatic approach makes your ETL process faster, more reliable, and far easier to debug.

For instance, a developer can map a local billing code to a standard SNOMED concept without ever leaving their code editor. Here’s a quick look at what a basic concept search looks like with the omophub-python SDK.

from omophub import Client

# Initialize the client with your API key
client = Client(api_key='YOUR_API_KEY')

# Search for a standard concept
concepts = client.search.concepts(query='Type 2 diabetes mellitus')

for concept in concepts:
    print(f"{concept['concept_name']} (ID: {concept['concept_id']}, Vocab: {concept['vocabulary_id']})")

This short Python script connects to the API and immediately finds a standard clinical concept—a routine but critical vocabulary mapping task. You can find more examples in the full omophub-python SDK. The best part? There’s no database to maintain, and you’re guaranteed to be using the latest ATHENA vocabulary versions.

It’s just as straightforward for R developers. The example below uses the omophub-R SDK to pull all the trade names associated with a single generic drug concept—a much more complex query.

library(omophub)

# Configure your API key
Sys.setenv(OMOPHUB_API_KEY = "YOUR_API_KEY")

# Find relationships for a concept (e.g., RxNorm 'Lisinopril')
relationships <- get_concept_relationships(concept_id = 19075766, relationship_id = 'RxNorm has tradename')

# Print the trade names
print(relationships)

Here, the R script is navigating concept relationships to build out a more complete picture of a drug, a key function for advanced tasks like semantic mapping. You can check out more use cases in the omophub-R SDK. This is the kind of automation that makes sophisticated mapping truly scalable.

Tips for Efficient Vocabulary Management

Integrating developer tools like these can completely change your ETL workflow. Here are a few tips for getting the most out of an API-driven approach.

  1. Automate Your Versioning: Use a service that automatically stays in sync with the latest ATHENA vocabulary releases. This eliminates the tedious and error-prone job of manual updates.
  2. Validate Mappings Programmatically: Build checks directly into your ETL scripts. After mapping a local code, you can hit the API again to validate its domain, class, and relationships, catching potential errors before they pollute your data.
  3. Explore Interactively First: Before diving into code, use a web tool for quick exploration. The OMOPHub Concept Lookup is perfect for this, letting both developers and non-programmers test out mapping ideas without writing a single line of code.
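Tip 2 can be sketched as a pure validation function over whatever record your lookup service returns. The field names shown (domain_id, standard_concept, valid_end_date) mirror common OMOP vocabulary columns, but the function and its checks are illustrative, not the actual contract of the OMOPHub API or any other service.

```python
def validate_mapping(source_code: str, concept: dict,
                     expected_domain: str) -> list[str]:
    """Return a list of problems with a proposed code -> concept mapping."""
    problems = []
    if concept.get("domain_id") != expected_domain:
        problems.append(
            f"{source_code}: expected domain {expected_domain}, "
            f"got {concept.get('domain_id')}"
        )
    if concept.get("standard_concept") != "S":
        problems.append(f"{source_code}: target is not a standard concept")
    if concept.get("valid_end_date", "2099-12-31") < "2024-01-01":
        problems.append(f"{source_code}: concept is deprecated")
    return problems

# A hypothetical local lab code accidentally mapped to a Condition concept:
bad = {"concept_id": 201826, "domain_id": "Condition", "standard_concept": "S"}
issues = validate_mapping("LAB:GLU-FAST", bad, expected_domain="Measurement")
print(issues)
```

Checks like these run in milliseconds inside the ETL loop, so a mis-mapped code is caught at load time rather than months later during analysis.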

By leaning on modern SDKs and APIs, you stop managing vocabulary databases and start building intelligent data pipelines. It’s a shift that lets your team focus on creating value, not on maintaining the plumbing.

Ultimately, these tools offer a direct path to faster, more reliable, and more scalable CDM implementations. To get started, take a look at the official OMOPHub documentation for detailed guides and API references that can help you accelerate your next project.

The Future is Harmonization: Getting CDMs and FHIR to Work Together

Diagram illustrating the conversion and flow of medical data from OMOP, PCORnet, and i2b2 to FHIR.

While different common data models each have their strengths, the real breakthrough in healthcare data isn't about picking a single winner. It’s about making them all work together. This is the core idea of harmonization—creating a "Rosetta Stone" that allows data to move between different analytical models and the real-time systems used in clinical care. We're finally shifting from a competitive mindset to building a truly connected data ecosystem.

A major federal initiative, involving the FDA, NIH, and ONC, is already paving the way. This project successfully mapped major CDMs like OMOP, Sentinel, and i2b2 to the modern FHIR standard. With pilots and HL7 balloting completed by 2024, we have a clear path forward for genuine interoperability. This opens the door to generating much stronger evidence for both regulatory approvals and day-to-day clinical decisions. You can dive into the findings of this important harmonization project on the HHS website.

FHIR Is Not a Replacement for Common Data Models

It’s a common misconception, but it's crucial to understand that FHIR and CDMs are built for different jobs. They complement each other perfectly. FHIR is an API-first standard designed for real-time data exchange on a single patient—think of a doctor's app pulling a specific lab result. In contrast, analytical CDMs are structured for population-level research across massive, standardized datasets.

The most effective data architecture uses both. You can use FHIR APIs to pull operational data from source systems, then transform that data into an analytics-ready CDM like OMOP. This creates a clean separation between real-time clinical operations and large-scale research.

This two-part approach is really the future of healthcare data infrastructure. It allows organizations to keep the immense analytical power of their chosen CDM while gaining the real-time interoperability that FHIR provides. If you're exploring how to get these standards talking, it's worth understanding how the FHIR API can feed data into OMOP.
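As a sketch of that FHIR-to-OMOP handoff, the snippet below flattens a hand-written FHIR R4 Condition fragment into a simplified condition_occurrence row. The code-to-concept lookup is stubbed with a hard-coded dictionary standing in for a real terminology service, and the row shape omits most required OMOP columns.

```python
# Hand-written fragment of a FHIR R4 Condition resource.
fhir_condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/1"},
    "code": {
        "coding": [{"system": "http://hl7.org/fhir/sid/icd-10-cm",
                    "code": "E11.9"}]
    },
    "onsetDateTime": "2019-03-02",
}

# Stub standing in for a terminology service: ICD-10-CM code ->
# standard OMOP concept_id (201826 = type 2 diabetes mellitus).
ICD10CM_TO_STANDARD = {"E11.9": 201826}

def condition_to_omop(resource: dict) -> dict:
    """Flatten a FHIR Condition into a (simplified) condition_occurrence row."""
    code = resource["code"]["coding"][0]["code"]
    return {
        "person_id": int(resource["subject"]["reference"].split("/")[1]),
        "condition_concept_id": ICD10CM_TO_STANDARD[code],
        "condition_source_value": code,   # provenance: keep the source code
        "condition_start_date": resource["onsetDateTime"][:10],
    }

row = condition_to_omop(fhir_condition)
print(row)
```

Note that the source code is preserved in condition_source_value alongside the standard concept, which is how the OMOP model keeps provenance intact while still enabling standardized queries.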

Harmonization Tips and Preparing Your Data for the Future

Building a bridge between these different standards depends entirely on having a reliable engine to manage the transformation logic. This is where a dedicated vocabulary service becomes absolutely essential.

  • Establish a Vocabulary 'Source of Truth': To map between FHIR code systems and the terminologies used in your CDM, you need one centralized, version-controlled vocabulary service. This is the only way to guarantee consistency across all your data pipelines.
  • Use Tools for Transformation Logic: Avoid the trap of hard-coding mapping rules. Instead, lean on developer tools that can programmatically build and maintain this logic. For example, an SDK like the omophub-python SDK can handle the complex vocabulary lookups required to translate between standards.
  • Plan for a Multi-Model World: It's a mistake to assume you'll only ever use one standard. By building your data architecture around a flexible vocabulary service, you give yourself the ability to adapt as new standards emerge or as your organization's needs evolve.

Ultimately, a managed vocabulary service like OMOPHub acts as the central engine for this new, harmonized world. It takes on the immense complexity of maintaining and cross-walking different terminologies, freeing up your team to focus on what matters: building the bridges between systems. This approach doesn't just speed up development; it makes your organization's data strategy more resilient and ready for a more connected healthcare future.

Common Questions on Implementing a Data Model

Even with a perfect plan on paper, the road from raw electronic health records to a clean, analytics-ready dataset is never a straight line. When it comes time to actually implement a common data model, teams always run into the same practical hurdles. Here are some direct answers to the questions we see pop up time and time again.

How Long Does It Really Take to Convert Data to the OMOP CDM?

There’s no single answer, but we can give you a realistic range. A small, focused project with clean source data might wrap up in a few weeks. However, a large-scale enterprise conversion, especially one pulling from multiple legacy systems, will almost certainly take six to twelve months.

The biggest time sink, without a doubt, is vocabulary mapping. This is where the real work happens—translating all your local, proprietary codes into standard terminologies. We consistently see this phase consume 60-70% of the total project effort. Manually wrestling with thousands of unique lab codes, billing terms, and local drug names is a notoriously slow, error-prone grind that can stall a project for months.

Expert Insight: This is where modern tooling can completely change the game. Instead of relying on manual spreadsheets and local databases, API-driven services automate the tedious work of concept lookups and relationship mapping. This approach can shrink your vocabulary mapping timeline from months down to weeks and lets your team embed the logic directly into your ETL scripts using tools like the omophub-python SDK.

Should We Ever Use More Than One Common Data Model?

Yes, and for many larger organizations, it’s actually a smart strategy. Different common data models are built for different jobs. It’s not about finding the one "perfect" model but using the right tool for the task at hand.

For example, a research hospital might find that i2b2 is unbeatable for its speed in cohort discovery and clinical trial recruitment. The same hospital could then use OMOP for its strengths in conducting massive, longitudinal observational studies across a network.

The secret to making a multi-model approach work is ironclad data governance. You need a centralized vocabulary service to act as the single source of truth, ensuring that a "myocardial infarction" means the same thing in your i2b2 database as it does in your OMOP instance. As harmonization with standards like FHIR improves, moving data between these models is becoming much more straightforward.

What Are the Biggest Mistakes We Should Avoid?

Over the years, we've seen a few common missteps that can quickly derail a CDM project. Being aware of them from the start can save you a world of headaches.

  • Underestimating Vocabulary Mapping: We can't say it enough. This is the most complex, time-intensive part of the entire process. Teams that don't allocate enough time or the right tools for mapping almost always face major delays.
  • Lacking Strong Data Governance: You must have clear, documented rules for how data is mapped, cleaned, and managed. Without this, you’ll produce inconsistent, unreliable data that no one will trust.
  • Treating ETL as a One-Time Project: Your data is always changing. New codes are created, standards are updated, and source systems evolve. Your ETL pipeline must be designed as a living, breathing workflow you can maintain, not a task you check off a list once.
  • Building Your Own Vocabulary Server: It’s tempting to think you can just download SNOMED and LOINC and stand up your own database. This is a classic mistake. It diverts skilled engineers from their real job and saddles you with the thankless task of managing constant terminology updates. Use a managed service instead.

A great way to gauge your project's complexity early on is to use a free tool like the OMOPHub Concept Lookup. It will give you a quick, realistic preview of the mapping challenges ahead.

If We're Adopting FHIR, Do We Still Need a CDM?

Absolutely. This is a common point of confusion, but FHIR and common data models serve very different—and complementary—purposes.

Think of it this way: FHIR is an interoperability standard built for exchange. It’s designed for real-time, one-patient-at-a-time transactions, like a mobile app pulling a single patient's latest lab results. A CDM like OMOP is an analytics standard. It's designed for population-level research on huge, standardized datasets containing millions of patients.

A powerful, modern data architecture uses both. FHIR serves as the pipeline, pulling data in a standard format from various source systems. From there, an ETL process maps that FHIR data into an analytics-ready CDM like OMOP. This gives you the best of both worlds: real-time data access and powerful population-level insight. This isn't just theory; the federal CDMH project provides official guidance and mappings, confirming that FHIR and CDMs work best as partners. You can find more detailed guides in the official OMOPHub documentation.


Ready to eliminate vocabulary database headaches and accelerate your CDM implementation? OMOPHub provides instant REST API access to all OHDSI ATHENA vocabularies with production-ready SDKs for Python, R, and TypeScript. Build faster, ensure compliance, and focus on generating insights—not on managing infrastructure. Learn more at https://omophub.com.
