Master the OMOP Common Data Model: Your Guide to Standardized Healthcare Data

Dr. Jennifer Lee
March 20, 2026
27 min read

The OMOP Common Data Model (CDM) is a shared language for healthcare data. It’s a standard that takes wildly different information from sources like electronic health records, insurance claims, and patient registries and reshapes it into a single, consistent, and predictable structure. This standardization is the key to unlocking reliable, large-scale research that can be reproduced across different organizations and even countries.

What Is the OMOP Common Data Model

Healthcare data flow diagram: EHR, insurance claims, patient registries, and a database discussed by doctors.

If you've ever worked with healthcare data, you know the biggest headache: every system speaks its own language. A diagnosis code in one hospital's EHR might not match another's, and the way a patient's lab results are stored can vary completely from one database to the next. It's like trying to plug an American device into a European outlet: the structures just don't fit.

The OMOP Common Data Model is the universal adapter in this scenario. It provides a standard structure and a common set of vocabularies to transform that messy, siloed data into a uniform format. Once your data is in the OMOP CDM, you can "plug in" your analytical tools and run studies on any other OMOP-compliant database in the world, knowing you're comparing apples to apples.

This is what makes massive, multi-site studies finally possible. By standardizing not just the data structure but the meaning behind the data, OMOP ensures that a "condition" in a database from Spain means the exact same thing as a "condition" in a database from South Korea. This is the foundation for generating real-world evidence you can actually trust.

To better understand the shift OMOP represents, it's helpful to compare it directly with the conventional ways we've handled healthcare data.

OMOP vs Traditional Healthcare Data Approaches

| Challenge | Traditional Approach | OMOP CDM Solution |
| --- | --- | --- |
| Data Inconsistency | Data is stored in proprietary formats unique to each source (EHR, claims). "Blood pressure" might be recorded differently everywhere. | A single, standardized structure (the "common data model") ensures data is stored uniformly across all sources. |
| Semantic Differences | The same concept (e.g., Type 2 Diabetes) is represented by dozens of different codes (ICD-9, ICD-10, SNOMED CT). | Standardized vocabularies map all local source codes to a single, common "standard concept ID" for each clinical idea. |
| Lack of Reproducibility | An analysis written for one hospital's data must be completely rewritten for another, making validation nearly impossible. | A single analytical script can be run on any OMOP-compliant database worldwide without modification, enabling true reproducibility. |
| Inefficient Collaboration | Sharing and combining data for large studies is slow, costly, and raises major privacy concerns. | A federated network model is used. The analysis code is sent to the data, and only aggregated results are returned, protecting patient privacy. |

This shift isn't just an academic exercise; it's a practical solution born out of real-world needs, driven by a massive global community.

The Collaborative Power Behind OMOP

The OMOP CDM isn't some corporate product. It's an open-source, community-driven initiative stewarded by the Observational Health Data Sciences and Informatics (OHDSI) community. This is a global network of researchers, data scientists, and clinicians all working together to refine and expand the model.

The project's roots go back to 2008, when it was first funded by the US Food and Drug Administration (FDA) to find better ways to monitor drug safety. Today, the OHDSI network has grown to over 2,000 collaborators across 74 countries. Together, they maintain standardized data on roughly 800 million patients, a truly staggering scale that demonstrates how OMOP is making sense of petabytes of previously unusable information. You can learn more about this journey in this detailed study on OHDSI's global network and its impact.

Accelerating Evidence-Based Medicine

By creating a common structure and vocabulary, the OMOP CDM allows analytical code to be shared and executed across a network of databases. A researcher can write a single query to study a drug's side effects and run it on data from hospitals in Asia, Europe, and North America without rewriting a single line of code.

This fundamentally changes how quickly we can generate evidence. It works by:

  • Enabling Large-Scale Studies: Combining data from millions of patients gives us the statistical power to spot rare drug side effects or identify which treatments work best for specific populations.
  • Promoting Reproducibility: Because the analyses are standardized, other researchers can easily run the same code on their own data to validate the findings. This is a cornerstone of good science.
  • Protecting Patient Privacy: The model is designed for a federated approach. Instead of pooling sensitive data, the analytical code is sent to each institution. The data never leaves its secure environment; only anonymous, aggregated results are returned.

Tip for Getting Started: The best way to grasp how this works is to see the vocabularies in action. Try using a free tool like the OMOPHub Concept Lookup to search for a medical code you're familiar with, like an ICD-10 code. You'll see how it's mapped to a standard concept, which is the "universal language" OMOP uses. It’s a great hands-on way to understand the core of the model.

Navigating the Standardized Vocabularies

Digital tablet displays a healthcare data model with interconnected RxNorm and SNOMED terms, touched by a hand.

If the data tables are the grammar of the OMOP Common Data Model, the Standardized Vocabularies are its shared dictionary. Think of them as the "Rosetta Stone" of health data: the critical component that translates thousands of different coding systems into a single, unified language. Without this, true analysis across different datasets would be impossible.

Just imagine the chaos. One hospital might log "Type 2 diabetes mellitus" with the ICD-10-CM code E11.9. A research database could have the same condition under the older ICD-9 code 250.00. An international partner might even use a completely different national code. To a computer, these are just meaningless strings, but clinically, they all point to the exact same disease.

This is the problem OMOP's vocabularies were designed to solve through a process called mapping. Every code from a source system, whether it's ICD-10, LOINC, or a proprietary EHR terminology, is mapped to a single, authoritative standard concept. This ensures that no matter how a diagnosis, drug, or procedure was first recorded, it all resolves to the same identifier in the final OMOP database.

The Power of Standard Concepts

The end goal is to represent every unique clinical idea with a single, unambiguous number. For conditions, the preferred standard is SNOMED CT; for drugs, it's RxNorm.

Let's revisit our diabetes example to see this in action:

  • A patient record with the ICD-10-CM code E11.9 is mapped to the standard SNOMED concept for Type 2 diabetes mellitus (SNOMED code 44054006, stored in OMOP under concept ID 201826).
  • Another record with the ICD-9-CM code 250.00 is mapped to that same standard concept.

Now, a researcher can simply query for concept ID 201826 and instantly retrieve every single patient with Type 2 diabetes. The original coding system becomes irrelevant for the analysis. This is the fundamental mechanism that makes large-scale, federated research possible with OMOP.
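As a toy illustration of that idea in plain Python (the records below are made up, but 201826 is the real OMOP standard concept ID for Type 2 diabetes mellitus, and 319635 the one for essential hypertension):

```python
# Toy rows: two sources recorded Type 2 diabetes with different codes,
# but ETL mapped both to the same standard concept (201826).
condition_occurrence = [
    {"person_id": 1, "condition_source_value": "E11.9",  "condition_concept_id": 201826},
    {"person_id": 2, "condition_source_value": "250.00", "condition_concept_id": 201826},
    {"person_id": 3, "condition_source_value": "I10",    "condition_concept_id": 319635},
]

# One filter on the standard concept finds every T2DM patient, no matter
# which coding system the source used.
t2dm_patients = sorted(
    row["person_id"]
    for row in condition_occurrence
    if row["condition_concept_id"] == 201826
)
print(t2dm_patients)  # [1, 2]
```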

But the vocabularies are much more than a simple code-to-code translation service. They build a rich network of relationships. A concept for a specific drug like "Metformin 500mg Oral Tablet" is hierarchically linked to its ingredients ("Metformin") and its therapeutic class ("Biguanides"). This lets you ask much bigger questions, like "Show me all patients on any Biguanide drug."
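Here is a minimal sketch of that kind of hierarchy query, using a tiny in-memory stand-in for OMOP's CONCEPT_ANCESTOR table. All of the concept IDs below are invented for illustration:

```python
# Hypothetical concept IDs, for illustration only.
BIGUANIDES = 1000  # drug class
concept_ancestor = [
    # (ancestor_concept_id, descendant_concept_id)
    (1000, 1001),  # Biguanides -> Metformin (ingredient)
    (1001, 1002),  # Metformin -> Metformin 500mg Oral Tablet
    (1000, 1002),  # CONCEPT_ANCESTOR also stores transitive pairs
]

drug_exposure = [
    {"person_id": 1, "drug_concept_id": 1002},  # the metformin tablet
    {"person_id": 2, "drug_concept_id": 2002},  # some unrelated drug
]

# "All patients on any Biguanide": expand the class to its descendants,
# then filter exposures. In SQL this is a join against CONCEPT_ANCESTOR.
descendants = {d for a, d in concept_ancestor if a == BIGUANIDES} | {BIGUANIDES}
patients = sorted(r["person_id"] for r in drug_exposure
                  if r["drug_concept_id"] in descendants)
print(patients)  # [1]
```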

Practical Tips for Vocabulary Management

Working with vocabularies is a continuous effort, not a one-and-done setup. As you get your hands dirty with OMOP, keep these core principles in mind.

  1. Always Map to Standard Concepts: This is the golden rule. Your goal is to populate the _concept_id fields (like condition_concept_id) with standard concept IDs. Just as importantly, you should preserve the original code in the _source_concept_id field for traceability and validation.

  2. Handle Non-Standard Mappings: What happens when a source code has no direct standard equivalent? The convention is to set the standard _concept_id field to 0 (the "No matching concept" concept) while still recording the original code in the _source_value and, where possible, _source_concept_id fields. While this data won't be usable in federated network studies, it's still preserved and valuable for internal analyses.

  3. Use Programmatic Access: Looking up thousands of codes by hand is a non-starter. You need to integrate vocabulary lookups directly into your ETL scripts using tools like the OMOPHub SDKs for Python or R. This automates the mapping process, saving countless hours and reducing human error.

  4. Stay Updated: Vocabularies aren't static. New medical codes are created and mappings are refined all the time. You'll need a reliable process to refresh your vocabulary tables regularly to ensure your data mappings remain accurate and current.
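The first two tips can be sketched as a single mapping step. `map_condition` and the `mappings` dict below are hypothetical stand-ins for your vocabulary lookup; 201826 is the real standard concept for Type 2 diabetes mellitus:

```python
# Sketch of tips 1 and 2: populate the standard *_concept_id, always keep
# the raw source code, and fall back to concept_id 0 ("No matching concept")
# when no standard mapping exists.
def map_condition(source_code: str, mappings: dict) -> dict:
    return {
        "condition_concept_id": mappings.get(source_code, 0),
        "condition_source_value": source_code,  # preserved for traceability
    }

mappings = {"E11.9": 201826}  # ICD-10-CM -> standard concept for T2DM

print(map_condition("E11.9", mappings))
print(map_condition("XYZ-LOCAL-42", mappings))  # unmapped -> concept_id 0
```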

Mastering the vocabularies is how you move from just storing data to truly unlocking its analytical power. If you want to dive deeper into specific terminologies, you can learn more about SNOMED CT code lookup in our detailed guide.

A Developer's Tour of the Core Data Tables

To really get your head around the OMOP Common Data Model, you have to stop thinking in hypotheticals and start looking at how a patient's story gets built, piece by piece, across the core tables. This isn't just some random collection of database tables. It’s a carefully designed framework for capturing the who, what, where, and when of every single patient interaction.

Think of it like being a detective assembling a case file from scattered notes, lab reports, and witness statements. You need a system to organize it all, and that's what OMOP provides.

At the center of it all are the Clinical Data Tables. These are the event logs of a patient's journey through the healthcare system. Each table in this group is built to hold a specific kind of clinical fact: a diagnosis, a prescription, a lab result. They answer the fundamental questions: "What happened?" and "When did it happen?"

But events don't happen in a vacuum. That's where the Health System Data Tables come in. They provide the crucial context, describing the environment where the clinical events took place. They answer, "Who delivered the care?" and "Where was it provided?" This layer connects the clinical data to real-world providers, clinics, and hospitals.

And finally, the Vocabulary Tables we've touched on before serve as the universal dictionary, making sense of all the different codes and terms used across the clinical tables. Together, these three table groups are the machinery that turns messy source data into a complete, analyzable patient record.

The Clinical Event Tables

This is where the action is. Each clinical event table is designed to capture a distinct "domain" of information. A single trip to a primary care doctor can easily generate records in half a dozen of these tables.

For instance, a diagnosis of hypertension goes into CONDITION_OCCURRENCE. The prescription for lisinopril that follows gets logged in DRUG_EXPOSURE. And the blood pressure reading taken by the nurse ends up in MEASUREMENT.

This "one event, one domain" principle is what makes the OMOP CDM so powerful for analysis. It forces a clean separation of concerns, which means you can write queries to find specific types of clinical facts without having to hunt through multiple, ambiguous tables.

Let's walk through the most common ones. Below is a simple breakdown of the main clinical tables, what they're for, and the kind of source data you'd typically map into them.


Key OMOP Clinical Data Tables and Their Purpose

| Table Name | Primary Purpose | Example Source Data |
| --- | --- | --- |
| CONDITION_OCCURRENCE | Captures diagnoses, signs, and symptoms. This is the patient's "problem list." | ICD-10-CM diagnosis codes from billing records, problem lists from an EHR. |
| DRUG_EXPOSURE | Logs every medication a patient is exposed to, from prescriptions to hospital administrations. | Pharmacy dispensing records (NDC codes), e-prescribing data, inpatient MAR. |
| PROCEDURE_OCCURRENCE | Contains records of surgeries, therapies, and other medical procedures. | CPT-4 or HCPCS codes for procedures from claims data, surgical scheduling systems. |
| MEASUREMENT | Holds structured data with a numeric value and unit, primarily lab results and vital signs. | Lab results with LOINC codes, vital sign flowsheets (e.g., blood pressure, BMI). |
| OBSERVATION | A flexible "catch-all" for clinical facts that don't fit neatly elsewhere. | Social history (smoking status), family history, survey responses, data from notes. |

These tables are the workhorses of the CDM. While this covers the heavy hitters, you can find a complete list of all tables and their detailed specifications in the official OMOP CDM Documentation, which is always the ultimate source of truth.

The Person and Health System Tables

Clinical data is just a stream of facts without context. The Person and Health System tables are what anchor these events to a specific person, place, and time.

  • PERSON: This is the master table. Every record represents one unique individual, and every clinical event table ultimately links back to a person_id. It’s the spine of the entire model.
  • PROVIDER: This table holds the "who." It contains information about the individual healthcare professionals, like physicians and nurses, who deliver care.
  • CARE_SITE: This is the "where." It represents the physical locations where care happens, from a small private practice to a massive academic medical center.
  • VISIT_OCCURRENCE: This table chronicles a patient's encounters with the healthcare system, such as an inpatient hospital stay or a simple outpatient visit. It acts as a container, linking together all the different events that occurred during a specific encounter.

Putting It All Together: A Practical Example

Let's see how this works in practice. Imagine a patient, Jane Doe (person_id = 101), goes to her annual check-up (visit_occurrence_id = 404) with Dr. Smith (provider_id = 202) at General Hospital (care_site_id = 303).

  1. Dr. Smith diagnoses her with hyperlipidemia, recorded with the ICD-10 code E78.5. This event creates a new row in the CONDITION_OCCURRENCE table. The record links to Jane (person_id = 101) and her visit (visit_occurrence_id = 404), and the ETL process maps the source code E78.5 to its standard SNOMED concept.
  2. Next, she is prescribed Atorvastatin 20mg. This generates a record in the DRUG_EXPOSURE table, also linking to Jane and her visit. The medication is identified by its standard RxNorm concept.
  3. Finally, a blood panel is ordered. The "Total Cholesterol" result of 240 mg/dL gets stored as a new row in the MEASUREMENT table, again linked to the same person and encounter.
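The three events above could be sketched as OMOP-style rows. The IDs come from the running example; the *_concept_id values marked as placeholders are invented, not real vocabulary lookups:

```python
# Jane Doe's visit as OMOP-style rows.
condition_row = {
    "person_id": 101, "visit_occurrence_id": 404, "provider_id": 202,
    "condition_source_value": "E78.5",   # original ICD-10-CM code
    "condition_concept_id": 9990001,     # placeholder for the standard SNOMED concept
}
drug_row = {
    "person_id": 101, "visit_occurrence_id": 404,
    "drug_source_value": "Atorvastatin 20mg",
    "drug_concept_id": 9990002,          # placeholder for the RxNorm standard concept
}
measurement_row = {
    "person_id": 101, "visit_occurrence_id": 404,
    "measurement_source_value": "Total Cholesterol",
    "value_as_number": 240.0,
    "unit_source_value": "mg/dL",
}

# Every row links back to the same person and the same encounter.
rows = [condition_row, drug_row, measurement_row]
assert all(r["person_id"] == 101 and r["visit_occurrence_id"] == 404 for r in rows)
```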

This step-by-step process shows how scattered pieces of information from a single visit are systematically filed away inside the OMOP CDM. It's this transformation, from a simple narrative into a collection of standardized, interconnected data points, that lays the foundation for all powerful, large-scale research.

Mastering the ETL Process to OMOP

Let's be honest: getting your raw, messy source data into the OMOP Common Data Model is where the real work happens. This journey, known as the Extract, Transform, and Load (ETL) process, is the pragmatic, often grueling, task of turning chaotic healthcare data into a powerful analytical asset. It’s a job that demands careful planning, a deep understanding of the data's clinical context, and a methodical approach.

The ETL process is far more than just moving data from point A to point B. It's about fundamental translation and reconstruction. You're taking local codes, messy units, and narrative text and reshaping them into a coherent, standardized patient story that can be analyzed at scale. This transformation is precisely what unlocks the powerful, reproducible research OMOP is built for.

Think of it like piecing together a patient's complete story from scattered clues. A doctor's note, a lab result, and a new prescription are all individual data points. The ETL process organizes each of these into the right place within the OMOP framework.

A patient's story process flow showing doctor's note, lab result, and prescription steps.

As the diagram shows, each piece of source information flows into a specific, structured domain, creating an interconnected patient record that's finally ready for analysis.

A Battle-Tested ETL Workflow

Every successful OMOP conversion follows a predictable path. I've seen teams try to rush these steps, and it almost always leads to poor data quality, endless debugging, and having to start over. Sticking to a structured workflow is your best defense against that.

  • Source Data Profiling: Before you write a single line of code, get to know your source data intimately. This means analyzing every source table, documenting its structure and contents, and realistically assessing its quality. This is where you'll spot the tricky mapping challenges before they become roadblocks.
  • Semantic Mapping: This is the heart of the whole operation. Here, you'll create detailed specifications that dictate exactly how each source column gets transformed and where it lands in the OMOP CDM. A huge part of this is mapping your local, source-specific codes to standard concepts using the OMOP vocabularies.
  • Implementation and Testing: With your map as a blueprint, you can finally start building the ETL scripts. Knowing your way around tools like Python for ETL is a massive advantage for processing and transforming large datasets efficiently. Always start with a small slice of data to test your logic before unleashing it on the full dataset.
  • Validation and Quality Assurance: Once the data is loaded, you have to prove it's correct. OHDSI provides excellent tools, such as Achilles and the Data Quality Dashboard, for checking structural integrity and adherence to CDM conventions. We've also written about how to implement a good QC process in our guide on using a data quality dashboard.

Tackling Common Pitfalls

If you work on enough OMOP ETL projects, you start to see the same hurdles crop up time and time again. Knowing what they are ahead of time can save you a world of pain.

The drive to overcome these challenges is often fueled by the promise of a major productivity boost for data teams. We've seen health systems adopt OMOP and watch their staff's efficiency soar, which in turn helps with talent recruitment and retention. One analysis of OMOP-powered process mining even showed application rates as high as 99.64% for common clinical orders, proving just how precise it can be for real-world event log extraction.

A Critical Pitfall: One of the most common mistakes is deciding the target table (e.g., CONDITION_OCCURRENCE vs. OBSERVATION) based on the name of the source table. The correct approach is always to let the vocabulary guide the mapping. First, map your source code to a standard concept, then use the domain of that standard concept to determine the right destination table.
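A minimal sketch of that vocabulary-first routing, assuming you have already resolved the source code to a standard concept and its domain_id:

```python
# Route a record by the domain of its *standard* concept -- never by the
# name of the source table it came from. The dict mirrors the CDM's
# domain-to-table convention for the common clinical event tables.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Measurement": "measurement",
    "Observation": "observation",
}

def target_table(standard_domain_id: str) -> str:
    return DOMAIN_TO_TABLE[standard_domain_id]

# A code pulled from a "diagnoses" extract can still map to a concept in
# the Observation domain -- and then it belongs in OBSERVATION.
print(target_table("Observation"))  # observation
print(target_table("Condition"))    # condition_occurrence
```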

Other classic challenges include dealing with incomplete records-which might need to be flagged or imputed-and wrestling with complex temporal data, like figuring out the true duration of a drug exposure from messy start and end dates.

Actionable Tips for a Successful ETL

Based on years of experience, here are a few practical tips to guide your implementation and keep you aligned with community best practices.

  1. Start Small. Seriously. Begin with a tiny but representative sample of your data. This lets you iterate fast, test your logic, and fix issues before you try to process millions or billions of records.
  2. Document Everything. Keep a detailed mapping document that records every single decision. Explain why a source field was mapped a certain way and how you handled transformations. This "source-to-target" map will be invaluable for validation, maintenance, and for anyone who inherits your work.
  3. Leverage Community Resources. You are not on an island. The OHDSI community is a treasure trove of knowledge and shared experience. The official documentation at https://docs.omophub.com is your source of truth for table structures and conventions.
  4. Use Programmatic Vocabulary Tools. Looking up concepts manually in a web browser is not a scalable strategy. Integrate vocabulary lookups directly into your ETL scripts using tools like the OMOPHub SDKs for Python and R. They are designed for this exact purpose and will save you countless hours.

Cut Through the Noise: A Developer-First Approach with the OMOPHub API

Anyone who's been in the trenches with an OMOP Common Data Model implementation knows the ETL process is a beast. And one of the biggest, most persistent headaches? Vocabulary management.

Traditionally, this meant every single organization had to download, host, and maintain its own copy of the ATHENA vocabularies. We're talking about a massive, multi-gigabyte undertaking that involves wrangling databases, writing update scripts, and constantly checking for new releases. It’s a huge resource drain.

But what if you could sidestep that entire heavy lift?

This is where a developer-first mindset really changes the game. Instead of building and babysitting cumbersome infrastructure, you can tap into the entire suite of OMOP vocabularies programmatically through a simple REST API. This model completely removes the overhead, freeing up your team to focus on the high-value work-data mapping and analysis-not database administration.

OMOPHub was built on this very idea. Our production-ready SDKs let you plug powerful vocabulary queries directly into your ETL scripts or analytics apps in minutes.

A Faster Path to Vocabulary Integration

The most immediate benefit of an API-driven approach is speed. You're not just saving time; you're eliminating a major bottleneck from your project timeline. Your team can start running complex lookups and traversing concept relationships right away, without the days or weeks of waiting for a database to be provisioned, loaded, and validated.

We designed the OMOPHub platform specifically for this, with SDKs in the languages your data teams live in:

  • Python: The go-to for data engineering, ETL pipelines, and ML workflows. The SDK is on GitHub at OMOPHub/omophub-python.
  • R: A favorite among biostatisticians and clinical researchers. You'll find the R SDK on GitHub at OMOPHub/omophub-R.
  • TypeScript: Perfect for building interactive web apps or modern backend services that need to talk to OMOP vocabularies.

For an ETL developer at an EHR integrator or a data scientist on an AI team, this is a lifesaver. You can query vocabularies like LOINC and RxNorm effortlessly. With automated ATHENA syncs and global edge caching delivering sub-50ms latencies, your pipelines just run faster. Instead of wrestling with a local database, you’re free to build sophisticated cross-vocabulary mappings or train your models, drawing from a global network that powers analysis on hundreds of millions of patient records. If you want to see how top health systems are getting value from OMOP, check out this insightful OHDSI presentation.

Zero Maintenance, Maximum Focus

One of the most powerful advantages of using an API like OMOPHub is the shift to a zero-maintenance model for your vocabularies. All the version management and automatic updates are handled for you. You are always working with the latest official ATHENA releases without lifting a finger.

No more manual downloads. No more running update scripts. No more worrying that your vocabulary is out of sync with the community.

Tip for Developers: Want to see how it works? You can test lookups right now without writing a line of code. Our free OMOPHub Concept Lookup tool lets you explore concepts and their relationships. It’s a great way to get a feel for the data before you even touch the API.

Of course, as you start building with the API, following solid code documentation best practices becomes crucial. Clean, well-documented scripts are easier for the whole team to use and maintain.

Integrating the API in Your Python ETL

The real magic happens when you see it in your code. Let's walk through a common, fundamental task in any OMOP ETL: finding the standard concept for a given source code.

First, just install the SDK and initialize the client with your API key.

$ pip install omophub

from omophub import OmopHubClient

client = OmopHubClient(api_key="YOUR_API_KEY")

Now for the lookup. Let's say your source data has the ICD-10-CM code I10 for Essential Hypertension. Finding its standard SNOMED concept ID is just a single function call.

# Find the standard concept for an ICD-10-CM source code
results = client.concept.search(
    query="I10",
    vocabulary_id=["ICD10CM"],
    invalid_reason=None
)

# Print the standard concept ID and name
if results.source_to_standard_map:
    standard_concept = results.source_to_standard_map[0]
    print(f"Source Code: I10")
    print(f"Standard Concept ID: {standard_concept.target_concept_id}")
    print(f"Standard Concept Name: {standard_concept.target_concept_name}")
else:
    print("No standard concept found.")

That simple, readable block of code replaces what would have been complex SQL joins against a local database. It’s faster to write, far easier to maintain, and it runs on a high-performance, enterprise-grade infrastructure. This is how you build more robust, efficient ETL pipelines and get to valuable insights faster.

Putting Your OMOP Data to Work


You've put in the hours. The data has been profiled, mapped, and loaded into your shiny new OMOP instance. This is the moment the real work (and the real fun) begins. What questions can you finally answer now that your data speaks a common language? This is where the OMOP Common Data Model proves its worth, turning your structured data into an engine for generating real-world evidence.

Since the OHDSI collaborative kicked off in 2014, the model has been the foundation for large-scale, distributed analytics that were once just a pipe dream. The proof is in the results: the community has produced over 340 OHDSI-published papers, with the output doubling between 2019 and 2020 alone. These aren't just academic exercises; they drive everything from federated learning to health economics.

In one multi-site project, for instance, analysts found that a federal dataset showed 6.44 outpatient visits per person-month, while a civilian system showed just 2.05. Without a common model, reconciling that difference would be a nightmare. With OMOP, the difference becomes visible and quantifiable, allowing researchers to build unbiased cohorts that account for these systemic variations.

The standardized structure is what lets you graduate from basic reporting to asking complex clinical questions that can be answered consistently, whether your data is from a hospital in Boston or a claims database in Berlin.

Use Case 1: Population Characterization

One of the first things you'll want to do is simply get to know your data through population-level characterization. This is all about summarizing the baseline clinical and demographic features of a patient group. It’s how you answer the most fundamental questions.

For example, you might need to know, "What are the most common comorbidities in our newly diagnosed hypertension cohort?" In a pre-OMOP world, this could be a major headache involving mapping dozens of different coding systems. With OMOP, every diagnosis code points to a single standard concept ID, making the query surprisingly straightforward.

Tip for Analysts: A well-defined cohort is the bedrock of any good study. Before you chase a complex hypothesis, always start by characterizing your population to see if it actually matches your clinical assumptions. Our guide on effective cohort study design is a great place to start if you want to dig into the principles.

To find the top 10 conditions in a hypertension cohort, a simplified query might look something like this:

SELECT
  c.concept_name,
  COUNT(DISTINCT co.person_id) AS patient_count
FROM condition_occurrence AS co
JOIN concept AS c ON co.condition_concept_id = c.concept_id
WHERE co.person_id IN (
  -- Subquery to find all patients with a hypertension diagnosis
  SELECT person_id FROM condition_occurrence
  WHERE condition_concept_id = 319635 -- Standard concept for Hypertension
)
  -- Exclude the index condition itself, so the list shows comorbidities
  AND co.condition_concept_id <> 319635
GROUP BY c.concept_name
ORDER BY patient_count DESC
LIMIT 10;

This query first isolates the group of patients with hypertension and then simply counts up the other conditions they have.

Use Case 2: Complex Cohort Building

Simple counts are just the beginning. The real power of the OMOP Common Data Model shines when you start building intricate patient cohorts based on sequences of events. The model's design, with its clear domains and standardized concepts, is built for exactly this kind of temporal logic.

Let's say a clinician wants to study a drug's effect on kidney function. The research question might be: "Find all patients who started taking Metformin and subsequently had a Creatinine lab test within 90 days."

Answering this requires connecting different parts of the patient journey (drug exposures and lab measurements) while applying a specific time window. The standardized format makes this a reliable and repeatable process.

Conceptually, a SQL query to build this cohort would follow these steps:

  1. Identify the Cohort Entry Event: Find all DRUG_EXPOSURE records for Metformin using its standard RxNorm concept ID.
  2. Find the Follow-up Event: Look for MEASUREMENT records for a serum Creatinine test, identified by its standard LOINC concept ID.
  3. Apply Temporal Logic: Join the two event sets on person_id and filter for measurements where the measurement_date falls between 1 and 90 days after the drug_exposure_start_date.
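The three steps above, sketched in plain Python over toy rows. A real implementation would run this as SQL against DRUG_EXPOSURE and MEASUREMENT, keyed on the standard RxNorm and LOINC concept IDs:

```python
from datetime import date

# Step 1: cohort entry events (Metformin starts, already filtered by concept).
metformin_starts = [
    {"person_id": 1, "drug_exposure_start_date": date(2024, 1, 10)},
    {"person_id": 2, "drug_exposure_start_date": date(2024, 3, 1)},
]
# Step 2: follow-up events (serum Creatinine measurements).
creatinine_tests = [
    {"person_id": 1, "measurement_date": date(2024, 2, 5)},  # 26 days later
    {"person_id": 2, "measurement_date": date(2024, 8, 1)},  # outside the window
]

# Step 3: join on person_id and keep tests 1-90 days after the drug start.
cohort = set()
for drug in metformin_starts:
    for lab in creatinine_tests:
        if lab["person_id"] != drug["person_id"]:
            continue
        delta = (lab["measurement_date"] - drug["drug_exposure_start_date"]).days
        if 1 <= delta <= 90:
            cohort.add(drug["person_id"])

print(sorted(cohort))  # [1]
```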

This type of sequential query is the engine behind most observational research, enabling studies on everything from treatment pathways to disease progression.

Use Case 3: Population-Level Effect Estimation

The ultimate goal for many is population-level effect estimation, or what the OHDSI community calls a "PLE" study. This is where you compare the real-world outcomes of two or more treatments across a large population to estimate which is safer or more effective.

A classic research question might be: "Among new users of Drug A versus Drug B for atrial fibrillation, what is the relative risk of a major bleeding event?" Answering this isn't as simple as counting events; it requires sophisticated statistical methods to control for the biases inherent in observational data.

This is where the federated network model, powered by OMOP, truly excels. Because every database speaks the same language, researchers can write a single analytical package and distribute it to dozens of participating sites. The analysis runs locally on the data, and only the aggregated, anonymous results are sent back. This approach protects patient privacy while generating evidence from millions of patient lives, a core principle of the OHDSI community.
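The federated pattern can be sketched in a few lines; `local_analysis` is a hypothetical stand-in for the study package each site runs, and the rows are toy data:

```python
# The same analysis function is "shipped" to each site, runs against local
# rows, and only aggregate counts come back -- no patient-level data leaves.
def local_analysis(rows):
    exposed = sum(1 for r in rows if r["drug"] == "A")
    events = sum(1 for r in rows if r["drug"] == "A" and r["bleed"])
    return {"exposed": exposed, "events": events}

site_1 = [{"drug": "A", "bleed": True}, {"drug": "A", "bleed": False}]
site_2 = [{"drug": "A", "bleed": False}, {"drug": "B", "bleed": True}]

# The coordinating center only ever sees these aggregates.
totals = {"exposed": 0, "events": 0}
for site_rows in (site_1, site_2):
    result = local_analysis(site_rows)
    for key in totals:
        totals[key] += result[key]

print(totals)  # {'exposed': 3, 'events': 1}
```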

Frequently Asked Questions About OMOP

As teams start digging into the OMOP Common Data Model, a few practical questions almost always surface. Getting these right early on is key. This section tackles some of the most common stumbling blocks to help you get your implementation on solid ground, faster.

How Do I Choose the Correct Domain for a Source Code?

This is a fundamental part of the ETL mapping process, and the answer isn't found in your source data's table names. The Standard Vocabularies are your single source of truth here.

Your first step is to take a source code and find the standard concept it maps to. That standard concept will have a domain_id-like 'Condition', 'Drug', or 'Measurement'. This domain dictates which clinical event table the data belongs in. If a source code maps to a standard concept in the 'Condition' domain, for instance, that record goes into the CONDITION_OCCURRENCE table.

Tip: You can quickly find a concept's domain without writing any code. A tool like the OMOPHub Concept Lookup lets you paste in a code and instantly see its domain, which is a great way to sanity-check your mapping logic on the fly.

What Is the Difference Between an Observation and a Measurement?

This distinction can feel a bit fuzzy at first, but there’s a pretty clear line to draw. A Measurement is typically a structured finding with a numeric value and a unit. Think of a lab result ('Hemoglobin 14.5 g/dL') or a vital sign ('Systolic Blood Pressure 120 mmHg').

An Observation, on the other hand, is a catch-all for clinical facts that don't have a better home in another domain. These records are often qualitative or lack a specific value-and-unit structure. Examples include social history ('Family history of heart disease') or lifestyle factors ('Patient is a smoker'). When in doubt, ask yourself: does it have a distinct number and unit? If so, it’s almost certainly a Measurement.
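That rule of thumb can be written as a tiny triage function. This is a heuristic only; a production ETL should still defer to the standard concept's domain_id, as discussed above:

```python
# Heuristic triage: a distinct numeric value plus a unit suggests
# MEASUREMENT; otherwise the record is likely an OBSERVATION.
def triage(record: dict) -> str:
    has_number = isinstance(record.get("value"), (int, float))
    has_unit = bool(record.get("unit"))
    return "MEASUREMENT" if has_number and has_unit else "OBSERVATION"

print(triage({"name": "Hemoglobin", "value": 14.5, "unit": "g/dL"}))  # MEASUREMENT
print(triage({"name": "Family history of heart disease"}))           # OBSERVATION
```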

Can I Add Custom Tables or Columns to the OMOP CDM?

Yes, you can, but it’s critical to do it the right way. You should never alter the core model's tables or columns directly. Changing them breaks compatibility with the standardized analysis tools developed by the OHDSI community and makes it impossible to participate in network studies.

The officially recommended approach is to create your own separate, custom tables. You can then link these new tables back to the standard ones using shared identifiers like person_id or visit_occurrence_id. This strategy maintains the integrity of the core CDM while giving you the freedom to store extra data unique to your project. You can always review the official structures at the OMOP CDM Documentation.


Stop wrestling with vocabulary databases and start building. With OMOPHub, you get instant REST API access to all OHDSI ATHENA vocabularies, backed by developer-friendly SDKs for Python, R, and TypeScript. Accelerate your ETL and analytics work today at https://omophub.com.
