
Your Epic team finally gives you access. Research has a cohort definition ready. Analytics wants outcomes by next quarter. Then the substantive work begins.
Across health systems, Epic-to-OMOP projects rarely fail because data is unavailable. They stall because the data is inconsistent, the local codes are messy, and every shortcut taken early shows up later in validation. The hard part isn’t pulling rows out of Epic. The hard part is turning those rows into a trustworthy OMOP dataset that another analyst, another site, and another study can reuse without reinterpreting your ETL.
The teams that move fastest usually make one decision early. They treat vocabulary mapping as an engineering problem, not a spreadsheet exercise.
The Challenge of Unlocking Epic Data with OMOP
A common pattern looks like this. One health system stores diagnoses in expected places, another leans heavily on local build, and a third has years of reporting-layer conventions that nobody documented well. All three run Epic. None of them describe the same clinical event in exactly the same way.

That’s why OMOP has become the practical target model for multi-site research and reusable analytics. The OMOP Common Data Model started in 2008 and the OHDSI community adopted it in 2014, growing into a network with over 2,000 collaborators across 74 countries and standardized data for about 800 million unique patients, according to the OMOP CDM background from OMOPHub. In practice, that gives Epic teams a shared destination model with tables such as PERSON, CONDITION_OCCURRENCE, DRUG_EXPOSURE, and OBSERVATION_PERIOD, plus standard vocabularies like SNOMED CT, LOINC, and RxNorm.
Where Epic-to-OMOP projects really slow down
Extracting from Epic is important, but it’s not the long pole in the tent. Mapping is.
A medication can arrive with local identifiers. A diagnosis can carry source ICD coding plus a site-specific classification. Lab records may include a LOINC in one place, a local analyzer code in another, and a display name that changed after an upgrade. If you don’t normalize those consistently, your ETL might finish, but your OMOP instance won’t be analytically stable.
The industry has already shown that Epic-to-OMOP pipelines can scale in real environments. Integration examples include Stanford’s STARR-OMOP repository and OHDSI’s “OMOP Anywhere” work, and a 2023 OHDSI Symposium showcase highlighted near-real-time OMOP refreshes from Epic native tools. That same showcase notes that ETL effort can be reduced by up to 90% in some workflows through API-driven tooling, as described in the OHDSI 2023 showcase entry.
Practical rule: If your mapping logic lives only in analysts’ notebooks or spreadsheet tabs, you don’t have a production ETL. You have a temporary translation layer.
Why the bottleneck matters
When mapping is slow, every downstream activity slows with it:
- Phenotype development gets delayed because standard concept selection isn’t settled.
- Validation gets noisier because unmapped or incorrectly mapped rows pile up in domain tables.
- Incremental refreshes become risky because nobody wants to rerun a fragile mapping workflow every day.
- Cross-site comparisons break down because two institutions may transform similar Epic source records into different standard concepts.
The fix isn’t just “use OMOP.” The fix is to design your Epic-to-OMOP pipeline so mapping becomes testable, repeatable, and versioned like the rest of the ETL.
Architecting Your Epic to OMOP ETL Foundation
Most rework in Epic-to-OMOP comes from weak planning, not from hard SQL. If the source strategy is fuzzy, the rest of the build inherits that ambiguity.

Choose the right Epic source layer
For most implementations, the first architectural decision is whether your ETL anchors on reporting data or operational extracts. In practice, many teams gravitate toward Caboodle or Clarity-adjacent reporting structures because they’re more stable for shareable ETL logic than highly customized operational pathways.
Use this decision frame:
| Source option | Best fit | Watch for |
|---|---|---|
| Caboodle | Shared ETL patterns, repeatable analytics pipelines, easier institutional handoff | Reporting lag, local remodels, derived fields that hide source nuance |
| Clarity or direct source extracts | Detailed event fidelity, edge-case analysis, source reconciliation | More local variability, more custom joins, more brittle ETL over time |
If your team is still aligning these source layers, this walkthrough of the Epic Clarity data model is useful because it helps separate source-system reality from reporting assumptions.
Profile first, then design
OHDSI’s planning tools still hold up because they force discipline before coding. The reproducible methodology described in the Epic-to-OMOP implementation literature uses WhiteRabbit for source profiling and Rabbit-in-a-Hat for ETL specification. In production settings, that structured approach has yielded 97% to 98% data quality scores and reduced ETL runs to a few hours, according to the published Epic-to-OMOP methodology paper.
WhiteRabbit answers basic but essential questions:
- Which columns are populated
- Which values repeat enough to justify mapping rules
- Where local codes hide inside free-text or overloaded fields
- Which date fields can support event timing and observation logic
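A quick first pass at those questions can be scripted before a full WhiteRabbit run. The sketch below is plain Python over a staged extract; the column names and sample rows are hypothetical, and it complements WhiteRabbit rather than replacing it.

```python
# A minimal profiling sketch, assuming a staged extract loaded as a list of
# dicts. Field names ("source_code", "diagnosis_date") are illustrative.
from collections import Counter

def profile_column(rows, column):
    """Return fill rate and the most frequent values for one extract column."""
    values = [r.get(column) for r in rows]
    populated = [v for v in values if v not in (None, "")]
    fill_rate = len(populated) / len(values) if values else 0.0
    top_values = Counter(populated).most_common(5)
    return {"column": column, "fill_rate": fill_rate, "top_values": top_values}

# Tiny stand-in for a staged diagnosis extract.
sample = [
    {"source_code": "E11.9", "diagnosis_date": "2023-01-05"},
    {"source_code": "E11.9", "diagnosis_date": "2023-02-11"},
    {"source_code": "", "diagnosis_date": "2023-03-20"},
]
report = profile_column(sample, "source_code")
```

Even this crude fill-rate and frequency view answers the first two questions; the free-text and date-semantics questions still need WhiteRabbit’s deeper scan.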
Rabbit-in-a-Hat turns that profile into a visual contract. You map source columns to OMOP targets, note transformations, and expose missing pieces before anyone writes production SQL.
Good ETL specs remove debates from sprint meetings. They make assumptions visible while they’re still cheap to fix.
A six-step foundation that works
I’ve had the best results when teams formalize the design in this order:
1. Profile the source. Run WhiteRabbit on the actual Epic extract you’ll transform, not a sample somebody saved months ago.
2. Mark authoritative fields. Decide which diagnosis date, medication date, and encounter date will win when duplicates or alternatives exist.
3. Draw table-level mappings. Use Rabbit-in-a-Hat to document which Epic structures feed PERSON, VISIT_OCCURRENCE, CONDITION_OCCURRENCE, MEASUREMENT, and other core OMOP tables.
4. Separate structural mapping from vocabulary mapping. One answers where a field lands. The other answers which standard concept it becomes. Mixing them too early creates confusion.
5. Define reload behavior. Decide whether each target table is append-only, merge-based, or rebuilt in partitions.
6. Write validation criteria before coding. Agree on what counts as complete, conformant, and plausible before the first production load.
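Those validation criteria can be written as executable checks from day one instead of a prose document. The sketch below is illustrative only: the thresholds, field names, and check set are assumptions to adapt, not a standard.

```python
# A sketch of "validation criteria before coding": agreed completeness and
# plausibility rules expressed as functions the load can run automatically.

def check_observation_coverage(person_ids, observation_person_ids):
    """Every person in PERSON must have at least one observation period."""
    missing = set(person_ids) - set(observation_person_ids)
    return {"check": "observation_coverage", "passed": not missing,
            "missing": sorted(missing)}

def check_unmapped_rate(total_rows, unmapped_rows, threshold=0.05):
    """Unmapped source codes must stay under an agreed threshold."""
    rate = unmapped_rows / total_rows if total_rows else 0.0
    return {"check": "unmapped_rate", "passed": rate <= threshold, "rate": rate}

results = [
    check_observation_coverage([1, 2, 3], [1, 2, 3]),
    check_unmapped_rate(total_rows=1000, unmapped_rows=30),
]
```

The point is not these particular checks; it is that the acceptance bar exists in runnable form before the first production load.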
Teams that want a broader engineering pattern can also borrow ideas from this guide on building a solid data pipeline. The framing is general, but the core habits apply directly to Epic-to-OMOP work: explicit contracts, staged data, observability, and rerunnable jobs.
What not to do
Some choices almost always cost time later:
- Don’t start with domain tables only. If you haven’t modeled person identity, visit grain, and observation logic, your downstream domains won’t reconcile.
- Don’t trust local field names. “Diagnosis date” often means several different things depending on the build.
- Don’t let analysts define mappings only in slide decks. Put ETL rules where engineers can execute and diff them.
That foundation determines whether your project becomes reusable infrastructure or a one-off conversion.
Accelerating Vocabulary Mapping with OMOPHub
Most Epic-to-OMOP teams underestimate mapping because the first few examples look easy. One ICD-10 code maps cleanly. One LOINC resolves fast. Then the long tail arrives: deprecated codes, ambiguous local terms, mixed vocabularies, and source systems that changed naming conventions over time.

Traditional mapping methods break at scale
Spreadsheets are still common for first-pass mapping, especially when clinical reviewers want to inspect candidate concepts. They’re fine for triage. They’re poor as a system of record.
Usagi is useful for semi-automated mapping and remains part of many OHDSI workflows. But once you need repeatable runs, CI checks, release management, and programmatic reconciliation across institutions, local desktop tooling starts to show limits. You need mappings to behave like code.
The reason this matters goes beyond convenience. OMOP’s value depends on standardized vocabularies and reproducible concept use across data sources, with SNOMED CT, LOINC, and RxNorm acting as the backbone for consistent concept sets, as summarized in this OMOP concept and vocabulary overview.
The developer-first pattern
The more reliable pattern is simple:
- stage source codes as they appear in Epic
- call a vocabulary service programmatically
- store candidate concepts with status metadata
- review exceptions
- publish a versioned mapping table for ETL use
For quick debugging, a browser-based lookup is still handy. During mapping reviews, the OMOPHub Concept Lookup tool is useful for interactive checks before you encode the final rule.
The same workflow can be automated with the Python and R client libraries described in the OMOPHub documentation. OMOPHub provides a REST API for ATHENA vocabulary access, version handling, concept relationships, and cross-vocabulary lookup. In an Epic ETL, that means the mapping tier can run as a service instead of a manual side process.
Working advice: Don’t write standard concept IDs directly into transformation SQL from ad hoc analyst notes. Generate a mapping table and make every domain load join to it.
Python example for diagnosis mapping
A practical pattern is to build a source-to-standard mapping table before loading OMOP domains.
```python
from omophub import OMOPHub

client = OMOPHub(api_key="YOUR_API_KEY")

# Source codes as they appear in the Epic extract.
source_codes = [
    {"source_code": "E11.9", "vocabulary": "ICD10CM"},
    {"source_code": "I10", "vocabulary": "ICD10CM"},
    {"source_code": "R79.9", "vocabulary": "ICD10CM"},
]

mapping_rows = []
for item in source_codes:
    result = client.concepts.map_code(
        code=item["source_code"],
        vocabulary_id=item["vocabulary"],
    )
    # Keep the source code and status metadata next to the standard concept.
    mapping_rows.append({
        "source_code": item["source_code"],
        "source_vocabulary_id": item["vocabulary"],
        "standard_concept_id": result.get("standard_concept_id"),
        "standard_concept_name": result.get("standard_concept_name"),
        "mapping_status": result.get("status"),
        "valid_start_date": result.get("valid_start_date"),
        "valid_end_date": result.get("valid_end_date"),
    })

for row in mapping_rows:
    print(row)
```
What this pattern gets right:
- It preserves the source code so analysts can audit the original Epic value.
- It stores status metadata so deprecated or invalid mappings don’t go unnoticed.
- It externalizes mapping into a table that every ETL step can join consistently.
R example for lab vocabulary work
Labs often need more iteration because local build varies more than diagnoses. A staged R workflow keeps that manageable.
```r
library(omophub)

client <- omophub_client(api_key = "YOUR_API_KEY")

# Local lab codes staged from the Epic extract.
codes <- data.frame(
  source_code = c("4548-4", "17856-6"),
  vocabulary_id = c("LOINC", "LOINC"),
  stringsAsFactors = FALSE
)

# Resolve each code, keeping source values alongside the standard concept.
mapped <- lapply(seq_len(nrow(codes)), function(i) {
  res <- map_code(
    client = client,
    code = codes$source_code[i],
    vocabulary_id = codes$vocabulary_id[i]
  )
  data.frame(
    source_code = codes$source_code[i],
    source_vocabulary_id = codes$vocabulary_id[i],
    standard_concept_id = res$standard_concept_id,
    standard_concept_name = res$standard_concept_name,
    mapping_status = res$status,
    stringsAsFactors = FALSE
  )
})

mapping_table <- do.call(rbind, mapped)
print(mapping_table)
```
Build the review queue, not just the happy path
A strong mapping workflow doesn’t aim to automate every decision. It automates the repeatable decisions and isolates the ambiguous ones.
Use three buckets:
| Bucket | Action |
|---|---|
| Direct standard match | Accept and persist to mapping table |
| Deprecated or non-standard | Follow concept relationships or route for review |
| Local or ambiguous source term | Queue for analyst and clinical validation |
That queue is where many teams still use spreadsheets, and that’s fine, as long as the reviewed output gets written back into a durable mapping store used by ETL. The key is that your spreadsheet is a review surface, not the mapping engine.
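The three-bucket triage can be encoded so that only the last bucket ever reaches a human. This is a minimal sketch; the `standard_concept_id` and `status` fields mirror the hypothetical mapping responses used earlier in this article, and you should adapt them to whatever your vocabulary service actually returns.

```python
# Route each mapping result into one of the three buckets described above.

def triage(mapping_result):
    """Return accept / follow_relationships / manual_review for one result."""
    if mapping_result.get("standard_concept_id") and \
            mapping_result.get("status") == "standard":
        return "accept"  # direct standard match: persist to mapping table
    if mapping_result.get("status") in ("deprecated", "non-standard"):
        return "follow_relationships"  # try replacement / "Maps to" concepts
    return "manual_review"  # local or ambiguous: queue for clinical review

queue = [triage(r) for r in [
    {"standard_concept_id": 201826, "status": "standard"},
    {"standard_concept_id": None, "status": "deprecated"},
    {"standard_concept_id": None, "status": None},
]]
```

Only the `manual_review` bucket needs a spreadsheet-style review surface; the other two stay fully automated and auditable.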
Tips that save time
- Cache lookups by code and vocabulary version. Repeated diagnosis and lab codes shouldn’t trigger fresh resolution every run.
- Track why a mapping was chosen. Add reviewer notes or rule identifiers for non-obvious local mappings.
- Separate source code normalization from concept mapping. Clean whitespace, punctuation, and casing first.
- Retain source values in OMOP source fields. You’ll need them during validation and audit.
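Two of those tips, normalization before mapping and caching by code, can be sketched in a few lines. `resolve_concept` below is a stand-in for your real vocabulary-service call, and the tiny lookup table inside it is fake data for illustration.

```python
# Normalize source codes first, then cache lookups so repeated diagnosis and
# lab codes never trigger fresh resolution.
from functools import lru_cache

def normalize(code: str) -> str:
    """Keep normalization separate from mapping: trim, uppercase, drop trailing dots."""
    return code.strip().upper().rstrip(".")

@lru_cache(maxsize=50_000)
def resolve_concept(code: str, vocabulary_id: str):
    # Placeholder lookup; replace with a real API or database call.
    fake_vocab = {("E11.9", "ICD10CM"): 201826}
    return fake_vocab.get((code, vocabulary_id))

# Messy and clean variants of the same code resolve once; repeats hit the cache.
ids = [resolve_concept(normalize(c), "ICD10CM")
       for c in [" e11.9 ", "E11.9", "E11.9"]]
```

In a real pipeline you would key the cache on vocabulary release version as well, so a vocabulary refresh invalidates stale entries.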
When teams do this well, vocabulary work stops being a bottleneck owned by a few experts and becomes a controlled input to the ETL.
Implementing the Data Transformation Logic
Once the mapping table is stable, the ETL becomes much more mechanical. That’s where you want to be. Transformation logic should be explicit, rerunnable, and boring.
Derive observation periods carefully
OBSERVATION_PERIOD is one of the first places weak source assumptions show up. In Epic data, encounter history usually gives the best starting point, but it still needs rules.
At minimum, every patient in OMOP needs an observation period. OHDSI discussions around Epic-related implementations have reinforced that expectation, with coverage overlaps handled separately through structures like PAYER_PLAN_PERIOD rather than by skipping observation logic.
A practical SQL pattern looks like this:
```sql
insert into omop.observation_period (
    observation_period_id,
    person_id,
    observation_period_start_date,
    observation_period_end_date,
    period_type_concept_id
)
select
    nextval('omop.observation_period_seq'),
    p.person_id,
    min(v.visit_start_date) as observation_period_start_date,
    max(v.visit_end_date) as observation_period_end_date,
    44814724 as period_type_concept_id
from omop.person p
join staging.epic_visits v
    on p.person_source_value = v.patient_id
group by p.person_id;
```
This works as a baseline, but only if visit data is trustworthy enough to define periods. If your organization has sparse ambulatory history or imported legacy encounters, you may need to supplement with enrollment or registration events.
Keep the derivation rule simple enough to explain to a researcher. If you can’t describe why a patient’s observation starts on that date, the rule is too opaque.
Populate conditions from diagnosis and problem-list sources
For CONDITION_OCCURRENCE, many Epic environments pull from multiple source streams. Problem lists capture chronicity. Encounter diagnoses capture billed or assessed conditions. Historical imports add yet another flavor.
The ETL pattern that holds up is to stage each source independently, standardize date semantics, then union into a condition staging table before loading OMOP.
```sql
with diagnosis_stage as (
    select
        d.patient_id,
        d.diagnosis_date as condition_start_date,
        d.source_code,
        m.standard_concept_id as condition_concept_id,
        d.encounter_id,
        d.diagnosis_id as source_event_id
    from staging.epic_diagnoses d
    left join staging.code_map m
        on d.source_code = m.source_code
        and m.domain_id = 'Condition'
),
problem_stage as (
    select
        pl.patient_id,
        pl.noted_date as condition_start_date,
        pl.source_code,
        m.standard_concept_id as condition_concept_id,
        null as encounter_id,
        pl.problem_id as source_event_id
    from staging.epic_problem_list pl
    left join staging.code_map m
        on pl.source_code = m.source_code
        and m.domain_id = 'Condition'
)
insert into omop.condition_occurrence (
    condition_occurrence_id,
    person_id,
    condition_concept_id,
    condition_start_date,
    condition_source_value,
    visit_occurrence_id
)
select
    nextval('omop.condition_occurrence_seq'),
    p.person_id,
    coalesce(s.condition_concept_id, 0),
    s.condition_start_date,
    s.source_code,
    v.visit_occurrence_id
from (
    select * from diagnosis_stage
    union all
    select * from problem_stage
) s
join omop.person p
    on p.person_source_value = s.patient_id
left join omop.visit_occurrence v
    on v.visit_source_value = s.encounter_id;
```
Key design choice: use your staged mapping table, not inline vocabulary logic. That keeps transformation deterministic.
Handle drug exposure with source-aware rules
DRUG_EXPOSURE usually requires the most source nuance because Epic often stores orders, administrations, and fills differently. Medication orders tell you intent. MAR tells you what was administered. Refill or dispense-like data may exist, but with local variability.
For inpatient settings, MAR records often deserve precedence when both order and administration exist. For ambulatory settings, orders may be the only reliable source. The main mistake is collapsing all medication events into one generic rule.
Use this checklist:
- Orders: Good for prescribed exposure, weaker for actual administration.
- MAR: Best for inpatient administered drugs, but timing and stop logic can be tricky.
- Local med identifiers: Normalize early and join to a drug mapping table, ideally resolving to standard RxNorm concepts.
Common gotchas
- Null concept IDs sneaking through: Defaulting everything to zero without review hides mapping debt.
- Date inflation: Choosing order entry timestamps when the study needs clinical event dates.
- Duplicate events: The same condition or medication can appear through multiple Epic pathways. You need source precedence and dedup rules.
- Source-value loss: If you don’t retain original code and text, debugging later becomes guesswork.
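The duplicate-events gotcha usually resolves into an explicit source-precedence rule. The sketch below keeps one row per (patient, code, date), preferring the higher-precedence source; the precedence order, source labels, and field names are all illustrative assumptions.

```python
# Source precedence and dedup for condition events: the same diagnosis can
# arrive via encounter diagnoses, the problem list, and historical imports.

PRECEDENCE = {"encounter_dx": 0, "problem_list": 1, "historical_import": 2}

def dedup_conditions(rows):
    """Keep the highest-precedence row per (patient, code, start date)."""
    best = {}
    for row in rows:
        key = (row["patient_id"], row["source_code"],
               row["condition_start_date"])
        if key not in best or \
                PRECEDENCE[row["source"]] < PRECEDENCE[best[key]["source"]]:
            best[key] = row
    return list(best.values())

deduped = dedup_conditions([
    {"patient_id": 1, "source_code": "E11.9",
     "condition_start_date": "2023-01-05", "source": "problem_list"},
    {"patient_id": 1, "source_code": "E11.9",
     "condition_start_date": "2023-01-05", "source": "encounter_dx"},
])
```

Whatever the rule, write it down and keep the losing rows traceable; silent dedup is just another form of source-value loss.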
Good transformation code doesn’t try to be clever. It makes source assumptions visible and keeps every major clinical domain tied back to reviewed mappings.
Ensuring Data Integrity Through Validation
A successful load only proves the database accepted your inserts. It doesn’t prove the OMOP data is usable.

Start with characterization, then run rule-based checks
I treat validation in two passes. First, characterize the data with Achilles so the team can see population shape, domain volumes, and obvious anomalies. Second, run the Data Quality Dashboard for conformance, completeness, and plausibility checks.
For teams building frequent refreshes, a written validation runbook matters as much as the tools. This guide on data quality checking is a good reference point for structuring repeatable checks around source retention, mapping review, and post-load reconciliation.
Red flags worth checking every run
You don’t need a giant dashboard to catch the worst problems. Start with a short operational checklist.
- Patients without observation periods. If a patient appears in clinical domains but not in OBSERVATION_PERIOD, the ETL contract is broken.
- Events before birth or after death. These usually come from wrong source dates, timezone handling, or accidental joins.
- High unmapped source-code volume. Not every local code can be mapped automatically, but unmapped growth is an early warning sign.
- Visit links that disappear. If conditions, procedures, or drugs lose their encounter connection unexpectedly, your visit key strategy probably changed upstream.
- Sudden source-value drift. New Epic build changes often show up first as unfamiliar source values, not as ETL failures.
Validation should answer two questions: did the data load correctly, and did the source system change in a way your ETL didn’t expect?
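Two of those red flags can be caught with small per-load functions rather than a dashboard. This is a sketch under simple assumptions: dates are ISO-8601 strings (which compare correctly lexicographically) and the input shapes are illustrative.

```python
# Per-load red-flag checks: lifespan violations and missing observation periods.

def events_outside_lifespan(events, birth_dates, death_dates):
    """Flag events dated before birth or after death (bad dates or bad joins)."""
    flagged = []
    for e in events:
        pid = e["person_id"]
        if e["event_date"] < birth_dates.get(pid, "0000-00-00"):
            flagged.append((pid, e["event_date"], "before_birth"))
        elif pid in death_dates and e["event_date"] > death_dates[pid]:
            flagged.append((pid, e["event_date"], "after_death"))
    return flagged

def persons_missing_observation(domain_person_ids, observation_person_ids):
    """Patients with clinical events but no OBSERVATION_PERIOD row."""
    return sorted(set(domain_person_ids) - set(observation_person_ids))

flags = events_outside_lifespan(
    [{"person_id": 1, "event_date": "1949-06-01"}],
    birth_dates={1: "1950-01-01"},
    death_dates={},
)
```

Checks like these answer the first question (did the data load correctly); trend comparisons against the previous load answer the second.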
Daily updates need a tighter feedback loop
One under-covered part of Epic-to-OMOP work is handling daily or near-real-time refreshes from Epic’s native ETL pathways. The gap isn’t whether it’s feasible. It’s that few public resources spell out operational benchmarks and troubleshooting for that cadence. A recent summary of that gap notes growing interest in daily and near-real-time workflows, and points to API-driven vocabulary synchronization with sub-50ms queries as a way to avoid ETL bottlenecks in high-frequency update patterns, as discussed in this PubMed-indexed overview of the implementation gap.
That changes how validation should work. For monthly refreshes, a long manual review cycle may be acceptable. For daily updates, it isn’t. You need smaller checks that run every load and larger audits that run on a schedule.
A practical split looks like this:
| Validation cadence | What to check |
|---|---|
| Every incremental load | row deltas, duplicate detection, unmapped-code counts, failed joins |
| Daily operational review | domain-level trend shifts, source drift, stale extracts |
| Scheduled deep audit | Achilles characterization, DQD review, reviewer follow-up on mapping exceptions |
What works better than heroic cleanup
The cleanest validation programs do three things well:
- Keep raw source payloads or staged extracts available so investigators can trace a suspicious row back to source.
- Version mapping decisions so changes in standard concept assignment are explainable.
- Fail fast on structural defects instead of accepting bad loads and planning to fix them later.
If your validation process depends on one expert spotting anomalies by intuition, it won’t survive production operations.
Optimizing Performance and Ensuring Compliance
Production-grade Epic-to-OMOP pipelines need two qualities at the same time. They have to run fast enough for frequent refreshes, and they have to leave an audit trail clear enough for regulated environments.
Performance comes from boundaries, not tricks
The fastest ETLs usually make one architectural separation early. They decouple extraction from mapping.
For FHIR-enabled Epic-to-OMOP ETL, that pattern has been described as extracting valid FHIR bundles into a replayable raw layer, mapping vocabularies separately, then loading OMOP with reconciliation checks. In that approach, teams have achieved 95% to 98% data quality, incremental loads completing in hours, and 99% idempotency, with guidance to use abstracted vocabulary services for lookups under 50ms, according to the FHIR Epic ETL methodology from OMOPHub.
That separation matters for performance because each layer solves a different problem:
- Extraction layer handles source pagination, timestamps, retries, and raw retention.
- Mapping layer resolves codes, vocabulary versions, and standard concept assignment.
- Load layer merges or repairs OMOP records deterministically.
When teams blend those concerns into one script, debugging gets slow. Reruns get dangerous. Every source-system hiccup becomes a vocabulary problem and every vocabulary issue looks like an extraction defect.
Practical tuning patterns
These habits consistently pay off:
- Use incremental loads by source watermark. Don’t rebuild the whole OMOP instance if only a small slice of Epic changed.
- Cache concept lookups. Common diagnosis, procedure, and drug codes repeat constantly. Resolve once, reuse many times.
- Design idempotent merges. A rerun should repair the same batch, not create duplicate domain records.
- Partition heavy staging tables. Large event tables become much easier to validate and reload when partitioned by date or extract batch.
A rerunnable ETL is faster than a theoretically efficient ETL that nobody trusts enough to rerun.
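The watermark pattern is simple enough to sketch: track the newest source timestamp already loaded, pull only rows after it, and advance the mark deterministically. How the watermark is persisted (a control table, a state file) is up to your pipeline; the timestamps and row shapes below are illustrative.

```python
# Watermark-driven incremental extraction: each run processes only source
# rows newer than the last loaded timestamp.

def rows_since(rows, watermark):
    """Select only source rows updated after the watermark."""
    return [r for r in rows if r["updated_at"] > watermark]

def advance_watermark(rows, watermark):
    """New watermark = max timestamp seen, so a rerun of the same batch is a no-op."""
    return max([r["updated_at"] for r in rows] + [watermark])

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
]
batch = rows_since(source, "2024-01-01T12:00:00")
new_mark = advance_watermark(batch, "2024-01-01T12:00:00")
```

Pair this with an idempotent merge on the load side and reruns become routine instead of risky.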
Compliance has to be designed into the pipeline
HIPAA and GDPR concerns usually surface late, but the key controls belong in the original architecture.
A compliant pattern for omop epic work usually includes:
| Area | Design choice |
|---|---|
| Auditability | retain raw extracts or payloads, log mapping versions, record reruns |
| Minimum necessary use | limit who can access source-identifiable staging layers |
| Vocabulary traceability | record which terminology version informed each mapping release |
| Replay support | keep enough source history to reproduce a load during audit or investigation |
Local scripts and untracked spreadsheet mappings may work once, but they don’t satisfy a serious audit.
FHIR is useful, but only with discipline
Epic FHIR can simplify extraction contracts because the source interface is more standardized than many direct database pathways. It also gives teams a practical route toward more frequent updates. But FHIR doesn’t eliminate the hard parts. You still need replay-safe extraction, explicit concept mapping, and post-load reconciliation.
What works:
- staging raw FHIR bundles first
- validating source capability before coding assumptions
- preserving source values alongside standard OMOP concepts
- testing reruns against the same bundle set
What doesn’t:
- mapping inline during extraction
- discarding raw payloads after load
- assuming one Epic environment behaves exactly like another
- treating idempotency as optional
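Staging raw bundles first can be sketched in a few lines: persist each payload keyed by content hash before any mapping runs, so a replay transforms the same stored payloads instead of re-calling the source. The in-memory dict below is a stand-in for durable storage, and the bundle shape is a trivial example.

```python
# Replay-safe FHIR staging: store the raw bundle, hashed, before mapping.
import hashlib
import json

RAW_STORE = {}

def stage_bundle(bundle: dict, batch_id: str) -> str:
    """Persist the raw bundle keyed by content hash; idempotent on replays."""
    payload = json.dumps(bundle, sort_keys=True)  # canonical serialization
    key = hashlib.sha256(payload.encode()).hexdigest()
    RAW_STORE.setdefault((batch_id, key), payload)
    return key

bundle = {"resourceType": "Bundle",
          "entry": [{"resource": {"resourceType": "Condition"}}]}
k1 = stage_bundle(bundle, "batch-001")
k2 = stage_bundle(bundle, "batch-001")  # replay: same key, no duplicate row
```

Because staging is content-addressed and idempotent, a rerun against the same batch cannot double-load, which is exactly the property the load layer needs downstream.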
A strong production pipeline doesn’t chase elegance. It enforces boundaries, keeps evidence, and makes every vocabulary decision reproducible.
If your team is trying to reduce the vocabulary-mapping bottleneck in an Epic-to-OMOP build, OMOPHub is one option for accessing OMOP vocabularies through an API instead of maintaining a local vocabulary database. That can fit well when you want programmatic mapping, version-aware terminology access, and a workflow that’s easier to embed in repeatable ETL jobs.


