SDTM to OMOP: A Practical ETL Guide for Clinical Data

Dr. Lisa Martinez
June 17, 2026

You probably have this problem right now. Your trial data is clean, reviewed, and locked in SDTM, but the moment someone asks for cross-study analytics, phenotype work, or reuse inside an OMOP pipeline, everything slows down. The data isn't unusable. It's trapped in a model built for submission rather than broad analysis.

That gap frustrates both sides. Clinical teams assume the hard part is done because the trial database is already standardized. Data engineers know the actual work starts after that, because SDTM to OMOP isn't a file conversion. It's a reconstruction job. You have to preserve timing, semantics, and provenance while moving from a regulatory tabulation model into an analytical one.

The teams that do this well treat the ETL as research infrastructure, not just engineering plumbing. They define rules early, encode them explicitly, validate them like scientific assumptions, and keep vocabulary mapping under tight control.

Why Bridge the Gap Between SDTM and OMOP

A four-step infographic illustrating the process of bridging SDTM clinical trial data into the OMOP standard.

SDTM and OMOP solve different problems. That's the first thing teams need to accept.

SDTM is optimized for regulatory submission. It organizes trial data into domains and variables that are consistent, reviewable, and familiar to regulators. That structure is valuable, but it isn't designed as the easiest substrate for network analysis, phenotype execution, or cross-dataset evidence generation.

OMOP CDM is built for harmonized analytics across institutions and studies. It expects standardized concepts, stable domains, and event-level data that can support repeatable analytical workflows. That difference in design philosophy is exactly why conversion matters.

In July 2020, the OMOP Clinical Trials Working Group proposed formal conventions for converting CDISC SDTM into the OMOP Common Data Model. The proposal explicitly framed the work as a use case for trial planning optimization and as a move from a regulatory-oriented standard toward an analytical model for large-scale evidence generation, as described in the OMOP clinical trials data conventions proposal.

What changes after conversion

Once trial data lands in OMOP with enough fidelity, researchers can reuse it in the same analytical ecosystem they already use for observational data. That changes the economics of secondary use.

Instead of rebuilding study-specific extracts every time, teams can:

  • Run shared analytics across trial and non-trial data using common OMOP tooling.
  • Develop phenotypes against a standardized vocabulary layer rather than a study-specific code set.
  • Compare studies more consistently because the target model imposes stable analytical structure.
  • Support planning work by reusing historical trial data in a format suited to broader analysis.

A lot of organizations underestimate this point. They think the value is interoperability alone. In practice, the bigger win is analytical reuse.

Trial data in SDTM is already standardized. It just isn't standardized for the kind of evidence work most OMOP teams want to do.

Why the bridge is strategic, not cosmetic

A good SDTM to OMOP pipeline doesn't replace SDTM. It creates a second, analytics-ready representation of the same trial evidence. That distinction matters because you're not trying to outsmart the submission standard. You're trying to make it usable in a different operational context.

If your stakeholders still see this as a side project, point them to practical examples of OMOP for clinical trials. The technical effort is real, but the reason to do it is simple: valuable trial data shouldn't stay isolated after submission work is done.

The Foundation for a Successful ETL Project

A six-step infographic detailing the strategic planning process for converting SDTM clinical trial data to OMOP.

Most SDTM to OMOP failures happen before the first production run. The code usually isn't the root problem. The planning is.

Teams get into trouble when they assume domain mapping is mostly mechanical. It isn't. You need a source-to-target specification that captures not only table mappings, but timing assumptions, vocabulary strategy, provenance handling, and what you'll do when SDTM carries context that OMOP standard tables don't represent directly.

Start with a mapping specification, not SQL

The mapping document should be detailed enough that a second engineer could implement the pipeline without interviewing the original author. If that sounds strict, good. Trial ETL needs that level of auditability.

A practical specification usually needs these components:

  • Scope boundaries. Define which SDTM domains are in scope for the first release and which are deferred.
  • Target table decisions. Record where each SDTM domain or variable lands in OMOP, including any exceptions.
  • Vocabulary rules. State how source terms, coded values, and unmapped values will be resolved.
  • Context preservation. Decide what goes into custom concepts, auxiliary fields, or relationship tables.
  • Validation expectations. Name the checks required before data is released for analysis.

This document becomes the contract between clinical operations, data engineering, and analytics. Without it, every discrepancy turns into a meeting.
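
One way to keep that contract reviewable is to hold the specification as structured data rather than free text. The sketch below is illustrative only: the `MappingRule` fields, the example rows, and the check names are assumptions, not a published OHDSI or CDISC format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """One row of a source-to-target specification (illustrative fields)."""
    sdtm_domain: str        # e.g. "AE"
    sdtm_variable: str      # e.g. "AESTDTC"
    omop_table: str         # e.g. "condition_occurrence"
    omop_field: str         # e.g. "condition_start_date"
    rule: str               # plain-language transformation rule
    in_scope: bool = True   # False means deferred to a later release
    validation: tuple = ()  # names of checks required before release


# Hypothetical example rows; a real specification would be far longer.
SPEC = [
    MappingRule("DM", "USUBJID", "person", "person_source_value",
                "copy as-is", validation=("not_null", "unique")),
    MappingRule("AE", "AESTDTC", "condition_occurrence", "condition_start_date",
                "impute partial dates per study decision table",
                validation=("not_before_enrollment",)),
    MappingRule("EX", "EXDOSE", "drug_exposure", "quantity",
                "deferred to release 2", in_scope=False),
]


def domains_in_scope(spec):
    """List the SDTM domains covered by the current release."""
    return sorted({r.sdtm_domain for r in spec if r.in_scope})


def unvalidated_rules(spec):
    """Return in-scope rules that name no validation check at all."""
    return [r for r in spec if r.in_scope and not r.validation]
```

A specification held this way can be diffed between releases, and a gap such as an in-scope rule with no named check becomes a query result instead of a meeting topic.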

Dates are where pipelines break

A practical SDTM to OMOP ETL must reconstruct missing temporal precision from partial dates. OHDSI guidance notes that teams often impute incomplete dates using anchors such as study enrollment or treatment start. It also emphasizes that there are no universal guidelines, so each study needs documented local rules, as shown in the OHDSI SDTM to OMOP transformation guidance.

That one issue drives a surprising amount of downstream risk. If one study imputes an adverse event to the first day of the month and another uses enrollment day, your cohort logic, exposure windows, and outcome timing can diverge in ways that look like real analytical differences when they are actually artifacts of ETL choices.

Practical rule: Never hide date imputation inside transformation code without exposing the rule in documentation and QA outputs.

Use a decision table. Keep it versioned. Make it study-specific when needed.
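
A decision table like that can drive the imputation directly. The following is a minimal sketch, assuming SDTM `--DTC` values arrive as ISO 8601 strings that may be truncated to year or year-month. The rule names and the `STUDY_RULES` table are hypothetical; the point is that every imputed date comes back with the name of the rule that produced it.

```python
from datetime import date

# Hypothetical study-level decision table; rule names are illustrative.
STUDY_RULES = {"missing_day": "anchor_if_same_month", "missing_month": "jan_first"}


def impute_partial_date(dtc, anchor, rules):
    """Return (date_or_None, rule_label) for an ISO 8601 --DTC string.

    `anchor` is a complete date such as enrollment or treatment start.
    Anything that cannot be parsed is returned as unresolved, not guessed.
    """
    date_part = (dtc or "").split("T")[0]  # drop any time component
    parts = date_part.split("-") if date_part else []
    if not parts or not all(p.isdigit() for p in parts):
        return None, "unresolved"
    nums = [int(p) for p in parts]
    try:
        if len(nums) == 3:
            return date(*nums), "exact"
        if len(nums) == 2:
            year, month = nums
            same_month = (anchor.year, anchor.month) == (year, month)
            if rules["missing_day"] == "anchor_if_same_month" and same_month:
                return anchor, "missing_day:anchor"
            return date(year, month, 1), "missing_day:first_of_month"
        if len(nums) == 1:
            year = nums[0]
            if rules["missing_month"] == "anchor_if_same_year" and anchor.year == year:
                return anchor, "missing_month:anchor"
            return date(year, 1, 1), "missing_month:jan_first"
    except ValueError:  # impossible calendar values such as month 13
        pass
    return None, "unresolved"
```

Writing the returned rule label to an audit table is what makes a date imputation summary possible later in QA.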

A planning checklist that actually helps

Before building the pipeline, force answers to these questions:

  1. What analysis will this OMOP dataset support first? Trial planning, safety review, cross-trial comparison, phenotype development, and general-purpose reuse don't all require the same level of fidelity.
  2. Which timing fields are trustworthy? Distinguish exact dates, partial dates, visit-relative dates, and derived dates.
  3. What source metadata must survive? Study identifiers, protocol context, severity, outcome, and dose-change rationale are common examples.
  4. Where will ambiguity live? If a value can't be mapped cleanly, decide whether you'll preserve it in source fields, custom concepts, or relationship structures.
  5. How will reviewers inspect the ETL? If the answer is “read the code,” you haven't made it auditable enough.

Build for repeatability from day one

A one-off script might get a study loaded. It won't give you a sustainable clinical trial ETL program.

What works better is a rule-driven pipeline with explicit configuration layers for study-specific behaviors. Keep general transformation logic separate from per-study exceptions. Store mapping tables outside compiled code. Log every imputation and fallback path. If a reviewer asks why a condition start date looks the way it does, you should be able to answer from metadata, not memory.
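
As a rough illustration of keeping shared logic apart from per-study exceptions, the sketch below layers study overrides on top of shared defaults and records which keys were overridden. The configuration keys and the study identifier are assumptions made for the example.

```python
from copy import deepcopy

# Shared defaults for every study (illustrative keys).
DEFAULTS = {
    "date_rules": {"missing_day": "first_of_month", "missing_month": "jan_first"},
    "domains": ["DM", "AE", "LB", "VS"],
    "unmapped_concept_id": 0,  # OMOP convention for "no matching concept"
}

# Per-study exceptions live in configuration, not in forked code.
STUDY_OVERRIDES = {
    "STUDY-001": {"date_rules": {"missing_day": "anchor_if_same_month"}},
}


def resolve_config(study_id, defaults=DEFAULTS, overrides=STUDY_OVERRIDES):
    """Merge study overrides onto defaults without mutating either."""
    config = deepcopy(defaults)
    applied = []
    for key, value in overrides.get(study_id, {}).items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)  # merge nested rule tables
        else:
            config[key] = value        # replace scalar or list settings
        applied.append(key)
    config["_overrides_applied"] = sorted(applied)  # visible in run metadata
    return config
```

Because the override list is part of the resolved configuration, a reviewer can see from run metadata that a study deviated from the defaults.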

Core ETL Domain Mapping Patterns and Code

The fastest way to make a bad SDTM to OMOP pipeline is to force one-to-one mappings everywhere. Some domains map cleanly enough to start. Others only work if you preserve context outside the obvious target table.

The OHDSI Clinical Trial Working Group has emphasized the need for general SDTM-to-OMOP guidance that treats the work as a rule-based, auditable transformation, one that preserves analyzable temporal ordering and semantic fidelity for cross-trial analytics, as described in the OHDSI Europe clinical trials poster.

DM to PERSON and OBSERVATION_PERIOD

The DM domain is usually your cleanest entry point. It gives you subject identity, demographic attributes, and often the best enrollment anchor available for setting observation start.

A typical pattern looks like this:

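-- Placeholder helpers: map_*() and extract_*() stand for your own UDFs or
-- lookup joins. They are not built-in SQL functions.
insert into person (
  person_id,
  gender_concept_id,
  year_of_birth,
  month_of_birth,
  day_of_birth,
  race_concept_id,
  ethnicity_concept_id,
  person_source_value
)
select
  subject_key as person_id,                         -- surrogate key generated by the ETL
  map_gender(sex) as gender_concept_id,
  extract_year(brthdtc) as year_of_birth,           -- brthdtc may be a partial date
  extract_month(brthdtc) as month_of_birth,         -- null when the month is missing
  extract_day(brthdtc) as day_of_birth,             -- null when the day is missing
  map_race(race) as race_concept_id,
  map_ethnicity(ethnic) as ethnicity_concept_id,
  usubjid as person_source_value                    -- keeps the path back to SDTM
from sdtm_dm;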

That's straightforward, but the trap is assuming DM alone defines the full person timeline. In practice, many teams use enrollment or study participation anchors to populate OBSERVATION_PERIOD, and they document exactly which source fields establish the boundaries.
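
A minimal sketch of that boundary logic follows. It assumes the DM reference variables RFICDTC, RFSTDTC, RFPENDTC, and RFENDTC are available, and the preference order among them is an assumption that each study should confirm. The `boundary_sources` entry is audit metadata, not an OMOP column, and `period_type_concept_id` is passed in because its value depends on your vocabulary conventions.

```python
from datetime import datetime


def build_observation_period(dm_row, person_id, period_type_concept_id):
    """Derive one OBSERVATION_PERIOD row from a DM record, or None."""

    def first_complete_date(*variables):
        # Return the first candidate variable holding a full YYYY-MM-DD date.
        for name in variables:
            text = (dm_row.get(name) or "")[:10]
            try:
                return datetime.strptime(text, "%Y-%m-%d").date(), name
            except ValueError:
                continue  # missing or partial date: try the next candidate
        return None, None

    start, start_source = first_complete_date("RFICDTC", "RFSTDTC")
    end, end_source = first_complete_date("RFPENDTC", "RFENDTC")
    if start is None or end is None or end < start:
        return None  # send to an exception queue rather than guessing
    return {
        "person_id": person_id,
        "observation_period_start_date": start,
        "observation_period_end_date": end,
        "period_type_concept_id": period_type_concept_id,
        "boundary_sources": (start_source, end_source),
    }
```

Recording which variable supplied each boundary lets QA report how often the fallback fields were used.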

AE to CONDITION_OCCURRENCE

AE often maps into CONDITION_OCCURRENCE, but only after you settle three hard questions: what date to use, how to map the event term, and where to keep attributes such as severity or outcome.

-- Placeholder helpers throughout; aeterm_code stands for whichever coded AE
-- term the study provides (for example a MedDRA code variable).
insert into condition_occurrence (
  condition_occurrence_id,
  person_id,
  condition_concept_id,
  condition_start_date,
  condition_end_date,
  condition_type_concept_id,
  condition_source_value
)
select
  next_id() as condition_occurrence_id,
  subject_key as person_id,
  resolve_standard_condition(aeterm_code) as condition_concept_id,
  -- partial dates are imputed by documented, study-specific rules
  derive_event_start(aestdtc, enrollment_date, treatment_start_date) as condition_start_date,
  derive_event_end(aeendtc, aestdtc, enrollment_date) as condition_end_date,
  trial_event_type_concept() as condition_type_concept_id,
  aeterm as condition_source_value                  -- verbatim reported term
from sdtm_ae;

What doesn't work is stuffing every AE attribute into a single OMOP row and pretending nothing was lost. Severity, action taken, outcome, and causality often need supporting structures. In many projects, the right answer is to preserve those attributes separately and link them back rather than flattening them away.

If the standard table loses clinically important nuance, keep the nuance and create the linkage. Don't reward a neat schema at the expense of the analysis.
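
One common linkage pattern is to write each AE attribute as an OBSERVATION row and connect it to the condition record through FACT_RELATIONSHIP. The sketch below assumes that pattern. Every concept ID in `PLACEHOLDER_CONCEPTS` is a placeholder to be resolved against your vocabulary release, and required OBSERVATION columns such as the date and type concept are omitted for brevity.

```python
from itertools import count

# SDTM AE qualifiers to preserve: severity, outcome, action taken, causality.
AE_ATTRIBUTES = ("AESEV", "AEOUT", "AEACN", "AEREL")

# Placeholder IDs only; resolve real values from your vocabulary tables.
PLACEHOLDER_CONCEPTS = {
    "condition_domain": 0,
    "observation_domain": 0,
    "relationship": 0,
    "attribute": {},  # e.g. {"AESEV": <concept id for AE severity>}
}


def link_ae_attributes(ae_row, condition_occurrence_id, person_id,
                       next_id, concept_ids=PLACEHOLDER_CONCEPTS):
    """Return (observation_rows, fact_relationship_rows) for one AE record."""
    observations, links = [], []
    for variable in AE_ATTRIBUTES:
        value = ae_row.get(variable)
        if not value:
            continue  # nothing collected for this qualifier
        observation_id = next_id()
        observations.append({
            "observation_id": observation_id,
            "person_id": person_id,
            "observation_concept_id": concept_ids["attribute"].get(variable, 0),
            "value_as_string": value,
            "observation_source_value": variable,
        })
        links.append({
            "domain_concept_id_1": concept_ids["condition_domain"],
            "fact_id_1": condition_occurrence_id,
            "domain_concept_id_2": concept_ids["observation_domain"],
            "fact_id_2": observation_id,
            "relationship_concept_id": concept_ids["relationship"],
        })
    return observations, links


# Usage: ids = count(1); link_ae_attributes(row, 500, 42, lambda: next(ids))
```

The condition row stays simple, and an analyst who needs severity or causality can join back through the relationship rows.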

LB and VS to MEASUREMENT

LB and VS usually land in MEASUREMENT because they represent quantitative or categorical assessments tied to a date and often a unit.

-- Placeholder helpers throughout. measurement_type_concept_id is a required
-- OMOP column, so it is populated here the same way as in the AE example.
insert into measurement (
  measurement_id,
  person_id,
  measurement_concept_id,
  measurement_date,
  measurement_type_concept_id,
  value_as_number,
  value_as_concept_id,
  unit_concept_id,
  measurement_source_value,
  unit_source_value
)
select
  next_id() as measurement_id,
  subject_key as person_id,
  resolve_standard_measurement(lbtestcd, lbtest) as measurement_concept_id,
  derive_measurement_date(lbdtc, visit_date, enrollment_date) as measurement_date,
  trial_event_type_concept() as measurement_type_concept_id,
  parse_numeric_result(lbstresn) as value_as_number,   -- standardized numeric result
  map_qual_result(lbstresc) as value_as_concept_id,    -- categorical results only
  resolve_unit(lbstresu) as unit_concept_id,
  lbtestcd as measurement_source_value,
  lbstresu as unit_source_value
from sdtm_lb;

Vital signs follow the same shape. The primary effort centers on unit normalization, categorical result handling, and preserving the visit context when protocol timing matters to later analysis.
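
Unit normalization is easier to audit when the conversion factors sit in a table rather than in scattered expressions. A small sketch under stated assumptions: the choice of canonical units and the unit spellings are illustrative, while the pound and inch factors are the exact defined values.

```python
# Canonical unit and multiplicative factors per vital-sign test code.
# The canonical choices and unit spellings here are assumptions.
CANONICAL = {
    "WEIGHT": ("kg", {"kg": 1.0, "g": 0.001, "LB": 0.45359237}),
    "HEIGHT": ("cm", {"cm": 1.0, "m": 100.0, "in": 2.54}),
}


def normalize_vs(testcd, value, unit):
    """Return (converted_value, canonical_unit), or None if unsupported.

    None means "route this record to an exception queue", never "assume
    the value is already in the canonical unit".
    """
    entry = CANONICAL.get(testcd)
    if entry is None:
        return None
    canonical_unit, factors = entry
    if unit not in factors:
        return None
    return round(value * factors[unit], 6), canonical_unit
```

Tests that need offsets or analyte-specific factors, such as temperature or molar lab units, do not fit a purely multiplicative table and would need their own rule functions.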

Three mapping habits that hold up in production

  • Prefer reusable rule functions. Date derivation, concept resolution, and unit normalization should be centralized. Don't duplicate logic across domain scripts.
  • Keep source values visible. Even after mapping to standard concepts, retain source fields where OMOP allows it. Reviewers need a path back to the original trial language.
  • Model exceptions deliberately. Unsupported or ambiguous cases should go to exception tables or review queues, not disappear in silent nulls.

That combination is what makes the ETL defensible. Not clever SQL.

Mastering Vocabulary and Concept Mapping

Vocabulary work is where most timelines slip. The mapping logic itself is hard enough, but the operational burden is what usually gets ignored. Teams download ATHENA files, stand up a local database, load vocabulary tables, maintain release cycles, and then write custom SQL joins for every concept resolution path.

That can work. It's also why vocabulary mapping becomes a bottleneck instead of a shared service.

The old workflow versus the API workflow

Here's the practical difference.

| Approach | What you manage | What slows teams down |
| --- | --- | --- |
| Local ATHENA setup | Vocabulary download, database load, refresh process, SQL mapping logic | Release maintenance, environment drift, duplicated lookup code |
| API-based terminology access | Application calls, caching strategy, audit of lookup outcomes | Dependency management, request design, handling fallback rules |

For SDTM to OMOP, the pain shows up in repeated tasks. AE terms, LB tests, VS tests, units, source codes, and custom study values all need resolution paths. If your engineers have to handcraft those lookups repeatedly, the ETL becomes fragile.

This screenshot gives a simple example of the kind of lookup workflow teams use during mapping review:

Screenshot from https://omophub.com/tools/concept-lookup

For manual review, a browser tool such as the OMOP concept lookup interface is useful. For pipelines, you want the same logic available programmatically.

A practical API pattern

One option is OMOPHub, which exposes REST and FHIR terminology operations over OHDSI vocabularies and can resolve a code to an OMOP standard concept and target CDM table. That's relevant in SDTM to OMOP work because it reduces the amount of local vocabulary infrastructure you need to build and maintain. Teams evaluating terminology workflows may also want the vendor's overview of OMOP concept mapping patterns.

A minimal example for resolving a code in Python looks like this:

import requests

url = "https://api.omophub.com/v1/fhir/resolve"
headers = {
    "Authorization": "Bearer oh_your_api_key",
    "Content-Type": "application/json",
}
payload = {
    "system": "http://snomed.info/sct",
    "code": "44054006",
    "resource_type": "Condition"
}

response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()

data = response.json()
print(data)

That example is aligned with the product's published API pattern and is enough to test concept resolution during ETL development. If you prefer SDKs, there are maintained clients for Python and R, plus an MCP server. The broader API and FHIR surface is documented in the OMOPHub developer docs.

What actually works for mapping governance

Don't rely on live lookups alone. Even if you use an API, freeze the mapping outputs used for a study release. A sound process usually includes:

  • Approved mapping tables for study-specific terms that have already been reviewed.
  • Fallback workflows for codes that don't resolve cleanly to a standard concept.
  • Version capture so you know which vocabulary state supported each release.
  • Human review queues for ambiguous values, especially adverse event terminology and local lab naming.
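
A frozen mapping step along these lines might look like the sketch below. The approved table, its concept IDs, and the version label are all hypothetical; the only OMOP convention relied on is that concept ID 0 marks an unmapped value.

```python
def apply_frozen_mappings(source_terms, approved, vocabulary_version):
    """Map terms from a reviewed, frozen table and queue everything else.

    `approved` maps normalized source terms to concept IDs that reviewers
    have already signed off on. No live lookup happens here.
    """
    mapped, review_queue = [], []
    for term in source_terms:
        key = term.strip().upper()
        concept_id = approved.get(key)
        if concept_id is None:
            review_queue.append({"source_value": term,
                                 "reason": "no approved mapping"})
            concept_id = 0  # explicit "unmapped", not a silent null
        mapped.append({
            "source_value": term,
            "concept_id": concept_id,
            "vocabulary_version": vocabulary_version,  # captured per row
        })
    return mapped, review_queue


# Hypothetical approved table and version label for illustration.
APPROVED = {"HEADACHE": 1001, "NAUSEA": 1002}
rows, queue = apply_frozen_mappings(["Headache ", "Dizzy spell"], APPROVED, "release-A")
```

In that usage example the second term ends up both in the output with concept ID 0 and in the review queue, so "unmapped" has one documented meaning.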

Field note: The best vocabulary process feels boring. Engineers know where mappings come from, reviewers know where to challenge them, and analysts don't have to guess what “unmapped” means.

Ensuring Fidelity with Validation and QA

An infographic showing six steps for validating SDTM to OMOP ETL processes to ensure high data quality.

Validation is where SDTM to OMOP becomes either a trustworthy asset or a dangerous convenience. Many teams stop after row counts, null checks, and successful loads. That's necessary, but it isn't enough for clinical research.

A 2025 study in Patterns compared statistical analyses run on original SDTM data with analyses on transformed OMOP-CDM data to assess information loss. It shows that the benchmark for success has moved beyond load completion to preservation of statistical utility, as reported in the Patterns study on SDTM and OMOP statistical utility.

Validate in layers

Think about QA in three layers: structural, semantic, and analytical.

Structural checks answer basic integrity questions.

  • Row reconciliation. Compare source and target counts where a near-direct mapping is expected.
  • Field conformance. Confirm date types, numeric fields, and required OMOP columns are populated correctly.
  • Referential integrity. Verify person links and event references survive the transformation.
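
The structural layer can be sketched with plain Python over rows already pulled from both sides. The row shapes and key names below are assumptions; in practice the same checks are often written as SQL against the staging and CDM schemas.

```python
def reconcile_counts(source_rows, target_rows, key):
    """Compare source and target on a shared key for a near-direct mapping."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }


def orphan_events(events, persons):
    """Return event rows whose person_id has no matching PERSON row."""
    person_ids = {p["person_id"] for p in persons}
    return [e for e in events if e["person_id"] not in person_ids]
```

Reporting the actual missing keys, not just a count difference, is what lets someone trace a discrepancy back to a specific subject.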

Semantic checks ask whether the meaning still holds.

  • Concept distribution review. Sample frequent mapped concepts and inspect whether the distribution looks plausible.
  • Source-value traceability. Pick records in OMOP and trace them back to SDTM values without guesswork.
  • Timing sanity checks. Confirm events don't appear before enrollment unless the study logic explicitly allows it.
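
The timing check in particular is simple to automate. A minimal sketch, assuming events and enrollment dates are already loaded as Python dates; the `allowed_days_before` tolerance is a hypothetical parameter for studies whose protocol permits pre-enrollment events such as screening findings.

```python
from datetime import date


def events_before_enrollment(events, enrollment_by_person, date_field,
                             allowed_days_before=0):
    """Flag events dated earlier than enrollment beyond the allowed window."""
    flagged = []
    for event in events:
        enrolled = enrollment_by_person.get(event["person_id"])
        if enrolled is None:
            continue  # missing enrollment is a separate structural check
        gap = (enrolled - event[date_field]).days
        if gap > allowed_days_before:
            flagged.append({**event, "days_before_enrollment": gap})
    return flagged


# Illustrative data: one event nine days before enrollment, one after.
EVENTS = [
    {"person_id": 1, "condition_start_date": date(2023, 1, 1)},
    {"person_id": 1, "condition_start_date": date(2023, 2, 1)},
]
ENROLLMENT = {1: date(2023, 1, 10)}
```

With the sample data, the default tolerance flags the first event, while a 14-day screening window would flag nothing.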

Analytical checks are the final filter.

  • Run a small set of representative analyses on the SDTM source and the OMOP target.
  • Compare cohort definitions at a high level.
  • Inspect whether temporal ordering changes in ways your imputation rules should explain.
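
One inexpensive way to inspect temporal ordering is to compare each subject's event sequence before and after transformation. This is a sketch, not a substitute for rerunning real analyses. It assumes events are given as `(person_id, event_key, iso_date_string)` tuples and relies on ISO 8601 strings sorting chronologically as text.

```python
def ordering_drift(source_events, target_events):
    """Return person_ids whose event order differs between source and target.

    Each event is (person_id, event_key, iso_date). A partial source date
    such as "2023-05" sorts before any complete date in the same month.
    """
    def sequences(events):
        by_person = {}
        ordered = sorted(events, key=lambda e: (e[0], e[2], e[1]))
        for person_id, event_key, _when in ordered:
            by_person.setdefault(person_id, []).append(event_key)
        return by_person

    source_seq = sequences(source_events)
    target_seq = sequences(target_events)
    everyone = source_seq.keys() | target_seq.keys()
    return sorted(p for p in everyone if source_seq.get(p) != target_seq.get(p))


# Illustrative case: AE1 had a partial date that was imputed to an anchor.
SOURCE = [(1, "AE1", "2023-05"), (1, "AE2", "2023-05-10")]
TARGET = [(1, "AE1", "2023-05-20"), (1, "AE2", "2023-05-10")]
```

Here subject 1 is reported because imputation moved AE1 after AE2. Each reported subject should be explainable by a logged imputation rule; one that is not points to a transformation defect.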

What a useful QA pack looks like

A QA deliverable should help an analyst challenge the data, not just reassure an engineer. Include:

| QA artifact | Why it matters |
| --- | --- |
| Mapping exception log | Shows what wasn't cleanly transformed |
| Date imputation summary | Makes temporal assumptions visible |
| Concept audit sample | Supports manual review of coded mappings |
| Parallel query results | Reveals analytical drift between source and target |

If your team needs a general refresher on how to ensure document data accuracy, that broader validation mindset applies here too. OMOP ETL just raises the stakes because errors can alter downstream evidence generation.

You should also maintain reusable data quality checks around your OMOP release process. This is one area where a structured framework helps, and the guide to data quality checking in OMOP workflows is a relevant reference for building that discipline into recurring runs.

Good QA doesn't ask, “Did the load finish?” It asks, “Would I trust an analyst to publish from this dataset?”

Best Practices and Advanced Considerations

Production-grade SDTM to OMOP work depends on discipline more than cleverness. The teams that keep these pipelines healthy over time do a few things consistently.

Preserve context that standard tables can't hold alone

Not every clinically important attribute has a tidy destination in a single OMOP row. Trial-specific nuance often lives in qualifiers, protocol timing, and event attributes that analysts still need later.

Use supporting structures deliberately. If severity, outcome, dose-change rationale, or other contextual fields matter for interpretation, preserve them with explicit relationships and documented provenance. Flattening those details away makes the dataset look cleaner while making the science weaker.

Version everything that affects interpretation

Treat ETL code, mapping tables, vocabulary state, and date-imputation rules as versioned assets. If a study is rerun months later, you should be able to explain whether a result changed because the source changed, the vocabulary changed, or your transformation logic changed.

That also means avoiding “quiet fixes” in production. If a mapping rule changes, record it, test it, and publish it with the release notes for the dataset.
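
A release manifest is one lightweight way to make those versions comparable. The sketch below fingerprints the mapping table and date rules with SHA-256 so two releases can be diffed; the manifest fields and version labels are assumptions for illustration.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 of a JSON-serializable object (key order ignored)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(etl_version, vocabulary_version, mapping_table, date_rules):
    """Capture everything that affects interpretation of a dataset release."""
    return {
        "etl_version": etl_version,
        "vocabulary_version": vocabulary_version,
        "mapping_sha256": fingerprint(mapping_table),
        "date_rules_sha256": fingerprint(date_rules),
    }


def changed_components(old_manifest, new_manifest):
    """Name the manifest entries that differ between two releases."""
    return sorted(k for k in new_manifest if old_manifest.get(k) != new_manifest[k])
```

When a rerun produces a different result, comparing the two manifests narrows the cause to code, vocabulary, mappings, or date rules before anyone opens the SQL.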

Optimize for repeat runs, not heroic runs

Large trial portfolios expose weak pipeline design quickly. What works in development often fails at scale because engineers embedded study-specific exceptions directly in SQL, mixed review logic with load logic, or made concept mapping dependent on manual intervention at runtime.

A stronger pattern looks like this:

  • Separate shared rules from study configuration so new studies don't require code forks.
  • Cache reviewed mappings so stable source values don't trigger repeated lookup work.
  • Build exception queues for unresolved records rather than blocking entire loads.
  • Emit audit logs for every derived date, fallback mapping, and context-preservation decision.

Keep the target analytical, not merely compliant

A finished load isn't the goal. A reusable analytical asset is.

If you keep that standard in front of the team, design decisions become easier. You stop asking whether a record can be inserted somewhere and start asking whether the transformed dataset still supports valid reasoning across studies.


If your SDTM to OMOP pipeline is getting stuck on terminology resolution, vocabulary maintenance, or repeatable mapping workflows, OMOPHub is one practical option to evaluate. It gives ETL teams programmatic access to OHDSI vocabularies through REST and FHIR APIs, plus SDKs for Python and R, so you can externalize concept lookup and mapping support instead of maintaining all of that infrastructure locally.
