Interactive case study

Azure ETL, Data Quality & Governance

A financial organization ingests data from Oracle, SQL Server, and PostgreSQL systems plus partner flat files. Data arrives with inconsistencies, schema drift, and no automated quality checks. This demonstrates an Azure medallion lakehouse that moves every source from staging to trusted mart — and proves it.

Executive Summary

I demonstrate a medallion lakehouse on Azure. Heterogeneous sources — Oracle, SQL Server, PostgreSQL, and flat files — land in Bronze on Azure Data Lake Storage, are cleansed and conformed into Silver, and are published as governed star-schema marts in Gold on Azure Synapse. Azure Data Factory orchestrates the staging → transform → mart flow, with legacy SSIS packages lifted onto an Azure-SSIS Integration Runtime. Logging tables and Log Analytics track row counts, null checks, referential integrity, and schema drift; a Python anomaly gate (Isolation Forest) quarantines outliers; and a governance dashboard consumes those logs for quality, lineage, and SLA evidence.

3-layer
Medallion · ADLS → Synapse
96.4%
Composite quality score
ADF
Orchestrated staging → mart
−70%
Manual QA effort

01 Problem — example scenario

A financial organization ingests data from an Oracle core-origination system, a SQL Server servicing platform, a PostgreSQL CRM, and partner flat files. Data arrives with inconsistencies, schema drift, and no automated quality checks or anomaly detection. Bad rows reach reporting before anyone notices, and QA is manual and reactive.

02 Constraints

03 Architecture — medallion lakehouse on Azure

The pattern is a medallion lakehouse. Raw extracts land untouched in Bronze on Azure Data Lake Storage (Parquet); they are cleansed, deduplicated, and conformed into Silver; then shaped into star-schema marts in Gold on an Azure Synapse dedicated SQL pool. The validation + anomaly gate sits between Silver and Gold, so only trusted data is ever published to the Power BI semantic layer.

SourcesOracle · SQL Server · PostgreSQL · files
Bronzeraw landing · ADLS Parquet
Silvercleansed · conformed · SCD2
Validate + AI GateDQ checks · anomalies
Goldmarts · Synapse
SemanticPower BI · RLS

04 Orchestration — the Azure Data Factory pipeline

One ADF pipeline, pl_ingest_to_gold, carries each source from staging to mart. A Copy activity (self-hosted IR for the on-prem Oracle and SQL Server sources) lands Parquet in Bronze; a Mapping Data Flow cleanses and conforms Bronze → Silver with SCD Type 2; a stored-procedure step builds the Gold marts in Synapse; and a Databricks/Python activity runs the anomaly gate before publish. Every activity writes run metrics to logging tables and Azure Log Analytics.

Copysource → Bronze
Data FlowBronze → Silver · SCD2
Load Goldproc → Synapse marts
Validate + GateDatabricks · Python
Publishrefresh semantic model
pl_ingest_to_gold.jsonADF pipeline
{
  "name": "pl_ingest_to_gold",
  "activities": [
    { "name": "Copy_Source_to_Bronze", "type": "Copy",
      "inputs": ["ds_oracle_origination"], "outputs": ["ds_adls_bronze_parquet"],
      "connectVia": "SelfHostedIR" },
    { "name": "DF_Bronze_to_Silver", "type": "ExecuteDataFlow",
      "dependsOn": ["Copy_Source_to_Bronze"],
      "typeProperties": { "dataFlow": "df_conform_scd2" } },
    { "name": "SP_Load_Gold_Marts", "type": "SqlServerStoredProcedure",
      "dependsOn": ["DF_Bronze_to_Silver"],
      "typeProperties": { "storedProcedureName": "gold.usp_load_fact_marts" } },
    { "name": "Validate_and_Gate", "type": "DatabricksNotebook",
      "dependsOn": ["SP_Load_Gold_Marts"],
      "typeProperties": { "notebookPath": "/dq/anomaly_gate" } }
  ]
}

05 Lift-and-shift — the legacy SSIS package

Not everything is rebuilt. A proven nightly package, nightly_extract.dtsx, still runs its original control flow — truncate staging, an OLE DB data flow with lookups and derived columns, then failure alerting — but now hosted on an Azure-SSIS Integration Runtime and triggered by the same ADF pipeline. Existing SSIS investment keeps running while new sources move to Mapping Data Flows.

Truncatestaging tables
Data FlowOLE DB → Lookup → Derived Col → dest
Archivemove source file
Notifyon-failure email

06 AI anomaly detection — Python

An Isolation Forest scores each batch for outliers; flagged rows are quarantined and surfaced to the governance dashboard rather than silently loaded. The gate runs as an ADF Databricks activity; lightweight Shell scripts handle file landing and archival, and Power Query / M performs the same shaping for self-service refreshes in Power BI.

anomaly_gate.pyPython
from sklearn.ensemble import IsolationForest
import pandas as pd

def score_batch(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Flag statistical outliers before load. Conservative contamination."""
    model = IsolationForest(
        n_estimators=200,
        contamination=0.01,
        random_state=42,
    )
    scores = model.fit_predict(df[features])
    df["is_anomaly"] = (scores == -1)
    return df

# flagged rows are written to dq.AnomalyLog and moved to a quarantine table

07 Data-quality scorecard

The pipeline scores four dimensions plus an anomaly rate; the composite feeds the governance dashboard and the SLA report (sample values).

Completeness
99.1%
Consistency
97.8%
Timeliness
92.5%
Referential Integrity
99.6%
Anomaly Rate
0.8%

08 Governance dashboard

A single pane consumes pipeline logs and anomaly output to show quality trends and SLA compliance. Filter by source, hover the charts for values, or select a source to drill through to the anomaly log. Sample data.

Data Governance · Pipeline Health SOURCE · ALL SOURCES
Data-Quality Score by Dimension %
12-Month Quality Trend %
Anomalies by Type share
Composite Quality Score target 98.0%

09 Solution demonstration

The example system combines Azure pipeline engineering with governance:

10 Outcome — demonstration

This example shows how combining ETL + quality + governance can, in practice:

← Previous02 · SQL Performance & Modeling Next →04 · SSRS Modernization