A financial organization ingests data from Oracle, SQL Server, and PostgreSQL systems plus partner flat files. Data arrives with inconsistencies, schema drift, and no automated quality checks. This demonstrates an Azure medallion lakehouse that moves every source from staging to trusted mart — and proves it.
I demonstrate a medallion lakehouse on Azure. Heterogeneous sources — Oracle, SQL Server, PostgreSQL, and flat files — land in Bronze on Azure Data Lake Storage, are cleansed and conformed into Silver, and are published as governed star-schema marts in Gold on Azure Synapse. Azure Data Factory orchestrates the staging → transform → mart flow, with legacy SSIS packages lifted onto an Azure-SSIS Integration Runtime. Logging tables and Log Analytics track row counts, null checks, referential integrity, and schema drift; a Python anomaly gate (Isolation Forest) quarantines outliers; and a governance dashboard consumes those logs for quality, lineage, and SLA evidence.
A financial organization ingests data from an Oracle core-origination system, a SQL Server servicing platform, a PostgreSQL CRM, and partner flat files. Data arrives with inconsistencies, schema drift, and no automated quality checks or anomaly detection. Bad rows reach reporting before anyone notices, and QA is manual and reactive.
The pattern is a medallion lakehouse. Raw extracts land untouched in Bronze on Azure Data Lake Storage (Parquet); they are cleansed, deduplicated, and conformed into Silver; then shaped into star-schema marts in Gold on an Azure Synapse dedicated SQL pool. The validation + anomaly gate sits between Silver and Gold, so only trusted data is ever published to the Power BI semantic layer.
One ADF pipeline, pl_ingest_to_gold, carries each source from staging to mart. A
Copy activity (self-hosted IR for the on-prem Oracle and SQL Server sources) lands Parquet in
Bronze; a Mapping Data Flow cleanses and conforms Bronze → Silver with SCD Type 2; a
stored-procedure step builds the Gold marts in Synapse; and a Databricks/Python activity runs the
anomaly gate before publish. Every activity writes run metrics to logging tables and Azure Log
Analytics.
{
"name": "pl_ingest_to_gold",
"activities": [
{ "name": "Copy_Source_to_Bronze", "type": "Copy",
"inputs": ["ds_oracle_origination"], "outputs": ["ds_adls_bronze_parquet"],
"connectVia": "SelfHostedIR" },
{ "name": "DF_Bronze_to_Silver", "type": "ExecuteDataFlow",
"dependsOn": ["Copy_Source_to_Bronze"],
"typeProperties": { "dataFlow": "df_conform_scd2" } },
{ "name": "SP_Load_Gold_Marts", "type": "SqlServerStoredProcedure",
"dependsOn": ["DF_Bronze_to_Silver"],
"typeProperties": { "storedProcedureName": "gold.usp_load_fact_marts" } },
{ "name": "Validate_and_Gate", "type": "DatabricksNotebook",
"dependsOn": ["SP_Load_Gold_Marts"],
"typeProperties": { "notebookPath": "/dq/anomaly_gate" } }
]
}
Not everything is rebuilt. A proven nightly package, nightly_extract.dtsx, still runs its
original control flow — truncate staging, an OLE DB data flow with lookups and derived columns, then
failure alerting — but now hosted on an Azure-SSIS Integration Runtime and triggered by the same
ADF pipeline. Existing SSIS investment keeps running while new sources move to Mapping Data Flows.
An Isolation Forest scores each batch for outliers; flagged rows are quarantined and surfaced to the governance dashboard rather than silently loaded. The gate runs as an ADF Databricks activity; lightweight Shell scripts handle file landing and archival, and Power Query / M performs the same shaping for self-service refreshes in Power BI.
from sklearn.ensemble import IsolationForest import pandas as pd def score_batch(df: pd.DataFrame, features: list) -> pd.DataFrame: """Flag statistical outliers before load. Conservative contamination.""" model = IsolationForest( n_estimators=200, contamination=0.01, random_state=42, ) scores = model.fit_predict(df[features]) df["is_anomaly"] = (scores == -1) return df # flagged rows are written to dq.AnomalyLog and moved to a quarantine table
The pipeline scores four dimensions plus an anomaly rate; the composite feeds the governance dashboard and the SLA report (sample values).
A single pane consumes pipeline logs and anomaly output to show quality trends and SLA compliance. Filter by source, hover the charts for values, or select a source to drill through to the anomaly log. Sample data.
| Row Key | Column | Expected | Actual | Score | Status |
|---|
The example system combines Azure pipeline engineering with governance:
This example shows how combining ETL + quality + governance can, in practice: