Yajur Healthcare Data Lakehouse — Reference Implementation

The Sandbox Navigator
for Hospital Intelligence

// Tier 3 · 300+ bed · Apache Iceberg · NiFi · Kafka · Spark · Trino · Superset · MLflow · HAPI FHIR

9 Open Source Components
5 Clinical Use Cases
6 Implementation Phases
₹0 Licensing Cost
yajur-sandbox — bash — 120×40
yajur@lakehouse:~$ docker compose -f yajur-tier3.yml up -d
[1/9] minio Started → http://localhost:9001 (OBJECT STORAGE)
[2/9] hapi-fhir Started → http://localhost:8080 (FHIR R4 SERVER)
[3/9] kafka Started → localhost:9092 (MESSAGE BROKER)
[4/9] nifi Started → http://localhost:8443 (DATA INGESTION)
[5/9] spark-master Started → http://localhost:8888 (DISTRIBUTED PROCESSING)
[6/9] airflow Started → http://localhost:8085 (ORCHESTRATION)
[7/9] trino Started → http://localhost:8082 (QUERY ENGINE)
[8/9] superset Started → http://localhost:8088 (DASHBOARDS)
[9/9] mlflow Started → http://localhost:5000 (ML PLATFORM)
◈ ALL SYSTEMS NOMINAL — Yajur Sandbox Lakehouse is live. Run ./bootstrap-data.sh to load MIMIC-IV + synthetic cohort.
yajur@lakehouse:~$ _

Four Layers.
One Unified Platform.

The Yajur Sandbox implements the full Tier 3 stack from the reference architecture: nine open source components assembled into a coherent, clinically purposeful data lakehouse. Every piece is containerised, every interface is documented, every data flow is traceable.

Layer 01
Storage
⬡ MinIO
❄ Apache Iceberg
⊛ Nessie Catalog
S3-compatible · ACID · Time-travel · Schema evolution
Layer 02
Ingestion
⇉ Apache NiFi
⚡ Apache Kafka
⊕ HAPI FHIR
HL7 · FHIR R4 · DICOM · CSV · REST · Streaming
Layer 03
Processing
⚙ Apache Spark
◈ dbt
⏱ Apache Airflow
Distributed · SQL transforms · Scheduled DAGs
Layer 04
Intelligence
▷ Trino
◉ Apache Superset
⬡ MLflow
◎ Jupyter
Ad-hoc SQL · Dashboards · Model registry

From Source to Clinical Signal

Every data point entering the lakehouse follows a deterministic path. NiFi extracts, Kafka streams, Spark processes, Iceberg persists, Trino queries, Superset renders. The flow is observable, auditable, and reproducible.

🏥
HIS / EMR
Source Systems
NiFi
Ingestion
Kafka
Streaming
Spark
Processing
Iceberg
Table Format
Trino
Query Engine
Superset
Dashboards
FHIR Path
ABDM/ABHA bundles enter via HAPI FHIR → Spark extracts Patient, Condition, Observation resources → written as Parquet into Iceberg tables → joinable with HIS data via ABHA ID
PhysioNet Path
MIMIC-IV and MC-MED datasets loaded via load_mimic.py → schema mapped to the hospital canonical model → Iceberg tables for ICU stays, labs, diagnoses, prescriptions
Synthetic Path
generate_synthetic.py creates 50k patient records with Indian demographics, ICD-10 diagnoses, pin codes → Claude API generates clinical notes → all joined in the bronze layer
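The synthetic loader's exact schema isn't reproduced in this guide; a minimal sketch of the weighted-diagnosis idea behind generate_synthetic.py, with illustrative weights and field names (not the real ones), might look like:

```python
import random

# Hypothetical weights loosely echoing Indian OPD epidemiology —
# the real generate_synthetic.py defines its own distribution.
ICD10_WEIGHTS = {
    "E11.9": 0.30,  # Type 2 diabetes
    "I10":   0.30,  # Essential hypertension
    "A01.0": 0.10,  # Typhoid fever
    "A15.0": 0.10,  # Pulmonary tuberculosis
    "A90":   0.20,  # Dengue fever
}

def synthetic_patient(patient_id: int, rng: random.Random) -> dict:
    """Generate one synthetic patient row (illustrative fields only)."""
    codes, weights = zip(*ICD10_WEIGHTS.items())
    return {
        "patient_id": patient_id,
        "age": rng.randint(1, 90),
        "pin_code": str(rng.randint(110001, 855117)),  # six-digit Indian PIN
        "icd10_code": rng.choices(codes, weights=weights, k=1)[0],
    }

rng = random.Random(42)  # seeded so each bootstrap run is reproducible
cohort = [synthetic_patient(i, rng) for i in range(50_000)]
```

Seeding the generator keeps the 50k-record cohort identical across rebuilds, which makes dashboards and test queries deterministic.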

Nine Components.
Zero Vendor Lock-in.

Every service in the sandbox is open source, containerised, and pre-configured for the Yajur reference architecture. Click any component to see its role, configuration, and integration points.

MinIO
Object Storage · S3-Compatible
The foundation layer. All raw data, Parquet files, and Iceberg table data lives in MinIO. Runs as a distributed cluster in production; single-node in the sandbox. Accessible at localhost:9001 via the web console.
port :9000 · :9001
Apache Iceberg
Table Format · ACID · Time Travel
The table format layer that turns raw Parquet files in MinIO into ACID-compliant, version-controlled tables. Iceberg provides schema evolution, partition pruning, and time-travel queries — critical for clinical data auditing.
Nessie catalog · Spark integration
HAPI FHIR
FHIR R4 Server · ABDM Integration
Open source reference implementation of HL7 FHIR R4. Receives FHIR bundles from ABDM Health Information Exchange. Stores Patient, Condition, Observation, and DiagnosticReport resources. REST API for querying and validation.
port :8080 · FHIR R4
Apache NiFi
Data Ingestion · Flow-Based ETL
Visual, flow-based data integration. In the sandbox, NiFi flows connect to the synthetic HIS database, pull CSV exports, consume FHIR REST endpoints, and route data to Kafka topics. Pre-built flows for all 5 use case data sources included.
port :8443 · pre-built flows
Apache Kafka
Message Broker · Event Streaming
Decouples ingestion from processing. NiFi writes events to Kafka topics (opd-visits, lab-results, fhir-bundles, iot-vitals). Spark Structured Streaming consumes these topics and writes to Iceberg. Simulates real-time ward data flow.
port :9092 · 4 topics
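The events NiFi publishes to these topics are plain JSON. As a sketch only — the sandbox's NiFi flow defines the real payload schema — a ward-vitals reading destined for the iot-vitals topic could be serialised like this:

```python
import json
from datetime import datetime, timezone

def vitals_event(patient_id: str, heart_rate: int, spo2: int) -> bytes:
    """Serialise one ward-vitals reading as a Kafka message value.
    Field names here are hypothetical; the NiFi flow owns the contract."""
    event = {
        "patient_id": patient_id,
        "heart_rate": heart_rate,
        "spo2": spo2,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event).encode("utf-8")

payload = vitals_event("P-0001", 82, 97)
```

Because the value is UTF-8 JSON, Spark Structured Streaming can parse it downstream with `from_json` before the Iceberg write.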
Apache Spark
Distributed Processing · Batch + Streaming
Processes MIMIC-IV datasets, runs FHIR extraction jobs, executes dbt transformations at scale, and runs the readmission ML training pipeline. Configured with the Iceberg catalog for native table writes. Spark UI available for job monitoring.
port :8888 · :4040 (Spark UI)
dbt Core
Data Transformation · SQL Modelling
Transforms raw data in the bronze layer into clean, enriched silver and gold tables. Pre-built models: stg_patients, stg_opd_visits, fct_admissions, dim_diagnoses, mart_population_health. Full lineage graph included.
bronze → silver → gold
Trino
Query Engine · Federated SQL
Federated query engine that runs SQL across Iceberg tables in MinIO, HAPI FHIR REST endpoints, and the PostgreSQL metadata store — simultaneously, in a single query. All 5 use case queries are pre-loaded. Web UI available for ad-hoc exploration.
port :8082 · federated queries
Apache Superset
Dashboards · Visualisation
5 pre-built dashboards — one for each clinical use case. Geographic heat maps (pin code choropleth), time-series trend charts, readmission risk tables, ANC gap trackers, and dengue early warning panels. Superset connects to Trino as its SQL engine.
port :8088 · 5 dashboards
MLflow + Jupyter
ML Platform · Experiment Tracking
MLflow tracks experiments, registers the readmission risk model, and serves the model via a REST endpoint. Jupyter provides the development notebook — a pre-built readmission_model.ipynb walks through data prep, feature engineering, logistic regression training, and MLflow logging end to end.
MLflow :5000 · Jupyter :8889
Nessie Catalog
Data Catalog · Git-like Versioning
Nessie gives your Iceberg tables Git-like branching. Create a branch, run experimental transformations, and merge only if results are correct — without touching production data. Integrates natively with Spark, Trino, and dbt. The sandbox includes a dev and main branch pre-configured.
port :19120 · REST catalog

Five Questions.
Five Dashboards.

Each use case from the article is implemented as a complete data pipeline: NiFi flow → Kafka topic → Spark job → Iceberg table → dbt model → Trino query → Superset dashboard. Click to expand each use case and see the SQL.

UC-01
Neighbourhood Diabetes & Hypertension Surveillance
Surveillance
+
What it shows
A geographic heat map of diabetes and hypertension diagnosis density by pin code across the hospital's service area, updated nightly. Identifies clusters for community wellness camp targeting. Uses CPCB AQI data as an environmental enrichment layer.
Superset Dashboard
Choropleth map (OpenStreetMap) + bar chart by age group + trend line over 24 months. Powered by Superset's deck.gl geospatial layer connected to Trino.
Data Source
Synthetic OPD visits (50k patients) · MIMIC-IV diagnosis codes mapped to ICD-10 · Census 2011 pin code boundaries (GeoJSON)
Trino SQL Query
-- Diagnosis burden by pin code (Trino / Iceberg)
SELECT
  p.pin_code,
  COUNT(DISTINCT v.patient_id) AS patients,
  COUNT(*) AS total_visits,
  ROUND(
    COUNT(*) * 100.0 /
    SUM(COUNT(*)) OVER (), 2
  ) AS pct_of_total
FROM iceberg.gold.fct_opd_visits v
JOIN iceberg.gold.dim_patients p
  ON v.patient_id = p.patient_id
WHERE (
  v.icd10_code LIKE 'E11%' -- T2 Diabetes
  OR v.icd10_code LIKE 'I10%' -- Hypertension
)
AND v.visit_date >= current_date - INTERVAL '12' MONTH
GROUP BY p.pin_code
ORDER BY patients DESC;
UC-02
Maternal & Child Health Gap Identification
Maternal Health
+
What it shows
A daily list of registered pregnant women who have missed their scheduled ANC visits, mothers who have delivered but not returned for their six-week post-natal check, and infants due for immunisation. Feeds a daily call list for frontline staff.
Superset Dashboard
Three-panel layout: ANC completion funnel, post-natal gap tracker (days since delivery vs. last visit), immunisation due list sortable by pin code. Designed for the nursing station screen.
Data Source
Synthetic ANC registration records · Delivery records · MIMIC-IV maternal cohort · RCH programme indicator schema
Trino SQL Query
-- ANC-registered women: missed visits in last 4 weeks
SELECT
  p.patient_name,
  p.mobile_number,
  p.pin_code,
  a.edd AS expected_delivery,
  MAX(v.visit_date) AS last_anc_visit,
  date_diff('day', MAX(v.visit_date), current_date) AS days_overdue
FROM iceberg.gold.dim_patients p
JOIN iceberg.gold.fct_anc_registrations a
  ON p.patient_id = a.patient_id
JOIN iceberg.gold.fct_opd_visits v
  ON p.patient_id = v.patient_id
  AND v.visit_type = 'ANC'
WHERE a.status = 'active'
GROUP BY p.patient_name, p.mobile_number,
         p.pin_code, a.edd
HAVING MAX(v.visit_date) < current_date
        - INTERVAL '28' DAY
ORDER BY days_overdue DESC;
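The filter the HAVING clause applies — keep only women whose last ANC visit is more than 28 days old, sorted most-overdue first — can be mirrored in application code. A minimal sketch (hypothetical function and field names):

```python
from datetime import date

def overdue_anc(
    last_visits: dict[str, date], today: date, threshold_days: int = 28
) -> list[tuple[str, int]]:
    """Return (patient_id, days_overdue) for patients whose last ANC
    visit is older than the threshold, most overdue first."""
    out = []
    for pid, last_visit in last_visits.items():
        gap = (today - last_visit).days
        if gap > threshold_days:
            out.append((pid, gap))
    return sorted(out, key=lambda t: -t[1])

calls = overdue_anc(
    {"P-01": date(2024, 5, 1), "P-02": date(2024, 5, 28)},
    today=date(2024, 6, 1),
)
# P-01 is 31 days overdue and makes the call list; P-02 does not.
```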
UC-03
Seasonal Respiratory Clustering & Capacity Preparation
Respiratory
+
What it shows
Weekly respiratory admission counts overlaid with CPCB AQI data from the local monitoring station. Identifies AQI thresholds that historically precede admission spikes by 7–14 days, enabling proactive capacity and medication stock preparation.
Superset Dashboard
Dual-axis time-series chart (admissions vs. AQI) + admission spike alert panel + pre-computed 14-day AQI forecast correlation. Built using Superset's ECharts integration.
Data Source
MIMIC-IV respiratory cohort (COPD, Asthma, Pneumonia ICD codes) · CPCB AQI API (live + historical) · Synthetic OPD respiratory visits
Trino SQL Query
-- Weekly respiratory admissions + AQI correlation
SELECT
  date_trunc('week', a.admission_date) AS week,
  COUNT(*) AS respiratory_admissions,
  AVG(q.aqi_value) AS avg_aqi,
  MAX(q.aqi_value) AS peak_aqi
FROM iceberg.gold.fct_admissions a
JOIN iceberg.gold.dim_diagnoses d
  ON a.primary_diagnosis_id = d.diagnosis_id
LEFT JOIN iceberg.silver.ext_aqi_daily q
  ON CAST(a.admission_date AS DATE) = q.aqi_date
WHERE d.icd10_chapter = 'J'
  -- Chapter J: Respiratory system
AND a.admission_date >= current_date
    - INTERVAL '24' MONTH
GROUP BY 1
ORDER BY week;
UC-04
Vector-Borne Disease Early Warning
Outbreak Detection
+
What it shows
Weekly syndromic surveillance: counts of fever presentations with low platelet counts by residential pin code. A cluster forming in any pin code triggers an alert 10–14 days before it would register in district surveillance. Blood bank platelet stock is shown alongside alert levels.
Superset Dashboard
Alert heat map by pin code + rolling 4-week fever trend + platelet demand forecast panel. Colour-coded alert thresholds (green/amber/red) based on historical dengue season baselines.
Data Source
Synthetic fever OPD records with platelet lab values · MIMIC-IV labs (CBC panel mapped to LOINC) · Blood bank inventory table
Trino SQL Query
-- Dengue early warning: fever + thrombocytopenia clusters
WITH weekly_signals AS (
  SELECT
    p.pin_code,
    date_trunc('week', v.visit_date) AS week,
    COUNT(*) AS fever_cases,
    COUNT(CASE WHEN
      l.platelet_count < 100000
    THEN 1 END) AS thrombocytopenia_cases
  FROM iceberg.gold.fct_opd_visits v
  JOIN iceberg.gold.dim_patients p ON v.patient_id = p.patient_id
  LEFT JOIN iceberg.gold.fct_lab_results l ON v.visit_id = l.visit_id
  WHERE lower(v.presenting_complaint) LIKE '%fever%'
  GROUP BY 1, 2
)
SELECT *, CASE
  WHEN thrombocytopenia_cases >= 5 THEN '🔴 ALERT'
  WHEN thrombocytopenia_cases >= 2 THEN '🟡 WATCH'
  ELSE '🟢 NORMAL'
END AS alert_level
FROM weekly_signals
ORDER BY week DESC, fever_cases DESC;
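The same CASE thresholds drive the colour-coded alert panel, so application code (for example, an SMS notifier) can reproduce them exactly. A minimal sketch:

```python
def alert_level(thrombocytopenia_cases: int) -> str:
    """Mirror the SQL CASE thresholds for the weekly dengue signal.
    Cut-offs match the query above; tune them to local season baselines."""
    if thrombocytopenia_cases >= 5:
        return "ALERT"
    if thrombocytopenia_cases >= 2:
        return "WATCH"
    return "NORMAL"
```

Keeping one canonical set of thresholds (SQL for the dashboard, this function for notifications) avoids the two paths silently drifting apart.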
UC-05
30-Day Readmission Risk Prediction
ML Prediction
+
What it shows
At the time of discharge, every patient receives a 30-day readmission risk score (0–100) generated by a logistic regression model trained on MIMIC-IV. High-risk patients (score ≥ 70) appear in the daily "Follow-up Priority" list for the discharge planning team.
ML Pipeline
Trained in Jupyter on MIMIC-IV admissions data. Features: age, LOS, comorbidity count, Charlson score, discharge disposition, prior 90-day admissions. Registered in MLflow model registry. Served via REST endpoint. Predictions written back to Iceberg nightly via Airflow DAG.
Data Source
MIMIC-IV admissions + diagnoses + lab values (PhysioNet credentialed) · Charlson Comorbidity Index computed via dbt model · MLflow experiment tracking
Trino SQL Query
-- Today's high-risk discharge cohort
SELECT
  p.patient_name,
  p.patient_id,
  d.discharge_date,
  d.primary_diagnosis_display,
  r.risk_score,
  r.risk_tier,
  r.top_risk_factor
FROM iceberg.gold.fct_discharges d
JOIN iceberg.gold.dim_patients p
  ON d.patient_id = p.patient_id
JOIN iceberg.gold.fct_readmission_scores r
  ON d.discharge_id = r.discharge_id
WHERE d.discharge_date = current_date
  AND r.risk_tier = 'HIGH'
ORDER BY r.risk_score DESC;
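The scoring step itself is a standard logistic regression. The real coefficients live in the MLflow model registry after training on MIMIC-IV; the sketch below uses made-up coefficients purely to show how a probability becomes the 0–100 score the dashboard displays:

```python
import math

# Hypothetical coefficients for illustration only — the production
# model is trained in Jupyter and registered in MLflow.
COEFFS = {
    "age": 0.02,
    "los_days": 0.05,
    "comorbidity_count": 0.30,
    "prior_90d_admissions": 0.60,
}
INTERCEPT = -4.0

def readmission_risk_score(features: dict) -> int:
    """Logistic-regression probability scaled to a 0-100 score.
    Scores >= 70 land on the Follow-up Priority list."""
    z = INTERCEPT + sum(COEFFS[k] * features.get(k, 0.0) for k in COEFFS)
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(probability * 100)

score = readmission_risk_score(
    {"age": 72, "los_days": 9, "comorbidity_count": 4,
     "prior_90d_admissions": 2}
)
```

The nightly Airflow DAG runs this scoring over the day's discharges and writes the results back to iceberg.gold.fct_readmission_scores.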

Four Data Streams.
One Unified Schema.

The sandbox ingests from four distinct sources — each reflecting a real-world hospital data modality. The bootstrap-data.sh script orchestrates all four loaders in sequence, normalising into a canonical hospital data model before the first Iceberg write.

🏛
MIMIC-IV
PhysioNet · Clinical Database
47k ICU patients. Full admission records, lab values (CBC, metabolic panel, cultures), diagnoses (ICD-10-CM), procedures, prescriptions, and nursing notes. The richest open clinical dataset available. Maps to Iceberg bronze tables via load_mimic.py.
Credentialed Access
🧬
MIMIC-IV Demo
PhysioNet · No Credentialing Required
100-patient subset of MIMIC-IV, available immediately without PhysioNet credentialing. Used for the initial sandbox bootstrap so teams can start before full access is granted. Identical schema to the full dataset.
Free — No Credentials
🇮🇳
Synthetic Cohort
Python Generated · India-Specific
50,000 patients generated by generate_synthetic.py with Indian demographics, pin codes, ICD-10 diagnoses weighted to Indian epidemiology (diabetes, TB, typhoid, dengue). Clinical notes generated via Claude API with realistic discharge summary templates.
AI-Generated
FHIR R4 Bundles
HL7 FHIR · ABDM-Compatible
Synthetic FHIR R4 bundles generated by load_fhir.py: Patient, Condition, Observation (vitals, labs), MedicationRequest, DiagnosticReport. Loaded into HAPI FHIR server. Extracted nightly by Spark and written to iceberg.silver.fhir_* tables.
FHIR Synthetic
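A FHIR R4 transaction bundle is ordinary JSON, so the loader can build it with the standard library. As a sketch of the shape load_fhir.py POSTs to HAPI FHIR — one Patient resource only; the real loader adds Condition, Observation, and the other resources, and the ABHA identifier system URI shown is an assumption:

```python
import json

def patient_bundle(abha_id: str, name: str, birth_date: str) -> str:
    """Build a minimal FHIR R4 transaction bundle with one Patient.
    The identifier system URI is illustrative, not authoritative."""
    bundle = {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [{
            "resource": {
                "resourceType": "Patient",
                "identifier": [{
                    "system": "https://healthid.ndhm.gov.in",  # assumed ABHA system
                    "value": abha_id,
                }],
                "name": [{"text": name}],
                "birthDate": birth_date,
            },
            "request": {"method": "POST", "url": "Patient"},
        }],
    }
    return json.dumps(bundle)
```

POSTing this payload to the server's base URL (localhost:8080/fhir in the sandbox) creates the Patient; the nightly Spark job then extracts it into iceberg.silver.fhir_* tables.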

Every Service.
Observable.

The sandbox includes a live "Under the Hood" dashboard — a React UI showing all nine service statuses, the animated data flow, an embedded Trino SQL explorer, and a pipeline DAG view. Here's a preview of what it shows.

Service Health Monitor
minio
RELEASE.2024-01
:9000 · :9001
hapi-fhir
7.2.0
:8080
kafka + zookeeper
3.7.0
:9092
apache-nifi
2.0.0
:8443
spark-master
3.5.1
:8888 · :4040
airflow-webserver
2.9.2
:8085
trino
443
:8082
superset
4.0.1
:8088
mlflow
2.13.0
:5000
Airflow DAG Monitor
yajur_diabetes_surveillance
@daily
✓ 2h ago
yajur_maternal_gap_tracker
@daily
✓ 2h ago
yajur_respiratory_aqi_sync
@hourly
✓ 42m ago
yajur_dengue_early_warning
@daily
✓ 2h ago
yajur_readmission_scoring
@daily
⟳ running
yajur_fhir_to_iceberg_sync
@hourly
✓ 18m ago
yajur_dbt_gold_layer_refresh
@daily
✓ 2h ago
yajur_mimic_initial_load
@once
✓ completed

Six Phases.
One Working Lakehouse.

The sandbox is built and delivered in six sequential phases. Each phase has a concrete deliverable that can be demonstrated independently. Together they compose the complete Yajur Healthcare Data Lakehouse reference implementation.

1
Current · Phase 1
Sandbox Navigator
Week 1
This interactive guide — the mission control for the entire sandbox. Architecture overview, component documentation, use case SQL library, and implementation roadmap. Shareable with hospital CTOs and investors as a standalone demo.
yajur-sandbox-navigator.html
2
✓ Complete
Docker Compose Stack
Week 1–2
Complete docker-compose.yml for all 14 services — PostgreSQL, MinIO, Nessie, HAPI FHIR, Kafka, NiFi, Spark, Airflow, Trino, Superset, MLflow, Jupyter. Pre-configured with env variables, volume mounts, health checks, and boot ordering. One command to launch. Open runbook →
docker-compose.yml · bootstrap.sh · 6 config files
3
✓ Complete
Data Generation Layer
Week 2
Four data loaders: generate_synthetic.py (50k Indian patients), load_mimic_demo.py (no credentials required), load_fhir.py (FHIR R4 bundles into HAPI), and generate_clinical_notes.py (Claude API for discharge summaries). Master bootstrap-data.sh orchestrates all four. Open runbook →
bootstrap-data.sh + 4 loaders
4
✓ Complete
dbt Models + Airflow DAGs
Week 2–3
Full dbt project: staging models (bronze), intermediate transforms (silver), and 5 analytical marts (gold) — one per use case. Eight Airflow DAGs including nightly HIS sync, FHIR extraction, dbt run, readmission scoring, and CPCB AQI ingestion. Full lineage documentation. Open runbook →
dbt/ + dags/ directories
5
✓ Complete
"Under the Hood" Dashboard
Week 3
Live infrastructure dashboard embedded in the Navigator: service health monitor with boot sequence animation (9 services), animated SVG data flow diagram with particle effects, SQL query explorer with mock Trino results for all 5 use cases, and Airflow DAG view with expandable task graphs and Gantt history. Open dashboard →
Embedded in yajur-sandbox-navigator.html
6
Phase 6
Implementation Guide
Week 3–4
Full Word + PDF documentation: architecture decision records, component configuration reference, PhysioNet/MIMIC-IV dataset access guide (credentialed and demo paths), Superset dashboard setup walkthrough, MLflow model training notebook guide, and production hardening checklist for hospital IT teams.
Yajur-Sandbox-Guide-v1.docx
// Phase 5 Complete

The lakehouse has
eyes on itself.

The Under the Hood dashboard is live — service health, animated data flow, SQL explorer, and DAG monitor in one place. Switch to Phase 5 to see every component of the lakehouse in motion.

Read the Article ↗