Yajur Healthcare Data Lakehouse — Reference Implementation

The Sandbox Navigator
for Hospital Intelligence

// Tier 3 · 300+ bed · Apache Iceberg · NiFi · Kafka · Spark · Trino · Superset · MLflow · HAPI FHIR

9 Open Source Components
5 Clinical Use Cases
6 Implementation Phases
₹0 Licensing Cost
yajur-sandbox — bash — 120×40
yajur@lakehouse:~$ docker compose -f yajur-tier3.yml up -d
[1/9] minio Started → http://localhost:9001 (OBJECT STORAGE)
[2/9] hapi-fhir Started → http://localhost:8080 (FHIR R4 SERVER)
[3/9] kafka Started → localhost:9092 (MESSAGE BROKER)
[4/9] nifi Started → http://localhost:8443 (DATA INGESTION)
[5/9] spark-master Started → http://localhost:8888 (DISTRIBUTED PROCESSING)
[6/9] airflow Started → http://localhost:8085 (ORCHESTRATION)
[7/9] trino Started → http://localhost:8082 (QUERY ENGINE)
[8/9] superset Started → http://localhost:8088 (DASHBOARDS)
[9/9] mlflow Started → http://localhost:5000 (ML PLATFORM)
◈ ALL SYSTEMS NOMINAL — Yajur Sandbox Lakehouse is live. Run ./bootstrap-data.sh to load MIMIC-IV + synthetic cohort.
yajur@lakehouse:~$ _

Four Layers.
One Unified Platform.

The Yajur Sandbox implements the full Tier 3 stack from the reference architecture: nine open source components assembled into a coherent, clinically purposeful data lakehouse. Every piece is containerised, every interface is documented, every data flow is traceable.

Layer 01
Storage
⬡ MinIO
❄ Apache Iceberg
⊛ Nessie Catalog
S3-compatible · ACID · Time-travel · Schema evolution
Layer 02
Ingestion
⇉ Apache NiFi
⚡ Apache Kafka
⊕ HAPI FHIR
HL7 · FHIR R4 · DICOM · CSV · REST · Streaming
Layer 03
Processing
⚙ Apache Spark
◈ dbt
⏱ Apache Airflow
Distributed · SQL transforms · Scheduled DAGs
Layer 04
Intelligence
▷ Trino
◉ Apache Superset
⬡ MLflow
◎ Jupyter
Ad-hoc SQL · Dashboards · Model registry

From Source to Clinical Signal

Every data point entering the lakehouse follows a deterministic path. NiFi extracts, Kafka streams, Spark processes, Iceberg persists, Trino queries, Superset renders. The flow is observable, auditable, and reproducible.

🏥
HIS / EMR
Source Systems
NiFi
Ingestion
Kafka
Streaming
Spark
Processing
Iceberg
Table Format
Trino
Query Engine
Superset
Dashboards
FHIR Path
ABDM/ABHA bundles enter via HAPI FHIR → Spark extracts Patient, Condition, Observation resources → written as Parquet into Iceberg tables → joinable with HIS data via ABHA ID
PhysioNet Path
MIMIC-IV and MC-MED datasets loaded via load_mimic.py → schema mapped to the hospital canonical model → Iceberg tables for ICU stays, labs, diagnoses, prescriptions
Synthetic Path
generate_synthetic.py creates 50k patient records with Indian demographics, ICD-10 diagnoses, pin codes → Claude API generates clinical notes → all joined in the bronze layer
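The synthetic loader's exact schema isn't reproduced in this guide; a minimal sketch of the weighted-diagnosis idea behind generate_synthetic.py, with illustrative weights and field names (not the real ones), might look like:

```python
import random

# Hypothetical weights loosely echoing Indian OPD epidemiology —
# the real generate_synthetic.py defines its own distribution.
ICD10_WEIGHTS = {
    "E11.9": 0.30,  # Type 2 diabetes
    "I10":   0.30,  # Essential hypertension
    "A01.0": 0.10,  # Typhoid fever
    "A15.0": 0.10,  # Pulmonary tuberculosis
    "A90":   0.20,  # Dengue fever
}

def synthetic_patient(patient_id: int, rng: random.Random) -> dict:
    """Generate one synthetic patient row (illustrative fields only)."""
    codes, weights = zip(*ICD10_WEIGHTS.items())
    return {
        "patient_id": patient_id,
        "age": rng.randint(1, 90),
        "pin_code": str(rng.randint(110001, 855117)),  # six-digit Indian PIN
        "icd10_code": rng.choices(codes, weights=weights, k=1)[0],
    }

rng = random.Random(42)  # seeded so each bootstrap run is reproducible
cohort = [synthetic_patient(i, rng) for i in range(50_000)]
```

Seeding the generator keeps the 50k-record cohort identical across rebuilds, which makes dashboards and test queries deterministic.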

Nine Components.
Zero Vendor Lock-in.

Every service in the sandbox is open source, containerised, and pre-configured for the Yajur reference architecture. Click any component to see its role, configuration, and integration points.

MinIO
Object Storage · S3-Compatible
The foundation layer. All raw data, Parquet files, and Iceberg table data lives in MinIO. Runs as a distributed cluster in production; single-node in the sandbox. Accessible at localhost:9001 via the web console.
port :9000 · :9001
Apache Iceberg
Table Format · ACID · Time Travel
The table format layer that turns raw Parquet files in MinIO into ACID-compliant, version-controlled tables. Iceberg provides schema evolution, partition pruning, and time-travel queries — critical for clinical data auditing.
Nessie catalog · Spark integration
HAPI FHIR
FHIR R4 Server · ABDM Integration
Open source reference implementation of HL7 FHIR R4. Receives FHIR bundles from ABDM Health Information Exchange. Stores Patient, Condition, Observation, and DiagnosticReport resources. REST API for querying and validation.
port :8080 · FHIR R4
Apache NiFi
Data Ingestion · Flow-Based ETL
Visual, flow-based data integration. In the sandbox, NiFi flows connect to the synthetic HIS database, pull CSV exports, consume FHIR REST endpoints, and route data to Kafka topics. Pre-built flows for all 5 use case data sources included.
port :8443 · pre-built flows
Apache Kafka
Message Broker · Event Streaming
Decouples ingestion from processing. NiFi writes events to Kafka topics (opd-visits, lab-results, fhir-bundles, iot-vitals). Spark Structured Streaming consumes these topics and writes to Iceberg. Simulates real-time ward data flow.
port :9092 · 4 topics
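The events NiFi publishes to these topics are plain JSON. As a sketch only — the sandbox's NiFi flow defines the real payload schema — a ward-vitals reading destined for the iot-vitals topic could be serialised like this:

```python
import json
from datetime import datetime, timezone

def vitals_event(patient_id: str, heart_rate: int, spo2: int) -> bytes:
    """Serialise one ward-vitals reading as a Kafka message value.
    Field names here are hypothetical; the NiFi flow owns the contract."""
    event = {
        "patient_id": patient_id,
        "heart_rate": heart_rate,
        "spo2": spo2,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event).encode("utf-8")

payload = vitals_event("P-0001", 82, 97)
```

Because the value is UTF-8 JSON, Spark Structured Streaming can parse it downstream with `from_json` before the Iceberg write.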
Apache Spark
Distributed Processing · Batch + Streaming
Processes MIMIC-IV datasets, runs FHIR extraction jobs, executes dbt transformations at scale, and runs the readmission ML training pipeline. Configured with the Iceberg catalog for native table writes. Spark UI available for job monitoring.
port :8888 · :4040 (Spark UI)
dbt Core
Data Transformation · SQL Modelling
Transforms raw data in the bronze layer into clean, enriched silver and gold tables. Pre-built models: stg_patients, stg_opd_visits, fct_admissions, dim_diagnoses, mart_population_health. Full lineage graph included.
bronze → silver → gold
Trino
Query Engine · Federated SQL
Federated query engine that runs SQL across Iceberg tables in MinIO, HAPI FHIR REST endpoints, and the PostgreSQL metadata store — simultaneously, in a single query. All 5 use case queries are pre-loaded. Web UI available for ad-hoc exploration.
port :8082 · federated queries
Apache Superset
Dashboards · Visualisation
5 pre-built dashboards — one for each clinical use case. Geographic heat maps (pin code choropleth), time-series trend charts, readmission risk tables, ANC gap trackers, and dengue early warning panels. Superset connects to Trino as its SQL engine.
port :8088 · 5 dashboards
MLflow + Jupyter
ML Platform · Experiment Tracking
MLflow tracks experiments, registers the readmission risk model, and serves the model via a REST endpoint. Jupyter provides the development notebook — a pre-built readmission_model.ipynb walks through data prep, feature engineering, logistic regression training, and MLflow logging end to end.
MLflow :5000 · Jupyter :8889
Nessie Catalog
Data Catalog · Git-like Versioning
Nessie gives your Iceberg tables Git-like branching. Create a branch, run experimental transformations, and merge only if results are correct — without touching production data. Integrates natively with Spark, Trino, and dbt. The sandbox includes a dev and main branch pre-configured.
port :19120 · REST catalog

Five Questions.
Five Dashboards.

Each use case from the article is implemented as a complete data pipeline: NiFi flow → Kafka topic → Spark job → Iceberg table → dbt model → Trino query → Superset dashboard. Click to expand each use case and see the SQL.

UC-01
Neighbourhood Diabetes & Hypertension Surveillance
Surveillance
+
What it shows
A geographic heat map of diabetes and hypertension diagnosis density by pin code across the hospital's service area, updated nightly. Identifies clusters for community wellness camp targeting. Uses CPCB AQI data as an environmental enrichment layer.
Superset Dashboard
Choropleth map (OpenStreetMap) + bar chart by age group + trend line over 24 months. Powered by Superset's deck.gl geospatial layer connected to Trino.
Data Source
Synthetic OPD visits (50k patients) · MIMIC-IV diagnosis codes mapped to ICD-10 · Census 2011 pin code boundaries (GeoJSON)
Trino SQL Query
-- Diagnosis burden by pin code (Trino / Iceberg)
SELECT
  p.pin_code,
  COUNT(DISTINCT v.patient_id) AS patients,
  COUNT(*) AS total_visits,
  ROUND(
    COUNT(*) * 100.0 /
    SUM(COUNT(*)) OVER (), 2
  ) AS pct_of_total
FROM iceberg.gold.fct_opd_visits v
JOIN iceberg.gold.dim_patients p
  ON v.patient_id = p.patient_id
WHERE (
  v.icd10_code LIKE 'E11%' -- T2 Diabetes
  OR v.icd10_code LIKE 'I10%' -- Hypertension
)
AND v.visit_date >= current_date - INTERVAL '12' MONTH
GROUP BY p.pin_code
ORDER BY patients DESC;
UC-02
Maternal & Child Health Gap Identification
Maternal Health
+
What it shows
A daily list of registered pregnant women who have missed their scheduled ANC visits, mothers who have delivered but not returned for their six-week post-natal check, and infants due for immunisation. Feeds a daily call list for frontline staff.
Superset Dashboard
Three-panel layout: ANC completion funnel, post-natal gap tracker (days since delivery vs. last visit), immunisation due list sortable by pin code. Designed for the nursing station screen.
Data Source
Synthetic ANC registration records · Delivery records · MIMIC-IV maternal cohort · RCH programme indicator schema
Trino SQL Query
-- ANC-registered women: missed visits in last 4 weeks
SELECT
  p.patient_name,
  p.mobile_number,
  p.pin_code,
  a.edd AS expected_delivery,
  MAX(v.visit_date) AS last_anc_visit,
  date_diff('day', MAX(v.visit_date), current_date) AS days_overdue
FROM iceberg.gold.dim_patients p
JOIN iceberg.gold.fct_anc_registrations a
  ON p.patient_id = a.patient_id
JOIN iceberg.gold.fct_opd_visits v
  ON p.patient_id = v.patient_id
  AND v.visit_type = 'ANC'
WHERE a.status = 'active'
GROUP BY p.patient_name, p.mobile_number,
         p.pin_code, a.edd
HAVING MAX(v.visit_date) < current_date
        - INTERVAL '28' DAY
ORDER BY days_overdue DESC;
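The filter the HAVING clause applies — keep only women whose last ANC visit is more than 28 days old, sorted most-overdue first — can be mirrored in application code. A minimal sketch (hypothetical function and field names):

```python
from datetime import date

def overdue_anc(
    last_visits: dict[str, date], today: date, threshold_days: int = 28
) -> list[tuple[str, int]]:
    """Return (patient_id, days_overdue) for patients whose last ANC
    visit is older than the threshold, most overdue first."""
    out = []
    for pid, last_visit in last_visits.items():
        gap = (today - last_visit).days
        if gap > threshold_days:
            out.append((pid, gap))
    return sorted(out, key=lambda t: -t[1])

calls = overdue_anc(
    {"P-01": date(2024, 5, 1), "P-02": date(2024, 5, 28)},
    today=date(2024, 6, 1),
)
# P-01 is 31 days overdue and makes the call list; P-02 does not.
```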
UC-03
Seasonal Respiratory Clustering & Capacity Preparation
Respiratory
+
What it shows
Weekly respiratory admission counts overlaid with CPCB AQI data from the local monitoring station. Identifies AQI thresholds that historically precede admission spikes by 7–14 days, enabling proactive capacity and medication stock preparation.
Superset Dashboard
Dual-axis time-series chart (admissions vs. AQI) + admission spike alert panel + pre-computed 14-day AQI forecast correlation. Built using Superset's ECharts integration.
Data Source
MIMIC-IV respiratory cohort (COPD, Asthma, Pneumonia ICD codes) · CPCB AQI API (live + historical) · Synthetic OPD respiratory visits
Trino SQL Query
-- Weekly respiratory admissions + AQI correlation
SELECT
  date_trunc('week', a.admission_date) AS week,
  COUNT(*) AS respiratory_admissions,
  AVG(q.aqi_value) AS avg_aqi,
  MAX(q.aqi_value) AS peak_aqi
FROM iceberg.gold.fct_admissions a
JOIN iceberg.gold.dim_diagnoses d
  ON a.primary_diagnosis_id = d.diagnosis_id
LEFT JOIN iceberg.silver.ext_aqi_daily q
  ON CAST(a.admission_date AS DATE) = q.aqi_date
WHERE d.icd10_chapter = 'J'
  -- Chapter J: Respiratory system
AND a.admission_date >= current_date
    - INTERVAL '24' MONTH
GROUP BY 1
ORDER BY week;
UC-04
Vector-Borne Disease Early Warning
Outbreak Detection
+
What it shows
Weekly syndromic surveillance: counts of fever presentations with low platelet counts by residential pin code. A cluster forming in any pin code triggers an alert 10–14 days before it would register in district surveillance. Blood bank platelet stock is shown alongside alert levels.
Superset Dashboard
Alert heat map by pin code + rolling 4-week fever trend + platelet demand forecast panel. Colour-coded alert thresholds (green/amber/red) based on historical dengue season baselines.
Data Source
Synthetic fever OPD records with platelet lab values · MIMIC-IV labs (CBC panel mapped to LOINC) · Blood bank inventory table
Trino SQL Query
-- Dengue early warning: fever + thrombocytopenia clusters
WITH weekly_signals AS (
  SELECT
    p.pin_code,
    date_trunc('week', v.visit_date) AS week,
    COUNT(*) AS fever_cases,
    COUNT(CASE WHEN
      l.platelet_count < 100000
    THEN 1 END) AS thrombocytopenia_cases
  FROM iceberg.gold.fct_opd_visits v
  JOIN iceberg.gold.dim_patients p ON v.patient_id = p.patient_id
  LEFT JOIN iceberg.gold.fct_lab_results l ON v.visit_id = l.visit_id
  WHERE lower(v.presenting_complaint) LIKE '%fever%'
  GROUP BY 1, 2
)
SELECT *, CASE
  WHEN thrombocytopenia_cases >= 5 THEN '🔴 ALERT'
  WHEN thrombocytopenia_cases >= 2 THEN '🟡 WATCH'
  ELSE '🟢 NORMAL'
END AS alert_level
FROM weekly_signals
ORDER BY week DESC, fever_cases DESC;
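The same CASE thresholds drive the colour-coded alert panel, so application code (for example, an SMS notifier) can reproduce them exactly. A minimal sketch:

```python
def alert_level(thrombocytopenia_cases: int) -> str:
    """Mirror the SQL CASE thresholds for the weekly dengue signal.
    Cut-offs match the query above; tune them to local season baselines."""
    if thrombocytopenia_cases >= 5:
        return "ALERT"
    if thrombocytopenia_cases >= 2:
        return "WATCH"
    return "NORMAL"
```

Keeping one canonical set of thresholds (SQL for the dashboard, this function for notifications) avoids the two paths silently drifting apart.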
UC-05
30-Day Readmission Risk Prediction
ML Prediction
+
What it shows
At the time of discharge, every patient receives a 30-day readmission risk score (0–100) generated by a logistic regression model trained on MIMIC-IV. High-risk patients (score ≥ 70) appear in the daily "Follow-up Priority" list for the discharge planning team.
ML Pipeline
Trained in Jupyter on MIMIC-IV admissions data. Features: age, LOS, comorbidity count, Charlson score, discharge disposition, prior 90-day admissions. Registered in MLflow model registry. Served via REST endpoint. Predictions written back to Iceberg nightly via Airflow DAG.
Data Source
MIMIC-IV admissions + diagnoses + lab values (PhysioNet credentialed) · Charlson Comorbidity Index computed via dbt model · MLflow experiment tracking
Trino SQL Query
-- Today's high-risk discharge cohort
SELECT
  p.patient_name,
  p.patient_id,
  d.discharge_date,
  d.primary_diagnosis_display,
  r.risk_score,
  r.risk_tier,
  r.top_risk_factor
FROM iceberg.gold.fct_discharges d
JOIN iceberg.gold.dim_patients p
  ON d.patient_id = p.patient_id
JOIN iceberg.gold.fct_readmission_scores r
  ON d.discharge_id = r.discharge_id
WHERE d.discharge_date = current_date
  AND r.risk_tier = 'HIGH'
ORDER BY r.risk_score DESC;
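The scoring step itself is a standard logistic regression. The real coefficients live in the MLflow model registry after training on MIMIC-IV; the sketch below uses made-up coefficients purely to show how a probability becomes the 0–100 score the dashboard displays:

```python
import math

# Hypothetical coefficients for illustration only — the production
# model is trained in Jupyter and registered in MLflow.
COEFFS = {
    "age": 0.02,
    "los_days": 0.05,
    "comorbidity_count": 0.30,
    "prior_90d_admissions": 0.60,
}
INTERCEPT = -4.0

def readmission_risk_score(features: dict) -> int:
    """Logistic-regression probability scaled to a 0-100 score.
    Scores >= 70 land on the Follow-up Priority list."""
    z = INTERCEPT + sum(COEFFS[k] * features.get(k, 0.0) for k in COEFFS)
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(probability * 100)

score = readmission_risk_score(
    {"age": 72, "los_days": 9, "comorbidity_count": 4,
     "prior_90d_admissions": 2}
)
```

The nightly Airflow DAG runs this scoring over the day's discharges and writes the results back to iceberg.gold.fct_readmission_scores.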

Four Data Streams.
One Unified Schema.

The sandbox ingests from four distinct sources — each reflecting a real-world hospital data modality. The bootstrap-data.sh script orchestrates all four loaders in sequence, normalising into a canonical hospital data model before the first Iceberg write.

🏛
MIMIC-IV
PhysioNet · Clinical Database
47k ICU patients. Full admission records, lab values (CBC, metabolic panel, cultures), diagnoses (ICD-10-CM), procedures, prescriptions, and nursing notes. The richest open clinical dataset available. Maps to Iceberg bronze tables via load_mimic.py.
Credentialed Access
🧬
MIMIC-IV Demo
PhysioNet · No Credentialing Required
100-patient subset of MIMIC-IV, available immediately without PhysioNet credentialing. Used for the initial sandbox bootstrap so teams can start before full access is granted. Identical schema to the full dataset.
Free — No Credentials
🇮🇳
Synthetic Cohort
Python Generated · India-Specific
50,000 patients generated by generate_synthetic.py with Indian demographics, pin codes, ICD-10 diagnoses weighted to Indian epidemiology (diabetes, TB, typhoid, dengue). Clinical notes generated via Claude API with realistic discharge summary templates.
AI-Generated
FHIR R4 Bundles
HL7 FHIR · ABDM-Compatible
Synthetic FHIR R4 bundles generated by load_fhir.py: Patient, Condition, Observation (vitals, labs), MedicationRequest, DiagnosticReport. Loaded into HAPI FHIR server. Extracted nightly by Spark and written to iceberg.silver.fhir_* tables.
FHIR Synthetic
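A FHIR R4 transaction bundle is ordinary JSON, so the loader can build it with the standard library. As a sketch of the shape load_fhir.py POSTs to HAPI FHIR — one Patient resource only; the real loader adds Condition, Observation, and the other resources, and the ABHA identifier system URI shown is an assumption:

```python
import json

def patient_bundle(abha_id: str, name: str, birth_date: str) -> str:
    """Build a minimal FHIR R4 transaction bundle with one Patient.
    The identifier system URI is illustrative, not authoritative."""
    bundle = {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [{
            "resource": {
                "resourceType": "Patient",
                "identifier": [{
                    "system": "https://healthid.ndhm.gov.in",  # assumed ABHA system
                    "value": abha_id,
                }],
                "name": [{"text": name}],
                "birthDate": birth_date,
            },
            "request": {"method": "POST", "url": "Patient"},
        }],
    }
    return json.dumps(bundle)
```

POSTing this payload to the server's base URL (localhost:8080/fhir in the sandbox) creates the Patient; the nightly Spark job then extracts it into iceberg.silver.fhir_* tables.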

Every Service.
Observable.

The sandbox includes a live "Under the Hood" dashboard — a React UI showing all nine service statuses, the animated data flow, an embedded Trino SQL explorer, and a pipeline DAG view. Here's a preview of what it shows.

Service Health Monitor
minio
RELEASE.2024-01
:9000 · :9001
hapi-fhir
7.2.0
:8080
kafka + zookeeper
3.7.0
:9092
apache-nifi
2.0.0
:8443
spark-master
3.5.1
:8888 · :4040
airflow-webserver
2.9.2
:8085
trino
443
:8082
superset
4.0.1
:8088
mlflow
2.13.0
:5000
Airflow DAG Monitor
yajur_diabetes_surveillance
@daily
✓ 2h ago
yajur_maternal_gap_tracker
@daily
✓ 2h ago
yajur_respiratory_aqi_sync
@hourly
✓ 42m ago
yajur_dengue_early_warning
@daily
✓ 2h ago
yajur_readmission_scoring
@daily
⟳ running
yajur_fhir_to_iceberg_sync
@hourly
✓ 18m ago
yajur_dbt_gold_layer_refresh
@daily
✓ 2h ago
yajur_mimic_initial_load
@once
✓ completed

Six Phases.
One Working Lakehouse.

The sandbox is built and delivered in six sequential phases. Each phase has a concrete deliverable that can be demonstrated independently. Together they compose the complete Yajur Healthcare Data Lakehouse reference implementation.

1
Current · Phase 1
Sandbox Navigator
Week 1
This interactive guide — the mission control for the entire sandbox. Architecture overview, component documentation, use case SQL library, and implementation roadmap. Shareable with hospital CTOs and investors as a standalone demo.
yajur-sandbox-navigator.html
2
✓ Complete
Docker Compose Stack
Week 1–2
Complete docker-compose.yml for all 14 services — PostgreSQL, MinIO, Nessie, HAPI FHIR, Kafka, NiFi, Spark, Airflow, Trino, Superset, MLflow, Jupyter. Pre-configured with env variables, volume mounts, health checks, and boot ordering. One command to launch. Open runbook →
docker-compose.yml · bootstrap.sh · 6 config files
3
✓ Complete
Data Generation Layer
Week 2
Four data loaders: generate_synthetic.py (50k Indian patients), load_mimic_demo.py (no credentials required), load_fhir.py (FHIR R4 bundles into HAPI), and generate_clinical_notes.py (Claude API for discharge summaries). Master bootstrap-data.sh orchestrates all four. Open runbook →
bootstrap-data.sh + 4 loaders
4
✓ Complete
dbt Models + Airflow DAGs
Week 2–3
Full dbt project: staging models (bronze), intermediate transforms (silver), and 5 analytical marts (gold) — one per use case. Eight Airflow DAGs including nightly HIS sync, FHIR extraction, dbt run, readmission scoring, and CPCB AQI ingestion. Full lineage documentation. Open runbook →
dbt/ + dags/ directories
5
✓ Complete
"Under the Hood" Dashboard
Week 3
Live infrastructure dashboard embedded in the Navigator: service health monitor with boot sequence animation (9 services), animated SVG data flow diagram with particle effects, SQL query explorer with mock Trino results for all 5 use cases, and Airflow DAG view with expandable task graphs and Gantt history. Open dashboard →
Embedded in yajur-sandbox-navigator.html
6
Phase 6
Implementation Guide
Week 3–4
Full Word + PDF documentation: architecture decision records, component configuration reference, PhysioNet/MIMIC-IV dataset access guide (credentialed and demo paths), Superset dashboard setup walkthrough, MLflow model training notebook guide, and production hardening checklist for hospital IT teams.
Yajur-Sandbox-Guide-v1.docx
// Phase 5 Complete

The lakehouse has
eyes on itself.

The Under the Hood dashboard is live — service health, animated data flow, SQL explorer, and DAG monitor in one place. Switch to Phase 5 to see every component of the lakehouse in motion.

Read the Article ↗