The Sandbox Navigator
for Hospital Intelligence
// Tier 3 · 300+ bed · Apache Iceberg · NiFi · Kafka · Spark · Trino · Superset · MLflow · HAPI FHIR
Three Layers.
One Unified Platform.
The Yajur Sandbox implements the full Tier 3 stack from the reference architecture: nine open source components assembled into a coherent, clinically purposeful data lakehouse. Every piece is containerised, every interface is documented, every data flow is traceable.
From Source to Clinical Signal
Every data point entering the lakehouse follows a deterministic path. NiFi extracts, Kafka streams, Spark processes, Iceberg persists, Trino queries, Superset renders. The flow is observable, auditable, and reproducible.
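The traceability claim can be made concrete with a toy sketch: each hop stamps its name onto the record's lineage, so the route a value took into the lakehouse can be read back off the record itself. The stage names are illustrative labels, not calls into the real services.

```python
# Toy model of the NiFi → Kafka → Spark → Iceberg path. Each stage
# appends its name to the record's lineage list; a real deployment would
# carry equivalent provenance metadata through Kafka headers and
# Iceberg snapshot history.
def run_stage(record: dict, stage: str) -> dict:
    record.setdefault("lineage", []).append(stage)
    return record

record = {"patient_id": "p1", "spo2": 97}
for stage in ["nifi:extract", "kafka:stream", "spark:process", "iceberg:persist"]:
    record = run_stage(record, stage)

print(record["lineage"])
# ['nifi:extract', 'kafka:stream', 'spark:process', 'iceberg:persist']
```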
- HAPI FHIR → Spark extracts Patient, Condition, Observation resources → written as Parquet into Iceberg tables → joinable with HIS data via ABHA ID
- MIMIC-IV and MC-MED datasets loaded via load_mimic.py → schema mapped to hospital canonical model → Delta/Iceberg tables for ICU, labs, diagnoses, prescriptions
- generate_synthetic.py creates 50k patient records with Indian demographics, ICD-10 diagnoses, pin codes → Claude API generates clinical notes → all joined in the bronze layer

Nine Components.
Zero Vendor Lock-in.
Every service in the sandbox is open source, containerised, and pre-configured for the Yajur reference architecture. Click any component to see its role, configuration, and integration points.
- localhost:9001 via the web console.
- Topics opd-visits, lab-results, fhir-bundles, iot-vitals: Spark Structured Streaming consumes these topics and writes to Iceberg. Simulates real-time ward data flow.
- dbt models stg_patients, stg_opd_visits, fct_admissions, dim_diagnoses, mart_population_health. Full lineage graph included.
- readmission_model.ipynb walks through data prep, feature engineering, logistic regression training, and MLflow logging end to end.
- dev and main branches pre-configured.

Five Questions.
Five Dashboards.
Each use case from the article is implemented as a complete data pipeline: NiFi flow → Kafka topic → Spark job → Iceberg table → dbt model → Trino query → Superset dashboard. Click to expand each use case and see the SQL.
deck.gl geospatial layer connected to Trino.

-- Chronic disease burden by pin code (diabetes, hypertension)
SELECT
p.pin_code,
COUNT(DISTINCT v.patient_id) AS patients,
COUNT(*) AS total_visits,
ROUND(
COUNT(*) * 100.0 /
SUM(COUNT(*)) OVER (), 2
) AS pct_of_total
FROM iceberg.gold.fct_opd_visits v
JOIN iceberg.gold.dim_patients p
ON v.patient_id = p.patient_id
WHERE (
v.icd10_code LIKE 'E11%' -- T2 Diabetes
OR v.icd10_code LIKE 'I10%' -- Hypertension
)
AND v.visit_date >= current_date - INTERVAL '12' MONTH
GROUP BY p.pin_code
ORDER BY patients DESC;
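The SUM(COUNT(*)) OVER () window turns each pin code's visit count into a share of all visits. The same arithmetic, checked in a quick Python sketch (pin codes and visit rows are hypothetical):

```python
from collections import Counter

# Toy stand-in for fct_opd_visits: (pin_code, patient_id) pairs for
# chronic-disease visits.
visits = [
    ("110001", "p1"), ("110001", "p2"), ("110001", "p1"),
    ("110002", "p3"), ("110002", "p3"),
    ("110003", "p4"),
]

visit_counts = Counter(pin for pin, _ in visits)   # COUNT(*) per pin code
total = sum(visit_counts.values())                 # SUM(COUNT(*)) OVER ()
pct = {pin: round(n * 100.0 / total, 2) for pin, n in visit_counts.items()}

print(pct)  # {'110001': 50.0, '110002': 33.33, '110003': 16.67}
```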

-- ANC follow-up: mothers overdue for an antenatal visit
SELECT
p.patient_name,
p.mobile_number,
p.pin_code,
a.edd AS expected_delivery,
MAX(v.visit_date) AS last_anc_visit,
date_diff('day', MAX(v.visit_date), current_date) AS days_overdue
FROM iceberg.gold.dim_patients p
JOIN iceberg.gold.fct_anc_registrations a
ON p.patient_id = a.patient_id
JOIN iceberg.gold.fct_opd_visits v
ON p.patient_id = v.patient_id
AND v.visit_type = 'ANC'
WHERE a.status = 'active'
GROUP BY p.patient_name, p.mobile_number,
p.pin_code, a.edd
HAVING MAX(v.visit_date) < current_date
- INTERVAL '28' DAY
ORDER BY days_overdue DESC;
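The days_overdue arithmetic and the 28-day cut-off can be sanity-checked in plain Python (the dates are hypothetical):

```python
from datetime import date, timedelta

# A patient whose last ANC visit was more than 28 days ago should
# surface on the outreach worklist.
last_anc_visit = date(2024, 1, 10)
today = date(2024, 3, 1)

days_overdue = (today - last_anc_visit).days               # analogous to date_diff
is_overdue = last_anc_visit < today - timedelta(days=28)   # the HAVING condition

print(days_overdue, is_overdue)  # 51 True
```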

-- Weekly respiratory admissions alongside ambient AQI
SELECT
date_trunc('week', a.admission_date) AS week,
COUNT(*) AS respiratory_admissions,
AVG(q.aqi_value) AS avg_aqi,
MAX(q.aqi_value) AS peak_aqi
FROM iceberg.gold.fct_admissions a
JOIN iceberg.gold.dim_diagnoses d
ON a.primary_diagnosis_id = d.diagnosis_id
LEFT JOIN iceberg.silver.ext_aqi_daily q
ON CAST(a.admission_date AS DATE) = q.aqi_date
WHERE d.icd10_chapter = 'J'
-- Chapter J: Respiratory system
AND a.admission_date >= current_date
- INTERVAL '24' MONTH
GROUP BY 1
ORDER BY week;
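Once the weekly pairs are out of Trino, a correlation between admissions and AQI is a one-liner downstream. A stdlib-only Pearson sketch over hypothetical weekly aggregates shaped like the query output:

```python
from math import sqrt

# Hypothetical weekly aggregates: respiratory admissions per week
# alongside that week's average AQI.
admissions = [12, 18, 9, 22]
avg_aqi = [160, 240, 120, 310]

n = len(admissions)
mx = sum(admissions) / n
my = sum(avg_aqi) / n
cov = sum((x - mx) * (y - my) for x, y in zip(admissions, avg_aqi))
r = cov / sqrt(
    sum((x - mx) ** 2 for x in admissions) * sum((y - my) ** 2 for y in avg_aqi)
)

print(round(r, 3))  # 0.998 — strong positive association in this toy data
```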

-- Fever clusters with a thrombocytopenia signal (dengue watch)
WITH weekly_signals AS (
SELECT
p.pin_code,
date_trunc('week', v.visit_date) AS week,
COUNT(*) AS fever_cases,
COUNT(CASE WHEN
l.platelet_count < 100000
THEN 1 END) AS thrombocytopenia_cases
FROM iceberg.gold.fct_opd_visits v
JOIN iceberg.gold.dim_patients p ON v.patient_id = p.patient_id
LEFT JOIN iceberg.gold.fct_lab_results l ON v.visit_id = l.visit_id
WHERE lower(v.presenting_complaint) LIKE '%fever%' -- Trino has no ILIKE
GROUP BY 1, 2
)
SELECT *, CASE
WHEN thrombocytopenia_cases >= 5 THEN '🔴 ALERT'
WHEN thrombocytopenia_cases >= 2 THEN '🟡 WATCH'
ELSE '🟢 NORMAL'
END AS alert_level
FROM weekly_signals
ORDER BY week DESC, fever_cases DESC;
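The CASE expression's thresholds are the part worth unit-testing outside SQL. A direct Python mirror (emoji markers dropped for brevity):

```python
def alert_level(thrombocytopenia_cases: int) -> str:
    """Mirror of the CASE expression in the surveillance query."""
    if thrombocytopenia_cases >= 5:
        return "ALERT"
    if thrombocytopenia_cases >= 2:
        return "WATCH"
    return "NORMAL"

print(alert_level(6), alert_level(3), alert_level(0))  # ALERT WATCH NORMAL
```

Keeping the tiering logic in one tested function makes it easy to tune the cut-offs before pushing them back into the dbt model.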

-- Today's high-readmission-risk discharges
SELECT
p.patient_name,
p.patient_id,
d.discharge_date,
d.primary_diagnosis_display,
r.risk_score,
r.risk_tier,
r.top_risk_factor
FROM iceberg.gold.fct_discharges d
JOIN iceberg.gold.dim_patients p
ON d.patient_id = p.patient_id
JOIN iceberg.gold.fct_readmission_scores r
ON d.discharge_id = r.discharge_id
WHERE d.discharge_date = current_date
AND r.risk_tier = 'HIGH'
ORDER BY r.risk_score DESC;
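The fct_readmission_scores table is fed by the model trained in readmission_model.ipynb. A minimal sketch of how a logistic model turns features into risk_score and risk_tier — the features, weights, and tier cut-offs here are illustrative assumptions, not the trained coefficients:

```python
import math

# Hypothetical coefficients; the real ones come from the trained
# logistic regression logged to MLflow.
WEIGHTS = {"prior_admissions": 0.8, "length_of_stay": 0.1, "age_over_65": 0.6}
BIAS = -3.0

def risk_score(features: dict) -> float:
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

def risk_tier(score: float) -> str:
    # Assumed cut-offs for illustration only.
    return "HIGH" if score >= 0.6 else "MEDIUM" if score >= 0.3 else "LOW"

s = risk_score({"prior_admissions": 3, "length_of_stay": 9, "age_over_65": 1})
print(round(s, 3), risk_tier(s))  # 0.711 HIGH
```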
Four Data Streams.
One Unified Schema.
The sandbox ingests from four distinct sources — each reflecting a real-world hospital data modality. The bootstrap-data.sh script orchestrates all four loaders in sequence, normalising into a canonical hospital data model before the first Iceberg write.
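As a rough illustration of the synthetic stream, a record emitted by a loader in the spirit of generate_synthetic.py might look like this. The field names, pin codes, and ICD-10 weighting below are assumptions for the sketch, not the script's actual schema:

```python
import random

# A repeated-entries list is a crude way to weight draws toward the
# conditions the text names (diabetes, hypertension, typhoid, dengue, TB).
ICD10_WEIGHTED = ["E11.9", "E11.9", "I10", "I10", "A01.0", "A90", "A15.0"]
PIN_CODES = ["110001", "400001", "560001", "600001"]

random.seed(7)  # reproducible runs

def make_patient(i: int) -> dict:
    return {
        "patient_id": f"SYN{i:06d}",
        "pin_code": random.choice(PIN_CODES),
        "age": random.randint(0, 90),
        "sex": random.choice(["M", "F"]),
        "icd10_code": random.choice(ICD10_WEIGHTED),
    }

patients = [make_patient(i) for i in range(1000)]
print(len(patients))  # 1000
```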
- Research datasets loaded via load_mimic.py.
- generate_synthetic.py with Indian demographics, pin codes, ICD-10 diagnoses weighted to Indian epidemiology (diabetes, TB, typhoid, dengue). Clinical notes generated via Claude API with realistic discharge summary templates.
- load_fhir.py: Patient, Condition, Observation (vitals, labs), MedicationRequest, DiagnosticReport. Loaded into HAPI FHIR server. Extracted nightly by Spark and written to iceberg.silver.fhir_* tables.

Every Service.
Observable.
The sandbox includes a live "Under the Hood" dashboard — a React UI showing all nine service statuses, the animated data flow, an embedded Trino SQL explorer, and a pipeline DAG view. Here's a preview of what it shows.
Six Phases.
One Working Lakehouse.
The sandbox is built and delivered in six sequential phases. Each phase has a concrete deliverable that can be demonstrated independently. Together they compose the complete Yajur Healthcare Data Lakehouse reference implementation.
- docker-compose.yml for all 14 services, including PostgreSQL, MinIO, Nessie, HAPI FHIR, Kafka, NiFi, Spark, Airflow, Trino, Superset, MLflow, and Jupyter. Pre-configured with env variables, volume mounts, health checks, and boot ordering. One command to launch. Open runbook →
- Data loaders: generate_synthetic.py (50k Indian patients), load_mimic_demo.py (no credentials required), load_fhir.py (FHIR R4 bundles into HAPI), and generate_clinical_notes.py (Claude API for discharge summaries). Master bootstrap-data.sh orchestrates all four. Open runbook →

The lakehouse has
eyes on itself.
The Under the Hood dashboard is live — service health, animated data flow, SQL explorer, and DAG monitor in one place. Switch to Phase 5 to see every component of the lakehouse in motion.