Composable Synthetic Data Engine: Architecture, Stack & Delivery
An engine that uses metadata models, ERDs, and profiling rules to generate composable synthetic datasets. Built with Python, metadata models, and ERD tooling.
By Yogendra Raghuvanshi
Introduction
In this article I break down how I designed and delivered the Composable Synthetic Data Engine, from the original business pain point through architecture, technology choices, implementation phases, and lessons learned. This is the same project featured in my portfolio's Built Solutions section, documented here in full technical depth for engineers, architects, and hiring managers who want to understand how the work was actually done.
I led this initiative as part of my broader program delivery work across enterprise AI, data platforms, and analytics transformation. The approach reflects how I operate: start with the business outcome, choose the minimum viable architecture, instrument everything, and iterate with real users.
Business problem
QA, testing, and LLM training needed realistic data without production exposure.
To solve this, I designed an engine that uses metadata models, ERDs, and profiling rules to generate composable synthetic datasets.
Architecture decisions
Three design choices shaped the reliability, performance, and maintainability of the solution:
- Profiling drives distributions (dates, enums, numeric ranges)
- Referential integrity enforced across related tables
- No production data is copied, only statistical fingerprints
Technology stack in depth
This project was built with Python, metadata models, and ERD tooling. Each technology was selected for a specific role in the architecture: not because it was trendy, but because it solved a measured bottleneck.
- Python: implementation language for the profiler jobs, generator library, and FK resolver
- Metadata models: the metadata schema and constraint map that describe tables, columns, and their relationships
- ERD tooling: the ERD import that supplies table structure and relationships to the metadata schema
Each is a production component with documented integration patterns and operational runbooks.
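To make the role of the metadata model concrete, here is a minimal sketch of what such a model could look like in Python. The class names (`TableSpec`, `ColumnSpec`, `ForeignKey`), the field names, and the `kind` vocabulary are illustrative assumptions, not the engine's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ForeignKey:
    """A child column that must reference an existing parent key."""
    column: str
    parent_table: str
    parent_column: str


@dataclass
class ColumnSpec:
    """One column plus the profile fields its generator needs."""
    name: str
    kind: str  # assumed vocabulary: "enum", "date", or "numeric"
    profile: dict = field(default_factory=dict)


@dataclass
class TableSpec:
    """One table as described by the ERD import and the profiler."""
    name: str
    columns: tuple[ColumnSpec, ...]
    primary_key: str
    foreign_keys: tuple[ForeignKey, ...] = ()

    def parents(self) -> set[str]:
        """Tables that must be generated before this one."""
        return {fk.parent_table for fk in self.foreign_keys}


# Hypothetical two-table schema used only for illustration.
customers = TableSpec(
    name="customers",
    columns=(
        ColumnSpec("customer_id", "numeric"),
        ColumnSpec("segment", "enum", {"frequencies": {"retail": 0.7, "business": 0.3}}),
    ),
    primary_key="customer_id",
)
orders = TableSpec(
    name="orders",
    columns=(ColumnSpec("order_id", "numeric"), ColumnSpec("customer_id", "numeric")),
    primary_key="order_id",
    foreign_keys=(ForeignKey("customer_id", "customers", "customer_id"),),
)
```

Keeping foreign keys as data on the table makes the generation order derivable from the model itself, which is what `parents()` exposes.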
Implementation timeline
Delivery followed phased milestones with explicit deliverables at each gate. This kept stakeholders aligned and made progress auditable for program reviews.
- Metadata ingestion (2 weeks): ERD import and column profiling from sample extracts. Deliverables: metadata schema, profiler jobs, constraint map.
- Generator core (3 weeks): Composable generators per column type with referential integrity. Deliverables: generator library, FK resolver, volume controls.
- Consumer integrations (2 weeks): QA automation and LLM dataset export formats. Deliverables: CSV/Parquet export, CI integration, usage guide.
Metadata-driven generation
The engine ingests ERDs and column profiles from sample extracts — never full production copies. Profiling captures distributions (date ranges, enum frequencies, numeric min/max) that drive realistic generators per column type.
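As an illustration of the profiling step, the sketch below reduces a small sample extract to the three kinds of statistics named above. The function names, the record layout, and the sample values are assumptions made up for this example; the real profiler jobs may capture more than this.

```python
from collections import Counter
from datetime import date


def profile_enum(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {"frequencies": {v: c / total for v, c in counts.items()}}


def profile_range(values):
    """Min and max only; works for numbers and dates alike."""
    return {"min": min(values), "max": max(values)}


# Hypothetical sample extract (not production data).
sample_extract = [
    {"status": "open", "amount": 12.5, "created": date(2023, 1, 4)},
    {"status": "open", "amount": 80.0, "created": date(2023, 3, 19)},
    {"status": "closed", "amount": 41.2, "created": date(2023, 2, 11)},
    {"status": "open", "amount": 5.0, "created": date(2023, 6, 30)},
]

# The fingerprint holds aggregates per column, never the sampled rows.
fingerprint = {
    "status": profile_enum([r["status"] for r in sample_extract]),
    "amount": profile_range([r["amount"] for r in sample_extract]),
    "created": profile_range([r["created"] for r in sample_extract]),
}
```

Only `fingerprint` would be handed to the generators; the rows in `sample_extract` can be discarded once profiling finishes.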
Referential integrity is enforced across related tables via an FK resolver that generates parent rows before children.
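The parent-before-child ordering can be sketched as a topological sort over the table dependency graph. This is one plausible way to implement it, using the standard library's `graphlib` (Python 3.9+); the table names, the `id` and `<parent>_id` column conventions, and the function names are hypothetical.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the parent tables its FKs reference.
dependencies = {
    "customers": set(),
    "orders": {"customers"},
    "order_items": {"orders"},
}


def generation_order(deps):
    """Return table names so every parent precedes its children."""
    return list(TopologicalSorter(deps).static_order())


def generate(deps, rows_per_table, seed=0):
    """Generate rows table by table, drawing FK values from existing parent keys."""
    rng = random.Random(seed)
    tables = {}
    for name in generation_order(deps):
        rows = []
        for i in range(1, rows_per_table[name] + 1):
            row = {"id": i}
            for parent in sorted(deps[name]):
                # Parent rows already exist, so the reference is always valid.
                row[f"{parent}_id"] = rng.choice(tables[parent])["id"]
            rows.append(row)
        tables[name] = rows
    return tables


tables = generate(dependencies, {"customers": 3, "orders": 10, "order_items": 25})
```

A cyclic schema would make `static_order()` raise `graphlib.CycleError`, which is a useful early signal that the constraint map needs manual attention.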
- Column types: enum sampler, date range generator, numeric distribution fit
- Volume controls per table for QA scenarios vs LLM training scale
- Export formats: CSV, Parquet, and JSONL for downstream consumers
- CI integration: synthetic datasets generated on every release candidate
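The points above can be sketched as small generator factories composed per table, with a row count acting as the volume control and a JSONL writer for export. All names and profile values here are assumptions for illustration, and the numeric generator is a simple uniform stand-in between the profiled min and max rather than a fitted distribution. Parquet export is omitted because it would need a third-party library.

```python
import json
import random
from datetime import date, timedelta


def enum_sampler(frequencies):
    """Sample values in proportion to their profiled frequencies."""
    values, weights = zip(*frequencies.items())
    return lambda rng: rng.choices(values, weights=weights, k=1)[0]


def date_range(start, end):
    """Pick a date between the profiled min and max, inclusive."""
    span = (end - start).days
    return lambda rng: start + timedelta(days=rng.randint(0, span))


def numeric_uniform(low, high):
    """Stand-in for a distribution fit: uniform between profiled bounds."""
    return lambda rng: round(rng.uniform(low, high), 2)


def make_table(column_generators, rows, seed=0):
    """Compose per-column generators into `rows` synthetic records."""
    rng = random.Random(seed)
    return [{col: gen(rng) for col, gen in column_generators.items()} for _ in range(rows)]


def to_jsonl(rows):
    """Serialize rows as JSON Lines; dates are written as ISO strings."""
    return "\n".join(json.dumps(r, default=str) for r in rows)


# Hypothetical profile-driven generators for one table.
generators = {
    "status": enum_sampler({"open": 0.75, "closed": 0.25}),
    "created": date_range(date(2023, 1, 4), date(2023, 6, 30)),
    "amount": numeric_uniform(5.0, 80.0),
}

# Volume control: the same generators serve a small QA set or a larger training set.
volumes = {"qa": 50, "llm_training": 5000}
qa_rows = make_table(generators, volumes["qa"], seed=1)
qa_jsonl = to_jsonl(qa_rows)
```

Because each generator only takes a random source, swapping the uniform stand-in for a fitted distribution would not change `make_table` or the export code.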
Privacy and compliance
No production data is copied — only statistical fingerprints. This satisfied security review for QA environments and LLM fine-tuning datasets without exposure risk.
Business outcomes
The engine enabled safer testing and richer LLM training datasets.
Success was measured against adoption, latency/throughput targets, and stakeholder feedback — not just deployment dates. Program reviews tracked these KPIs alongside technical milestones.
Lessons learned
Profiling rules must mirror production distributions, not random fills.
If I were starting again, I would invest even earlier in observability and golden test sets. The cost of retrofitting guardrails after pilot launch always exceeds building them in from day one.
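One way to turn the lesson about mirroring distributions into an automated check is to compare generated data against the profiled frequencies. The sketch below uses total variation distance on an enum column; the metric choice, the example frequencies, and the sample size are my assumptions for illustration, not the checks the project actually ran.

```python
import random
from collections import Counter


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def observed_frequencies(values):
    """Relative frequency of each value in a generated column."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}


# Hypothetical profiled frequencies for a status column.
profiled = {"open": 0.75, "closed": 0.25}
rng = random.Random(42)

# Profile-driven generation versus a naive random fill over the same values.
weighted = rng.choices(list(profiled), weights=list(profiled.values()), k=10_000)
random_fill = [rng.choice(list(profiled)) for _ in range(10_000)]

tv_weighted = total_variation(profiled, observed_frequencies(weighted))
tv_random = total_variation(profiled, observed_frequencies(random_fill))
```

With these inputs the profile-driven column stays close to the profile while the random fill drifts toward a 50/50 split, so a threshold on the distance could serve as a simple golden-test gate.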