Data Analysis & SQL

Generate Realistic Synthetic Test Data

Designs schema-aware synthetic data with realistic distributions and referential integrity for testing analytics.

Role-Based · Step-by-Step · Structured-Output

Prompt

ROLE: You are a data engineer who generates realistic synthetic datasets for testing pipelines and dashboards.

CONTEXT: I need synthetic data for these tables: [SCHEMA_DDL]. Relationships and cardinalities: [RELATIONSHIPS] (e.g., each customer has 0-N orders). Realism requirements: [DISTRIBUTIONS] (e.g., revenue is right-skewed, 5% refunds, weekly seasonality). Volume: [ROW_COUNTS]. Tooling: [SQL generator / Python].

TASK:
1. Plan the generation order so foreign keys always reference existing parents (parents before children).
2. For each column, specify the distribution and constraints (ranges, enums, NULL rate, uniqueness) that make the data realistic, not uniform-random.
3. Provide runnable code ([SQL] using generate_series/recursive CTE or [Python] with a seeded RNG) to produce each table.
4. Embed at least 3 deliberate edge cases (e.g., a parent with zero children to exercise orphan prevention, a heavy-tail outlier, a seasonal pattern) so tests are meaningful.
5. Include verification queries that confirm referential integrity and the intended distributions.

OUTPUT FORMAT: Generation order -> Per-column spec table -> Generation code -> Embedded edge cases -> Verification queries.

CONSTRAINTS: Use a fixed random seed for reproducibility. Respect all foreign keys and uniqueness constraints. Make distributions realistic (skew, seasonality), not flat uniform. Never include real PII; everything must be fabricated.
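
Example of the kind of output this prompt aims for: a minimal Python sketch, not a definitive implementation. It assumes a hypothetical two-table schema (customers and orders, borrowed from the "each customer has 0-N orders" example above); the column names, row counts, distribution parameters, and the helper names generate and verify are all illustrative assumptions. It uses only the standard library, builds parents before children, draws skewed amounts, thins weekend orders, and appends one deliberate outlier.

```python
import random
from datetime import date, timedelta

SEED = 42  # fixed seed so every run yields identical rows


def generate(n_customers=200, seed=SEED):
    """Build hypothetical customers (parents) first, then orders (children)."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)  # a Monday; 364 days = 52 whole weeks

    customers = [
        {"customer_id": i,
         "segment": rng.choices(["consumer", "business"], weights=[0.8, 0.2])[0]}
        for i in range(1, n_customers + 1)
    ]

    orders = []
    for c in customers:
        # 0-N orders per customer, skewed so most customers have few or none.
        n_orders = min(int(rng.expovariate(1 / 3)), 40)
        for _ in range(n_orders):
            day = start + timedelta(days=rng.randrange(364))
            # Weekly seasonality: drop about half of the weekend orders.
            if day.weekday() >= 5 and rng.random() < 0.5:
                continue
            orders.append({
                "order_id": len(orders) + 1,
                "customer_id": c["customer_id"],  # always an existing parent
                "order_date": day,
                "amount": round(rng.lognormvariate(3.5, 0.9), 2),  # right-skewed
                "refunded": rng.random() < 0.05,  # roughly 5% refunds
            })

    # Deliberate edge case: one heavy-tail outlier on an existing customer.
    orders.append({"order_id": len(orders) + 1, "customer_id": 1,
                   "order_date": start, "amount": 250_000.00, "refunded": False})
    return customers, orders


def verify(customers, orders):
    """Check referential integrity and the intended distribution shapes."""
    parent_ids = {c["customer_id"] for c in customers}
    amounts = sorted(o["amount"] for o in orders)
    weekend = sum(1 for o in orders if o["order_date"].weekday() >= 5)
    return {
        "orphans": sum(1 for o in orders if o["customer_id"] not in parent_ids),
        "customers_without_orders":
            len(parent_ids - {o["customer_id"] for o in orders}),
        "mean_amount": sum(amounts) / len(amounts),
        "median_amount": amounts[len(amounts) // 2],
        "refund_rate": sum(o["refunded"] for o in orders) / len(orders),
        "orders_per_weekend_day": weekend / 2,
        "orders_per_weekday": (len(orders) - weekend) / 5,
    }


if __name__ == "__main__":
    customers, orders = generate()
    for name, value in verify(customers, orders).items():
        print(f"{name}: {value}")
```

In this sketch a mean amount well above the median indicates the right skew, and fewer orders per weekend day than per weekday indicates the weekly pattern. In a SQL workflow the same checks would be written as queries against the generated tables.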

Recommended models

claude · gpt-4o · gemini
