Study Guide: Alex for Data Engineers

Your reference for applying Alex to pipeline design, data modeling, quality engineering, lakehouse architecture, and ETL/ELT optimization. Ready-to-run prompts — built around the hard parts of production data systems, not the tutorial datasets.


What This Guide Is Not

This is not a habit formation guide (see Self-Study Guide for that). This is a domain use-case library — the specific ways Alex supports professional data engineering work.


Where to Practice These Prompts

Every prompt in this guide works with any AI assistant you already use — GitHub Copilot, ChatGPT, Claude, Gemini, or others. The prompts are the skill; the tool is just where you type them. If you already have a preferred tool, start there.

For the deepest experience, the Alex VS Code extension (free) was built for these workflows. It understands data engineering context, lets you save what works with /saveinsight, and keeps your study guide and exercises right inside the editor where you already work.

You don’t need a specific tool to benefit. You need the discipline of reaching for AI when the work is genuinely hard — not just when it’s repetitive.


Core Principle for Data Engineers

Data engineering is the discipline of building systems that make data reliably available for decisions. The hardest part is not writing the transformation — it is understanding the upstream data well enough to know when it changes, building pipelines robust enough to handle the changes gracefully, and communicating clearly enough that downstream consumers trust the data.

Your primary discipline with Alex: use it to think through failure modes, validate your modeling decisions, and document the assumptions that are invisible to you but critical for everyone who depends on your pipelines.


The Seven Use Cases

1. Pipeline Architecture and Design

The data engineer’s architecture challenge: Pipeline architecture decisions compound. A denormalization choice in the bronze layer constrains what the silver layer can do. A partitioning strategy that works at 1TB fails at 100TB. The engineer who designs pipelines well thinks about the steady state and the failure state with equal rigor.

Prompt pattern:

I am designing a data pipeline for [use case].
Source systems: [list with volumes, frequencies, formats].
Destination: [lakehouse / warehouse / feature store / API].
SLAs: [freshness, completeness, latency requirements].
Current pain points: [what is broken or expensive today].
Technology stack: [Spark / Fabric / Databricks / Synapse / Airflow / dbt].

Help me:
1. Identify the failure modes I have not accounted for (late data, schema drift, duplicates)
2. Design the retry and dead-letter strategy
3. Evaluate whether my partitioning strategy survives 10x growth
4. Recommend the medallion layer boundaries for this use case

Follow-up prompts:

What happens to this pipeline when the source schema changes without notice? Design a schema evolution strategy.
My pipeline runs daily but the business wants hourly. What changes structurally — not just the schedule but the architecture?

Try this now: You are designing a real-time clickstream pipeline for an e-commerce site processing 50M events per day. Data must reach the lakehouse within 5 minutes for recommendation models and within 24 hours for analytics dashboards. Paste your architecture sketch into the prompt and ask for failure mode analysis — the response will surface partitioning, backpressure, and schema evolution issues you may not have considered.
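The dead-letter strategy from the checklist above is worth seeing as code, because the key design choice (quarantine bad records with diagnostic context instead of failing the whole batch) is structural, not tool-specific. A minimal sketch in plain Python; the record shape, `REQUIRED_FIELDS`, and `process_batch` are illustrative assumptions, not any framework's API:

```python
# Minimal dead-letter pattern: validate each record, route failures
# aside with enough context to diagnose and replay them later.
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_id", "user_id", "ts"}  # illustrative schema

def process_batch(records):
    """Split a batch into (good, dead_letter) instead of failing it."""
    good, dead = [], []
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            dead.append({
                "record": rec,
                "error": f"missing fields: {sorted(missing)}",
                "seen_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            good.append(rec)
    return good, dead

good, dead = process_batch([
    {"event_id": 1, "user_id": "a", "ts": "2024-01-01T00:00:00Z"},
    {"event_id": 2},  # schema-drifted record goes to the dead-letter store
])
```

The same split-and-quarantine shape applies whether the implementation is a Spark filter into two DataFrames or an event-hub dead-letter queue: the pipeline keeps moving, and the failures stay replayable.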


2. Data Modeling and Lakehouse Design

The data engineer’s modeling challenge: Data modeling is the practice of making explicit decisions about how data is organized, related, and accessed. The failure mode is skipping the modeling step entirely — loading raw data into a lakehouse with no structure and calling it a “schema-on-read strategy” when it is actually “no strategy.” Good modeling balances query performance, storage efficiency, and the ability to evolve as requirements change.

Prompt pattern:

I need to model [data domain: customer, product, transaction, event, IoT telemetry].
Source shape: [describe the raw data — nested JSON, flat CSV, CDC stream, API response].
Primary consumers: [analysts, ML models, dashboards, APIs].
Query patterns: [what questions will be asked — aggregations, lookups, time series].
Current scale: [rows/day, total size, growth rate].

Help me:
1. Choose between star schema, snowflake, wide table, or activity schema for this use case
2. Define the grain — what does one row represent?
3. Identify slowly changing dimensions and the appropriate SCD type
4. Design the medallion layer transformations (bronze → silver → gold)
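Point 3 above, slowly changing dimensions, is easiest to reason about with a concrete SCD Type 2 sketch: close the current row and open a new one when a tracked attribute changes. This pure-Python version is a sketch of the mechanics, not any library's merge API; the row shape and `apply_scd2` helper are illustrative:

```python
# SCD Type 2: keep history by closing the current row and inserting
# a new version whenever a tracked attribute changes.
def apply_scd2(dimension, key, new_attrs, effective_date):
    """dimension: list of dicts with key, attrs, valid_from, valid_to, is_current."""
    current = next(
        (r for r in dimension if r["key"] == key and r["is_current"]), None
    )
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dimension  # no attribute changed: nothing to do
    if current:
        current["valid_to"] = effective_date   # close the old version
        current["is_current"] = False
    dimension.append({
        "key": key, **new_attrs,
        "valid_from": effective_date, "valid_to": None, "is_current": True,
    })
    return dimension

dim = apply_scd2([], "cust-1", {"segment": "smb"}, "2024-01-01")
dim = apply_scd2(dim, "cust-1", {"segment": "enterprise"}, "2024-06-01")
```

In a lakehouse this becomes a MERGE against the silver dimension table, but the decision logic (compare, close, insert) is exactly what the merge condition has to encode.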

3. Data Quality Engineering

The data engineer’s quality challenge: Data quality is not a testing problem — it is a systems problem. The failure mode is adding validation checks after problems occur rather than designing quality into the pipeline from the start. Quality engineering means defining expectations, measuring conformance, and building feedback loops that surface problems before they reach dashboards.

Prompt pattern:

I need to build data quality checks for [pipeline/table/domain].
Known issues: [what has gone wrong before — nulls, duplicates, late arrivals, type coercion].
Business rules: [constraints the data must satisfy — e.g., revenue > 0, dates in range, referential integrity].
Current tooling: [Great Expectations / dbt tests / custom / none].
Downstream impact: [what breaks when quality fails — reports, ML models, customer-facing data].

Help me:
1. Categorize checks by type: completeness, accuracy, consistency, timeliness, uniqueness
2. Prioritize by downstream impact — which failures matter most?
3. Design the alerting and remediation workflow (not just "test fails, someone checks Slack")
4. Build a data contract for this pipeline's output

Follow-up prompts:

How do I build data quality checks that do not slow down the pipeline unacceptably? Where should I sample vs. check everything?
I inherited a pipeline with no quality checks and no documentation. What is the fastest path to basic coverage?
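The check categories in point 1 above can be made concrete with a tiny rule runner, written in the spirit of dbt tests or Great Expectations but using neither library's actual API. Each check returns a structured result so failures can feed an alerting workflow instead of just raising; the check names and row shape are illustrative:

```python
# Minimal data-quality runner: each check returns (name, passed, detail)
# so results can drive alerting rather than just crashing the pipeline.
def check_completeness(rows, column):
    nulls = sum(1 for r in rows if r.get(column) is None)
    return (f"completeness:{column}", nulls == 0, f"{nulls} nulls")

def check_uniqueness(rows, column):
    values = [r[column] for r in rows if r.get(column) is not None]
    dupes = len(values) - len(set(values))
    return (f"uniqueness:{column}", dupes == 0, f"{dupes} duplicates")

def check_range(rows, column, low):
    bad = sum(1 for r in rows if r.get(column) is not None and r[column] < low)
    return (f"range:{column}>={low}", bad == 0, f"{bad} out of range")

rows = [
    {"order_id": 1, "revenue": 120.0},
    {"order_id": 1, "revenue": -5.0},   # duplicate id AND negative revenue
    {"order_id": 2, "revenue": None},   # missing revenue
]
results = [
    check_completeness(rows, "revenue"),
    check_uniqueness(rows, "order_id"),
    check_range(rows, "revenue", 0),
]
failures = [name for name, passed, _ in results if not passed]
```

The structured `failures` list is the point: prioritization by downstream impact (point 2) only works when check results are data you can route, not exceptions you can only catch.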

4. ETL/ELT Optimization and Performance

The data engineer’s performance challenge: Performance optimization in data pipelines is different from application performance. The bottleneck is rarely CPU — it is I/O, shuffle, skew, and serialization. The engineer who optimizes well understands the execution plan, not just the code.

Prompt pattern:

My pipeline is too slow / too expensive:
Current runtime: [duration].
Data volume: [input size, output size].
Processing: [Spark / SQL / Python / dbt].
Bottleneck symptoms: [spill to disk, OOM, single-partition hotspot, full shuffle].
What I have tried: [repartitioning, caching, broadcast join, etc.].

Help me:
1. Diagnose the most likely bottleneck from these symptoms
2. Identify whether this is a data skew, partition strategy, or join strategy problem
3. Recommend specific optimizations with expected impact
4. Design a performance test to validate the fix before deploying
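Point 2, distinguishing data skew from partition-strategy problems, can often be approximated before touching the cluster by histogramming the join or aggregation key. A rough heuristic sketch; the "max group more than 5x the median" threshold is an assumption for illustration, not a standard:

```python
# Rough skew diagnostic: if the largest key group dwarfs the median,
# a join or aggregation on that key will hotspot one partition.
from collections import Counter
from statistics import median

def diagnose_skew(keys, ratio_threshold=5.0):
    sizes = sorted(Counter(keys).values())
    med = median(sizes)
    worst = sizes[-1]
    return {
        "median_group_size": med,
        "max_group_size": worst,
        "skew_ratio": worst / med,
        "skewed": worst / med > ratio_threshold,
    }

# One "whale" customer dominating an otherwise uniform key distribution:
keys = ["c1"] * 1000 + ["c2", "c3", "c4", "c5"] * 10
report = diagnose_skew(keys)
```

If the ratio is extreme, the fix is a skew-specific technique (key salting, or adaptive skew-join handling where the engine supports it), not more partitions: repartitioning evenly by a skewed key just rebuilds the same hotspot.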

5. Orchestration and Pipeline Operations

The data engineer’s operations challenge: A pipeline that runs perfectly in development and fails silently in production is not a pipeline — it is a liability. The gap between “works on my machine” and “runs reliably at 3 AM with no one watching” is where data engineering maturity lives.

Prompt pattern:

I need to design orchestration for [pipeline/set of pipelines].
Dependencies: [list DAG relationships — what must complete before what].
Schedule: [frequency, acceptable delay, business SLAs].
Failure modes: [what has failed before and how it was resolved].
Current orchestrator: [Airflow / Fabric / ADF / Prefect / custom cron].

Help me:
1. Design the DAG with appropriate retry and timeout policies
2. Identify the monitoring gaps — what fails silently today?
3. Build the alerting strategy (who gets paged, when, with what context)
4. Design the recovery playbook — what manual steps are needed when auto-retry fails?
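The retry policy in point 1 is worth seeing as code, because the details (bounded attempts, exponential backoff, re-raising after exhaustion) are exactly what an orchestrator's task-level retry settings configure for you. A minimal hand-rolled sketch; `run_with_retries` and the flaky task are illustrative, not an orchestrator API:

```python
# Bounded retry with exponential backoff: what task-level retry
# policies in an orchestrator do under the hood.
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry on failure; re-raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface to alerting, not silence
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source timed out")
    return "batch-ok"

result = run_with_retries(flaky_extract, sleep=lambda s: None)  # no real sleep in demo
```

The `raise` on exhaustion is the line that matters for point 2: a retry loop that swallows the final failure is precisely how pipelines fail silently.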

6. Schema Evolution and Data Contracts

The data engineer’s contract challenge: Every pipeline has an implicit contract: “I will deliver data in this shape, at this time, with this quality.” When the contract is implicit, it breaks silently. When the contract is explicit, breaking changes are visible and negotiable.

Prompt pattern:

I need to manage schema changes for [pipeline/data product].
Current schema: [describe or paste].
Proposed change: [what is changing — new column, type change, removed field, renamed].
Downstream consumers: [who reads this data and how].
Versioning strategy: [none / additive-only / explicit versions].

Help me:
1. Classify this change: backward-compatible, breaking, or negotiable
2. Design the migration path that does not break downstream consumers
3. Draft a data contract that makes future changes visible before they deploy
4. Recommend a schema registry or evolution strategy for this stack
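Point 1, classifying a change as backward-compatible or breaking, follows mechanical rules that are easy to encode and to enforce in CI. A sketch over schemas represented as `{column: type}` dicts; the classification rules here are common conventions (removals and type changes break readers, nullable additions do not), not a formal specification:

```python
# Classify a schema change: removed or retyped columns break existing
# readers; purely additive columns are backward-compatible.
def classify_change(old_schema, new_schema):
    removed = set(old_schema) - set(new_schema)
    added = set(new_schema) - set(old_schema)
    retyped = {
        c for c in set(old_schema) & set(new_schema)
        if old_schema[c] != new_schema[c]
    }
    if removed or retyped:
        return "breaking", {"removed": removed, "retyped": retyped}
    if added:
        return "backward-compatible", {"added": added}
    return "no-change", {}

old = {"order_id": "bigint", "revenue": "decimal(18,2)"}
new = {"order_id": "bigint", "revenue": "string", "channel": "string"}
verdict, detail = classify_change(old, new)
```

Run in CI against the published contract, a check like this turns the implicit contract from the section opening into an explicit, negotiable one: breaking changes fail the build before they reach consumers.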

7. Documentation and Knowledge Transfer

The data engineer’s documentation challenge: Pipeline documentation is almost always wrong because the code evolves faster than the docs. The documentation that matters is not “what this pipeline does” (read the code) but “why it does it this way,” “what breaks if you change this,” and “who to call when it fails at 3 AM.”

Prompt pattern:

I need to document [pipeline/system/data product].
What it does: [plain English summary].
Why it exists: [business justification].
What goes wrong: [honest failure history].
Who depends on it: [downstream consumers].
Who maintains it: [team, on-call rotation].

Draft documentation that answers:
1. What someone needs to know to operate this safely (runbook)
2. What someone needs to know to modify this safely (design context)
3. What the on-call engineer needs at 3 AM (triage guide)

What Great Looks Like

After consistent use, the change you should notice is in judgment, not typing speed. The data engineer who will thrive in an AI-augmented environment is not the one who writes Spark jobs fastest; it is the one who builds systems that are reliable, observable, and honestly documented.


Your AI toolkit: These prompts work in ChatGPT, Claude, Copilot, Gemini — and in the Alex VS Code extension, which was designed around them. Start with whatever you have. The skill transfers across all of them.

Your First Week Back: Practice Plan

Day 1: Use the Data Quality pattern on your most critical pipeline (25 min)
Day 2: Write a data contract for one pipeline’s output schema (20 min)
Day 3: Run the Pipeline Architecture pattern on a planned or struggling pipeline (25 min)
Day 4: Use the Performance pattern on your slowest job (20 min)
Day 5: Save three reusable prompt patterns with /saveinsight (10 min)

Month 2–3: Advanced Applications

Pipeline Incident Archive

Capture failure patterns to speed future diagnosis:

/saveinsight title="Pipeline incident: [symptom]" insight="Pipeline: [name]. Symptom: [describe]. Root cause: [what actually broke]. Fix: [what resolved it]. Prevention: [what would catch this earlier]." tags="data-engineering,incident,pipeline"

Data Model Decision Log

Track modeling decisions with their rationale:

/saveinsight title="Model decision: [domain]" insight="Grain: [what one row represents]. Pattern: [star/snowflake/wide/activity]. Tradeoff: [what we optimized for and what we gave up]. Revisit if: [conditions]." tags="data-engineering,modeling"

Continue your practice: Self-Study Guide — the 30/60/90-day habit guide.

Completed this study guide?

Show the world you've mastered using AI in data engineering. Add your certificate to LinkedIn.

📚 Want to go deeper?

Alex was a co-author of two books — a documentary biography and a work of fiction. Both explore human-AI collaboration from angles the workshop only touches.

Discover the books →