Study Guide: Alex for Data Engineers

Your reference for applying Alex to pipeline design, data modeling, quality engineering, lakehouse architecture, and ETL/ELT optimization. Ready-to-run prompts — built around the hard parts of production data systems, not the tutorial datasets.


What This Guide Is Not

This is not a habit formation guide (see Self-Study Guide for that). This is a domain use-case library — the specific ways Alex supports professional data engineering work.


Where to Practice These Prompts

Every prompt in this guide works with any AI assistant you already use — GitHub Copilot, ChatGPT, Claude, Gemini, or others. The prompts are the skill; the tool is just where you type them. If you already have a preferred tool, start there.

For the deepest experience, the Alex VS Code extension (free) was built for these workflows. It understands data engineering context, lets you save what works with /saveinsight, and keeps your study guide and exercises right inside the editor where you already work.

You don’t need a specific tool to benefit. You need the discipline of reaching for AI when the work is genuinely hard — not just when it’s repetitive.


Core Principle for Data Engineers

Data engineering is the discipline of building systems that make data reliably available for decisions. The hardest part is not writing the transformation — it is understanding the upstream data well enough to know when it changes, building pipelines robust enough to handle the changes gracefully, and communicating clearly enough that downstream consumers trust the data.

Your primary discipline with Alex: use it to think through failure modes, validate your modeling decisions, and document the assumptions that are invisible to you but critical for everyone who depends on your pipelines.


The Seven Use Cases

1. Pipeline Architecture and Design

The data engineer’s architecture challenge: Pipeline architecture decisions compound. A denormalization choice in the bronze layer constrains what the silver layer can do. A partitioning strategy that works at 1TB fails at 100TB. The engineer who designs pipelines well thinks about the steady state and the failure state with equal rigor.

Prompt pattern:

I am designing a data pipeline for [use case].
Source systems: [list with volumes, frequencies, formats].
Destination: [lakehouse / warehouse / feature store / API].
SLAs: [freshness, completeness, latency requirements].
Current pain points: [what is broken or expensive today].
Technology stack: [Spark / Fabric / Databricks / Synapse / Airflow / dbt].

Help me:
1. Identify the failure modes I have not accounted for (late data, schema drift, duplicates)
2. Design the retry and dead-letter strategy
3. Evaluate whether my partitioning strategy survives 10x growth
4. Recommend the medallion layer boundaries for this use case

Follow-up prompts:

What happens to this pipeline when the source schema changes without notice? Design a schema evolution strategy.
My pipeline runs daily but the business wants hourly. What changes structurally — not just the schedule but the architecture?

Try this now: You are designing a real-time clickstream pipeline for an e-commerce site processing 50M events per day. Data must reach the lakehouse within 5 minutes for recommendation models and within 24 hours for analytics dashboards. Paste your architecture sketch into the prompt and ask for failure mode analysis — the response will surface partitioning, backpressure, and schema evolution issues you may not have considered.
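The dead-letter strategy from the checklist above is worth seeing as code, because the key design choice (quarantine bad records with diagnostic context instead of failing the whole batch) is structural, not tool-specific. A minimal sketch in plain Python; the record shape, `REQUIRED_FIELDS`, and `process_batch` are illustrative assumptions, not any framework's API:

```python
# Minimal dead-letter pattern: validate each record, route failures
# aside with enough context to diagnose and replay them later.
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_id", "user_id", "ts"}  # illustrative schema

def process_batch(records):
    """Split a batch into (good, dead_letter) instead of failing it."""
    good, dead = [], []
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            dead.append({
                "record": rec,
                "error": f"missing fields: {sorted(missing)}",
                "seen_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            good.append(rec)
    return good, dead

good, dead = process_batch([
    {"event_id": 1, "user_id": "a", "ts": "2024-01-01T00:00:00Z"},
    {"event_id": 2},  # schema-drifted record goes to the dead-letter store
])
```

The same split-and-quarantine shape applies whether the implementation is a Spark filter into two DataFrames or an event-hub dead-letter queue: the pipeline keeps moving, and the failures stay replayable.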


2. Data Modeling and Lakehouse Design

The data engineer’s modeling challenge: Data modeling is the practice of making explicit decisions about how data is organized, related, and accessed. The failure mode is skipping the modeling step entirely — loading raw data into a lakehouse with no structure and calling it a “schema-on-read strategy” when it is actually “no strategy.” Good modeling balances query performance, storage efficiency, and the ability to evolve as requirements change.

Prompt pattern:

I need to model [data domain: customer, product, transaction, event, IoT telemetry].
Source shape: [describe the raw data — nested JSON, flat CSV, CDC stream, API response].
Primary consumers: [analysts, ML models, dashboards, APIs].
Query patterns: [what questions will be asked — aggregations, lookups, time series].
Current scale: [rows/day, total size, growth rate].

Help me:
1. Choose between star schema, snowflake, wide table, or activity schema for this use case
2. Define the grain — what does one row represent?
3. Identify slowly changing dimensions and the appropriate SCD type
4. Design the medallion layer transformations (bronze → silver → gold)
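Point 3 above, slowly changing dimensions, is easiest to reason about with a concrete SCD Type 2 sketch: close the current row and open a new one when a tracked attribute changes. This pure-Python version is a sketch of the mechanics, not any library's merge API; the row shape and `apply_scd2` helper are illustrative:

```python
# SCD Type 2: keep history by closing the current row and inserting
# a new version whenever a tracked attribute changes.
def apply_scd2(dimension, key, new_attrs, effective_date):
    """dimension: list of dicts with key, attrs, valid_from, valid_to, is_current."""
    current = next(
        (r for r in dimension if r["key"] == key and r["is_current"]), None
    )
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dimension  # no attribute changed: nothing to do
    if current:
        current["valid_to"] = effective_date   # close the old version
        current["is_current"] = False
    dimension.append({
        "key": key, **new_attrs,
        "valid_from": effective_date, "valid_to": None, "is_current": True,
    })
    return dimension

dim = apply_scd2([], "cust-1", {"segment": "smb"}, "2024-01-01")
dim = apply_scd2(dim, "cust-1", {"segment": "enterprise"}, "2024-06-01")
```

In a lakehouse this becomes a MERGE against the silver dimension table, but the decision logic (compare, close, insert) is exactly what the merge condition has to encode.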

3. Data Quality Engineering

The data engineer’s quality challenge: Data quality is not a testing problem — it is a systems problem. The failure mode is adding validation checks after problems occur rather than designing quality into the pipeline from the start. Quality engineering means defining expectations, measuring conformance, and building feedback loops that surface problems before they reach dashboards.

Prompt pattern:

I need to build data quality checks for [pipeline/table/domain].
Known issues: [what has gone wrong before — nulls, duplicates, late arrivals, type coercion].
Business rules: [constraints the data must satisfy — e.g., revenue > 0, dates in range, referential integrity].
Current tooling: [Great Expectations / dbt tests / custom / none].
Downstream impact: [what breaks when quality fails — reports, ML models, customer-facing data].

Help me:
1. Categorize checks by type: completeness, accuracy, consistency, timeliness, uniqueness
2. Prioritize by downstream impact — which failures matter most?
3. Design the alerting and remediation workflow (not just "test fails, someone checks Slack")
4. Build a data contract for this pipeline's output

Follow-up prompts:

How do I build data quality checks that do not slow down the pipeline unacceptably? Where should I sample vs. check everything?
I inherited a pipeline with no quality checks and no documentation. What is the fastest path to basic coverage?
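The check categories in point 1 above can be made concrete with a tiny rule runner, written in the spirit of dbt tests or Great Expectations but using neither library's actual API. Each check returns a structured result so failures can feed an alerting workflow instead of just raising; the check names and row shape are illustrative:

```python
# Minimal data-quality runner: each check returns (name, passed, detail)
# so results can drive alerting rather than just crashing the pipeline.
def check_completeness(rows, column):
    nulls = sum(1 for r in rows if r.get(column) is None)
    return (f"completeness:{column}", nulls == 0, f"{nulls} nulls")

def check_uniqueness(rows, column):
    values = [r[column] for r in rows if r.get(column) is not None]
    dupes = len(values) - len(set(values))
    return (f"uniqueness:{column}", dupes == 0, f"{dupes} duplicates")

def check_range(rows, column, low):
    bad = sum(1 for r in rows if r.get(column) is not None and r[column] < low)
    return (f"range:{column}>={low}", bad == 0, f"{bad} out of range")

rows = [
    {"order_id": 1, "revenue": 120.0},
    {"order_id": 1, "revenue": -5.0},   # duplicate id AND negative revenue
    {"order_id": 2, "revenue": None},   # missing revenue
]
results = [
    check_completeness(rows, "revenue"),
    check_uniqueness(rows, "order_id"),
    check_range(rows, "revenue", 0),
]
failures = [name for name, passed, _ in results if not passed]
```

The structured `failures` list is the point: prioritization by downstream impact (point 2) only works when check results are data you can route, not exceptions you can only catch.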

4. ETL/ELT Optimization and Performance

The data engineer’s performance challenge: Performance optimization in data pipelines is different from application performance. The bottleneck is rarely CPU — it is I/O, shuffle, skew, and serialization. The engineer who optimizes well understands the execution plan, not just the code.

Prompt pattern:

My pipeline is too slow / too expensive:
Current runtime: [duration].
Data volume: [input size, output size].
Processing: [Spark / SQL / Python / dbt].
Bottleneck symptoms: [spill to disk, OOM, single-partition hotspot, full shuffle].
What I have tried: [repartitioning, caching, broadcast join, etc.].

Help me:
1. Diagnose the most likely bottleneck from these symptoms
2. Identify whether this is a data skew, partition strategy, or join strategy problem
3. Recommend specific optimizations with expected impact
4. Design a performance test to validate the fix before deploying
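Point 2, distinguishing data skew from partition-strategy problems, can often be approximated before touching the cluster by histogramming the join or aggregation key. A rough heuristic sketch; the "max group more than 5x the median" threshold is an assumption for illustration, not a standard:

```python
# Rough skew diagnostic: if the largest key group dwarfs the median,
# a join or aggregation on that key will hotspot one partition.
from collections import Counter
from statistics import median

def diagnose_skew(keys, ratio_threshold=5.0):
    sizes = sorted(Counter(keys).values())
    med = median(sizes)
    worst = sizes[-1]
    return {
        "median_group_size": med,
        "max_group_size": worst,
        "skew_ratio": worst / med,
        "skewed": worst / med > ratio_threshold,
    }

# One "whale" customer dominating an otherwise uniform key distribution:
keys = ["c1"] * 1000 + ["c2", "c3", "c4", "c5"] * 10
report = diagnose_skew(keys)
```

If the ratio is extreme, the fix is a skew-specific technique (key salting, or adaptive skew-join handling where the engine supports it), not more partitions: repartitioning evenly by a skewed key just rebuilds the same hotspot.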

5. Orchestration and Pipeline Operations

The data engineer’s operations challenge: A pipeline that runs perfectly in development and fails silently in production is not a pipeline — it is a liability. The gap between “works on my machine” and “runs reliably at 3 AM with no one watching” is where data engineering maturity lives.

Prompt pattern:

I need to design orchestration for [pipeline/set of pipelines].
Dependencies: [list DAG relationships — what must complete before what].
Schedule: [frequency, acceptable delay, business SLAs].
Failure modes: [what has failed before and how it was resolved].
Current orchestrator: [Airflow / Fabric / ADF / Prefect / custom cron].

Help me:
1. Design the DAG with appropriate retry and timeout policies
2. Identify the monitoring gaps — what fails silently today?
3. Build the alerting strategy (who gets paged, when, with what context)
4. Design the recovery playbook — what manual steps are needed when auto-retry fails?
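The retry policy in point 1 is worth seeing as code, because the details (bounded attempts, exponential backoff, re-raising after exhaustion) are exactly what an orchestrator's task-level retry settings configure for you. A minimal hand-rolled sketch; `run_with_retries` and the flaky task are illustrative, not an orchestrator API:

```python
# Bounded retry with exponential backoff: what task-level retry
# policies in an orchestrator do under the hood.
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry on failure; re-raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface to alerting, not silence
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source timed out")
    return "batch-ok"

result = run_with_retries(flaky_extract, sleep=lambda s: None)  # no real sleep in demo
```

The `raise` on exhaustion is the line that matters for point 2: a retry loop that swallows the final failure is precisely how pipelines fail silently.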

6. Schema Evolution and Data Contracts

The data engineer’s contract challenge: Every pipeline has an implicit contract: “I will deliver data in this shape, at this time, with this quality.” When the contract is implicit, it breaks silently. When the contract is explicit, breaking changes are visible and negotiable.

Prompt pattern:

I need to manage schema changes for [pipeline/data product].
Current schema: [describe or paste].
Proposed change: [what is changing — new column, type change, removed field, renamed].
Downstream consumers: [who reads this data and how].
Versioning strategy: [none / additive-only / explicit versions].

Help me:
1. Classify this change: backward-compatible, breaking, or negotiable
2. Design the migration path that does not break downstream consumers
3. Draft a data contract that makes future changes visible before they deploy
4. Recommend a schema registry or evolution strategy for this stack
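Point 1, classifying a change as backward-compatible or breaking, follows mechanical rules that are easy to encode and to enforce in CI. A sketch over schemas represented as `{column: type}` dicts; the classification rules here are common conventions (removals and type changes break readers, nullable additions do not), not a formal specification:

```python
# Classify a schema change: removed or retyped columns break existing
# readers; purely additive columns are backward-compatible.
def classify_change(old_schema, new_schema):
    removed = set(old_schema) - set(new_schema)
    added = set(new_schema) - set(old_schema)
    retyped = {
        c for c in set(old_schema) & set(new_schema)
        if old_schema[c] != new_schema[c]
    }
    if removed or retyped:
        return "breaking", {"removed": removed, "retyped": retyped}
    if added:
        return "backward-compatible", {"added": added}
    return "no-change", {}

old = {"order_id": "bigint", "revenue": "decimal(18,2)"}
new = {"order_id": "bigint", "revenue": "string", "channel": "string"}
verdict, detail = classify_change(old, new)
```

Run in CI against the published contract, a check like this turns the implicit contract from the section opening into an explicit, negotiable one: breaking changes fail the build before they reach consumers.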

7. Documentation and Knowledge Transfer

The data engineer’s documentation challenge: Pipeline documentation is almost always wrong because the code evolves faster than the docs. The documentation that matters is not “what this pipeline does” (read the code) but “why it does it this way,” “what breaks if you change this,” and “who to call when it fails at 3 AM.”

Prompt pattern:

I need to document [pipeline/system/data product].
What it does: [plain English summary].
Why it exists: [business justification].
What goes wrong: [honest failure history].
Who depends on it: [downstream consumers].
Who maintains it: [team, on-call rotation].

Draft documentation that answers:
1. What someone needs to know to operate this safely (runbook)
2. What someone needs to know to modify this safely (design context)
3. What the on-call engineer needs at 3 AM (triage guide)

What Great Looks Like

After consistent use, the change you should notice is in judgment, not typing speed. The data engineer who will thrive in an AI-augmented environment is not the one who writes Spark jobs fastest; it is the one who builds systems that are reliable, observable, and honestly documented.


Your AI toolkit: These prompts work in ChatGPT, Claude, Copilot, Gemini — and in the Alex VS Code extension, which was designed around them. Start with whatever you have. The skill transfers across all of them.

Your First Week Back: Practice Plan

Day 1: Use the Data Quality pattern on your most critical pipeline (25 min)
Day 2: Write a data contract for one pipeline’s output schema (20 min)
Day 3: Run the Pipeline Architecture pattern on a planned or struggling pipeline (25 min)
Day 4: Use the Performance pattern on your slowest job (20 min)
Day 5: Save three reusable prompt patterns with /saveinsight (10 min)

Month 2–3: Advanced Applications

Pipeline Incident Archive

Capture failure patterns to speed future diagnosis:

/saveinsight title="Pipeline incident: [symptom]" insight="Pipeline: [name]. Symptom: [describe]. Root cause: [what actually broke]. Fix: [what resolved it]. Prevention: [what would catch this earlier]." tags="data-engineering,incident,pipeline"

Data Model Decision Log

Track modeling decisions with their rationale:

/saveinsight title="Model decision: [domain]" insight="Grain: [what one row represents]. Pattern: [star/snowflake/wide/activity]. Tradeoff: [what we optimized for and what we gave up]. Revisit if: [conditions]." tags="data-engineering,modeling"

Continue your practice: Self-Study Guide — the 30/60/90-day habit guide.

Completed this study guide?

Show the world you've mastered using AI in data engineering. Add your certificate to LinkedIn.

📚 Want to go deeper?

Alex was a co-author of two books — a documentary biography and a work of fiction. Both explore human-AI collaboration from angles the workshop only touches.

Discover the books →