Handout: Ohpen Data Engineer Case Study (2026)
Core narrative (use this line everywhere)
I assume an append-only raw layer and deterministic transformations, so curated outputs can be reproduced or recomputed for historical periods when necessary, while keeping raw inputs immutable.
High-level pipeline diagram
What’s implemented (Task 1–5)
- Task 1 (ETL): ETL Implementation
  - Reads CSV from AWS S3.
  - Validates required fields + currency allowlist + timestamp parsing.
  - Writes Parquet partitioned by year/month, and isolates each run under `run_id=...`.
  - Writes invalid records to quarantine with `validation_error`.
  - Writes `_SUCCESS` with run metrics.
  - See Assumptions and Edge Cases for design details.
- Task 2 (Architecture): Data Lake Architecture + diagram
  - Raw / Processed / Aggregates + Quarantine layout.
  - Schema evolution strategy: Iceberg-first, Parquet fallback.
- Task 3 (SQL): Balance History Query
  - Month-end “as-of” balances for Jan–Mar 2024.
  - Generates an account×month spine and left-joins, so missing months stay `NULL` (as in the example).
  - See SQL Breakdown for design rationale.
- Task 4 (CI/CD + IaC): CI/CD Workflow
  - CI workflow: GitHub Actions
  - Terraform stub: Infrastructure as Code
- Task 5 (Comms + Docs): Communication Documentation
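The Task 1 validation rules described above can be sketched as follows. The field names, the currency allowlist, and the helper names (`validate_row`, `split_rows`) are illustrative assumptions, not the actual module API:

```python
from datetime import datetime

# Hypothetical schema — the real required fields and allowlist live in the ETL config.
REQUIRED_FIELDS = ("transaction_id", "account_id", "amount", "currency", "timestamp")
CURRENCY_ALLOWLIST = {"EUR", "USD", "GBP"}


def validate_row(row: dict):
    """Return a validation_error string, or None if the row is valid."""
    missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
    if missing:
        return f"Missing required fields: {', '.join(missing)}"
    if row["currency"] not in CURRENCY_ALLOWLIST:
        return "Invalid Currency"
    try:
        datetime.fromisoformat(row["timestamp"])
    except ValueError:
        return "Invalid Timestamp"
    return None


def split_rows(rows):
    """Route each row to the processed stream or to quarantine with its error."""
    valid, quarantined = [], []
    for row in rows:
        error = validate_row(row)
        if error is None:
            valid.append(row)
        else:
            quarantined.append({**row, "validation_error": error})
    return valid, quarantined
```

Keeping validation as a pure function makes the quarantine path trivially unit-testable, independent of S3 I/O.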
How to demo quickly (local)
From `projects/ohpen-case-2026/`:
- Set up AWS credentials and run ETL:

```bash
source .venv/bin/activate
export AWS_ACCESS_KEY_ID=<your-aws-access-key>
export AWS_SECRET_ACCESS_KEY=<your-aws-secret-key>
export AWS_REGION=us-east-1
python3 tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py \
  --input-bucket ohpen-raw \
  --input-key transactions/transactions.csv \
  --output-bucket ohpen-processed \
  --output-prefix transactions \
  --quarantine-bucket ohpen-quarantine \
  --quarantine-prefix transactions
```
Expected: valid rows written under `processed/transactions/run_id=.../year=YYYY/month=MM/` and invalid rows under `quarantine/.../run_id=.../`.
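Downstream consumers should treat the `_SUCCESS` marker as the commit signal before reading a run's files. A minimal sketch, assuming the marker holds the run-metrics JSON (the helper name and field names are hypothetical):

```python
import json
from pathlib import Path


def load_run_metrics(run_dir):
    """Read the _SUCCESS marker for a run directory.

    Returns the metrics dict, or None if the marker is absent —
    meaning the run never committed and must not be consumed.
    """
    marker = Path(run_dir) / "_SUCCESS"
    if not marker.exists():
        return None
    return json.loads(marker.read_text())
```

A consumer then reads partitions only when `load_run_metrics(run_dir)` is not `None`, which is what makes the marker a cheap atomic-publish signal on object storage.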
Monitoring & Observability (per task)
Task 1 (ETL job)
- Metrics to emit (structured logs + CloudWatch/Datadog):
  - `run_id`, `input_rows`, `valid_rows`, `quarantined_rows`
  - `duration_seconds`, `bytes_read`, `bytes_written`
  - `quarantine_by_reason` (e.g., Invalid Currency, Missing required fields, Invalid Timestamp)
- Alerts:
  - Job failure (non-zero exit, no `_SUCCESS`)
  - Quarantine rate spike (e.g., `quarantined_rows / input_rows > 1%`)
  - Volume anomaly (too few or too many rows vs baseline)
- Run completeness:
  - `_SUCCESS` marker with metrics JSON is the “commit” signal for consumers.
  - Optional (describe-only): `manifest.json` listing files + counts + schema hash.
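A sketch of assembling that metrics payload, with the 1% quarantine-rate alert rule from above baked in (the function name and exact field set are illustrative):

```python
QUARANTINE_RATE_THRESHOLD = 0.01  # 1%, matching the alert rule above


def run_metrics(run_id, input_rows, valid_rows, quarantined_rows, quarantine_by_reason):
    """Build the structured-log / _SUCCESS metrics payload for one ETL run."""
    rate = quarantined_rows / input_rows if input_rows else 0.0
    return {
        "run_id": run_id,
        "input_rows": input_rows,
        "valid_rows": valid_rows,
        "quarantined_rows": quarantined_rows,
        "quarantine_rate": rate,
        "quarantine_by_reason": quarantine_by_reason,
        # Emitting the alert decision alongside the raw numbers lets the
        # monitoring system alert on a single boolean field.
        "alert_quarantine_spike": rate > QUARANTINE_RATE_THRESHOLD,
    }
```

Logging this one JSON object per run gives CloudWatch/Datadog everything needed for the alert rules without parsing free-form log lines.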
Task 2 (Data Lake health)
- Storage & access:
  - Bucket versioning, lifecycle policies (raw retention vs processed retention).
  - S3 4xx/5xx error rates, request latency, throttling.
- Data governance checks (automated):
  - “No public access” enforcement.
  - Tagging compliance (env, owner, cost-center).
  - Glue Catalog drift (expected partitions present, table schema matches expectations).
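The “expected partitions present” check reduces to set arithmetic once partition listings are in hand. A minimal sketch (function names are hypothetical; fetching the actual listing from the Glue Catalog is out of scope here):

```python
def expected_partitions(start_year, start_month, end_year, end_month):
    """Enumerate the (year, month) partitions that should exist, inclusive."""
    parts = []
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        parts.append((f"{y:04d}", f"{m:02d}"))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return parts


def missing_partitions(expected, actual):
    """Partitions the catalog should have but doesn't — a drift signal."""
    return sorted(set(expected) - set(actual))
```

A scheduled job would feed `actual` from a Glue `GetPartitions` listing and alert whenever `missing_partitions` is non-empty.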
Task 3 (SQL / Athena usage)
- Cost & performance:
  - Track Athena “bytes scanned” per query (FinOps).
  - Track query runtime and failure rate (operational reliability).
- Quality signals:
  - Monitor for a sudden increase in `NULL` month-ends (may indicate missing ingestion for a period).
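The account×month spine pattern behind Task 3 can be illustrated engine-agnostically. The sketch below uses Python's built-in sqlite3 as a stand-in for Athena, with a toy schema — missing months surface as `NULL` rather than disappearing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE balances (account_id TEXT, month TEXT, balance REAL);
INSERT INTO balances VALUES ('A1', '2024-01', 100.0), ('A1', '2024-03', 250.0);
CREATE TABLE months (month TEXT);
INSERT INTO months VALUES ('2024-01'), ('2024-02'), ('2024-03');
""")

# Build the account x month spine, then LEFT JOIN so a month with no
# balance row still appears — with balance = NULL.
rows = con.execute("""
    SELECT a.account_id, m.month, b.balance
    FROM (SELECT DISTINCT account_id FROM balances) a
    CROSS JOIN months m
    LEFT JOIN balances b
      ON b.account_id = a.account_id AND b.month = m.month
    ORDER BY m.month
""").fetchall()
```

The same CROSS JOIN + LEFT JOIN shape carries over to the Athena query; only the month-spine generation syntax differs per engine.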
Task 4 (CI/CD + deployments)
- CI health:
  - Test pass rate, time-to-green, lint failures.
  - PR lead time / deployment frequency (DORA-style).
- Release safety:
  - Artifact versioning (git SHA tag) so backfills can run with a known version.
  - Terraform plan/apply drift detection (fail the pipeline on unexpected infra drift).
Task 5 (Communication & documentation)
- Operational cadence:
  - Scheduled stakeholder updates for key pipelines (monthly/weekly).
  - Post-incident comms template for failures or data corrections.
- Doc freshness:
  - Treat the one-pager as living documentation; update it on schema changes or new consumers.
Interview Q&A (short, practical answers)
“Why run_id partitions?”
To make runs reproducible and rerunnable (including historical periods) without overwriting previous outputs; it also makes debugging and comparisons between runs straightforward.
“How do you do backfills safely?”
Reprocess the relevant raw partition(s), write a new run_id, validate counts, then publish the corrected dataset (catalog pointer/view, or Iceberg snapshot).
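The “validate counts” step can be a simple gate comparing run metrics before the catalog pointer moves. A sketch with hypothetical names and a configurable drift tolerance:

```python
def backfill_count_ok(old_metrics, new_metrics, tolerance=0.0):
    """Gate publication of a backfill: compare valid-row counts between the
    previous run and the new run_id before repointing the catalog/view.

    tolerance is the allowed relative drift (0.0 = counts must match exactly).
    """
    old = old_metrics["valid_rows"]
    new = new_metrics["valid_rows"]
    drift = abs(new - old) / old if old else float("inf")
    return drift <= tolerance
```

Because each run lives under its own `run_id=...`, a failed gate costs nothing: the old run stays published and the new one is simply never pointed to.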
“How do you prevent breaking schema changes?”
Treat schema as a contract: default to additive + optional changes, deprecate before removal, and avoid semantic drift (a field keeping the same name but changing meaning). For analytics, Iceberg + Glue Catalog gives controlled evolution; for integrations/events, enforce compatibility rules at the boundary (validation + versioning discipline).
“How do you do zero-downtime schema evolution with long retention/replay?”
Assume old data will be replayed years later, so new fields must remain optional and consumers must tolerate missing fields. For breaking changes, use either:
- Dual-read/dual-logic via a version flag (rolling migration), or
- Versioned datasets/streams (v2) + backfill/migrate + retire v1 when semantics/security require it.
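The dual-read option above might look like this in a consumer; the field names (`schema_version`, `amount`, `amount_cents`) are illustrative, not from the actual dataset:

```python
def read_record(record: dict) -> dict:
    """Normalize v1 and v2 records to one internal shape during a rolling migration."""
    version = record.get("schema_version", 1)  # old data carries no flag → v1
    if version == 1:
        # v1 stored a decimal-string amount
        return {"amount_cents": round(float(record["amount"]) * 100)}
    if version == 2:
        # v2 stores integer cents natively
        return {"amount_cents": record["amount_cents"]}
    raise ValueError(f"Unsupported schema_version: {version}")
```

Once all producers emit v2 and replayable v1 data has been migrated or aged out, the v1 branch is deleted — that deletion is the actual "retire v1" step.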
“What do you do for security-driven breaking changes (e.g., encryption)?”
If old data must not remain, do a controlled migration: re-encrypt/reprocess into a v2 dataset/topic, cut consumers over, then producers, then delete/retire v1. If dual-support is acceptable short-term, use an encryption_version flag so consumers can decrypt both during rollout.
“When do you move from Python to Spark/Glue?”
When the input volume or transformation complexity requires distributed compute (large files, heavy joins, or sustained high throughput), migrate from Python/pandas to AWS Glue Spark jobs. The interface stays the same; only the execution engine changes (Glue Python Shell → Glue Spark).
"Why DuckDB for SQL tests instead of a real database?"
For the case study, DuckDB validates syntax and query logic in milliseconds without external dependencies. For production, we'd add integration tests against Aurora/RDS replicas to catch engine-specific behavior. See SQL Testing Guide for details.
"Why not Testcontainers for integration tests?"
Current tests use temp files + mocking (fast, simple, validates core logic). Testcontainers would be the next layer for production: testing S3 API edge cases, multipart uploads, real Glue workflows. Trade-off: speed vs realism—case study prioritizes the former, production needs both. See Testing Guide for comprehensive testing documentation.