Handout: Ohpen Data Engineer Case Study (2026)
Core narrative (use this line everywhere)
I assume an append-only raw layer and deterministic transformations, so curated outputs can be reproduced or recomputed for historical periods when necessary, while keeping raw inputs immutable.
High-level pipeline diagram
What’s implemented (Task 1–5)
- Task 1 (ETL): ETL Implementation
  - Reads CSV from AWS S3.
  - Validates required fields + currency allowlist + timestamp parsing.
  - Writes Parquet partitioned by year/month, and isolates each run under `run_id=...`.
  - Writes invalid records to quarantine with `validation_error`.
  - Writes `_SUCCESS` with run metrics.
  - See Assumptions and Edge Cases for design details.
- Task 2 (Architecture): Data Lake Architecture + diagram
  - Raw / Processed / Aggregates + Quarantine layout.
  - Schema evolution strategy: Iceberg-first, Parquet fallback.
- Task 3 (SQL): Balance History Query
  - Month-end “as-of” balances for Jan–Mar 2024.
  - Generates an account×month spine and left-joins, so missing months stay `NULL` (as in the example).
  - See SQL Breakdown for design rationale.
- Task 4 (CI/CD + IaC): CI/CD Workflow
  - CI workflow: GitHub Actions
  - Terraform stub: Infrastructure as Code
- Task 5 (Comms + Docs): Communication Documentation
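The Task 1 validation rules described above can be sketched as follows. The field names, the currency allowlist, and the helper names (`validate_row`, `split_rows`) are illustrative assumptions, not the actual module API:

```python
from datetime import datetime

# Hypothetical schema — the real required fields and allowlist live in the ETL config.
REQUIRED_FIELDS = ("transaction_id", "account_id", "amount", "currency", "timestamp")
CURRENCY_ALLOWLIST = {"EUR", "USD", "GBP"}


def validate_row(row: dict):
    """Return a validation_error string, or None if the row is valid."""
    missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
    if missing:
        return f"Missing required fields: {', '.join(missing)}"
    if row["currency"] not in CURRENCY_ALLOWLIST:
        return "Invalid Currency"
    try:
        datetime.fromisoformat(row["timestamp"])
    except ValueError:
        return "Invalid Timestamp"
    return None


def split_rows(rows):
    """Route each row to the processed stream or to quarantine with its error."""
    valid, quarantined = [], []
    for row in rows:
        error = validate_row(row)
        if error is None:
            valid.append(row)
        else:
            quarantined.append({**row, "validation_error": error})
    return valid, quarantined
```

Keeping validation as a pure function makes the quarantine path trivially unit-testable, independent of S3 I/O.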
How to demo quickly (local)
From `projects/ohpen-case-2026/`:
- Set up AWS credentials and run ETL:

```bash
source .venv/bin/activate
export AWS_ACCESS_KEY_ID=<your-aws-access-key>
export AWS_SECRET_ACCESS_KEY=<your-aws-secret-key>
export AWS_REGION=us-east-1
python3 tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py \
  --input-bucket ohpen-raw \
  --input-key transactions/transactions.csv \
  --output-bucket ohpen-processed \
  --output-prefix transactions \
  --quarantine-bucket ohpen-quarantine \
  --quarantine-prefix transactions
```
Expected: valid rows written under `processed/transactions/run_id=.../year=YYYY/month=MM/` and invalid rows under `quarantine/.../run_id=.../`.
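Downstream consumers should treat the `_SUCCESS` marker as the commit signal before reading a run's files. A minimal sketch, assuming the marker holds the run-metrics JSON (the helper name and field names are hypothetical):

```python
import json
from pathlib import Path


def load_run_metrics(run_dir):
    """Read the _SUCCESS marker for a run directory.

    Returns the metrics dict, or None if the marker is absent —
    meaning the run never committed and must not be consumed.
    """
    marker = Path(run_dir) / "_SUCCESS"
    if not marker.exists():
        return None
    return json.loads(marker.read_text())
```

A consumer then reads partitions only when `load_run_metrics(run_dir)` is not `None`, which is what makes the marker a cheap atomic-publish signal on object storage.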
Monitoring & Observability (per task)
Task 1 (ETL job)
- Metrics to emit (structured logs + CloudWatch/Datadog):
  - `run_id`, `input_rows`, `valid_rows`, `quarantined_rows`
  - `duration_seconds`, `bytes_read`, `bytes_written`
  - `quarantine_by_reason` (e.g., Invalid Currency, Missing required fields, Invalid Timestamp)
- Alerts:
  - Job failure (non-zero exit, no `_SUCCESS`)
  - Quarantine rate spike (e.g., `quarantined_rows / input_rows > 1%`)
  - Volume anomaly (too few or too many rows vs baseline)
- Run completeness:
  - `_SUCCESS` marker with metrics JSON is the “commit” signal for consumers.
  - Optional (describe-only): `manifest.json` listing files + counts + schema hash.
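A sketch of assembling that metrics payload, with the 1% quarantine-rate alert rule from above baked in (the function name and exact field set are illustrative):

```python
QUARANTINE_RATE_THRESHOLD = 0.01  # 1%, matching the alert rule above


def run_metrics(run_id, input_rows, valid_rows, quarantined_rows, quarantine_by_reason):
    """Build the structured-log / _SUCCESS metrics payload for one ETL run."""
    rate = quarantined_rows / input_rows if input_rows else 0.0
    return {
        "run_id": run_id,
        "input_rows": input_rows,
        "valid_rows": valid_rows,
        "quarantined_rows": quarantined_rows,
        "quarantine_rate": rate,
        "quarantine_by_reason": quarantine_by_reason,
        # Emitting the alert decision alongside the raw numbers lets the
        # monitoring system alert on a single boolean field.
        "alert_quarantine_spike": rate > QUARANTINE_RATE_THRESHOLD,
    }
```

Logging this one JSON object per run gives CloudWatch/Datadog everything needed for the alert rules without parsing free-form log lines.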
Task 2 (Data Lake health)
- Storage & access:
  - Bucket versioning, lifecycle policies (raw retention vs processed retention).
  - S3 4xx/5xx error rates, request latency, throttling.
- Data governance checks (automated):
  - “No public access” enforcement.
  - Tagging compliance (env, owner, cost-center).
  - Glue Catalog drift (expected partitions present, table schema matches expectations).
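The “expected partitions present” check reduces to set arithmetic once partition listings are in hand. A minimal sketch (function names are hypothetical; fetching the actual listing from the Glue Catalog is out of scope here):

```python
def expected_partitions(start_year, start_month, end_year, end_month):
    """Enumerate the (year, month) partitions that should exist, inclusive."""
    parts = []
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        parts.append((f"{y:04d}", f"{m:02d}"))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return parts


def missing_partitions(expected, actual):
    """Partitions the catalog should have but doesn't — a drift signal."""
    return sorted(set(expected) - set(actual))
```

A scheduled job would feed `actual` from a Glue `GetPartitions` listing and alert whenever `missing_partitions` is non-empty.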
Task 3 (SQL / Athena usage)
- Cost & performance:
  - Track Athena “bytes scanned” per query (FinOps).
  - Track query runtime and failure rate (operational reliability).
- Quality signals:
  - Monitor for a sudden increase in `NULL` month-ends (may indicate missing ingestion for a period).
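The account×month spine pattern behind Task 3 can be illustrated engine-agnostically. The sketch below uses Python's built-in sqlite3 as a stand-in for Athena, with a toy schema — missing months surface as `NULL` rather than disappearing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE balances (account_id TEXT, month TEXT, balance REAL);
INSERT INTO balances VALUES ('A1', '2024-01', 100.0), ('A1', '2024-03', 250.0);
CREATE TABLE months (month TEXT);
INSERT INTO months VALUES ('2024-01'), ('2024-02'), ('2024-03');
""")

# Build the account x month spine, then LEFT JOIN so a month with no
# balance row still appears — with balance = NULL.
rows = con.execute("""
    SELECT a.account_id, m.month, b.balance
    FROM (SELECT DISTINCT account_id FROM balances) a
    CROSS JOIN months m
    LEFT JOIN balances b
      ON b.account_id = a.account_id AND b.month = m.month
    ORDER BY m.month
""").fetchall()
```

The same CROSS JOIN + LEFT JOIN shape carries over to the Athena query; only the month-spine generation syntax differs per engine.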
Task 4 (CI/CD + deployments)
- CI health:
  - Test pass rate, time-to-green, lint failures.
  - PR lead time / deployment frequency (DORA-style).
- Release safety:
  - Artifact versioning (git SHA tag) so backfills can run with a known version.
  - Terraform plan/apply drift detection (fail the pipeline on unexpected infra drift).
Task 5 (Communication & documentation)
- Operational cadence:
  - Scheduled stakeholder updates for key pipelines (monthly/weekly).
  - Post-incident comms template for failures or data corrections.
- Doc freshness:
  - Treat the one-pager as living documentation; update it on schema changes or new consumers.
Interview Q&A (short, practical answers)
“Why run_id partitions?”
To make runs reproducible and rerunnable (including historical periods) without overwriting previous outputs; it also makes debugging and comparisons between runs straightforward.
“How do you do backfills safely?”
Reprocess the relevant raw partition(s), write a new run_id, validate counts, then publish the corrected dataset (catalog pointer/view, or Iceberg snapshot).
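The “validate counts” step can be a simple gate comparing run metrics before the catalog pointer moves. A sketch with hypothetical names and a configurable drift tolerance:

```python
def backfill_count_ok(old_metrics, new_metrics, tolerance=0.0):
    """Gate publication of a backfill: compare valid-row counts between the
    previous run and the new run_id before repointing the catalog/view.

    tolerance is the allowed relative drift (0.0 = counts must match exactly).
    """
    old = old_metrics["valid_rows"]
    new = new_metrics["valid_rows"]
    drift = abs(new - old) / old if old else float("inf")
    return drift <= tolerance
```

Because each run lives under its own `run_id=...`, a failed gate costs nothing: the old run stays published and the new one is simply never pointed to.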
“How do you prevent breaking schema changes?”
Treat schema as a contract: default to additive + optional changes, deprecate before removal, and avoid semantic drift (a field keeping the same name but changing meaning). For analytics, Iceberg + Glue Catalog gives controlled evolution; for integrations/events, enforce compatibility rules at the boundary (validation + versioning discipline).
“How do you do zero-downtime schema evolution with long retention/replay?”
Assume old data will be replayed years later, so new fields must remain optional and consumers must tolerate missing fields. For breaking changes, use either:
- Dual-read/dual-logic via a version flag (rolling migration), or
- Versioned datasets/streams (v2) + backfill/migrate + retire v1 when semantics/security require it.
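The dual-read option above might look like this in a consumer; the field names (`schema_version`, `amount`, `amount_cents`) are illustrative, not from the actual dataset:

```python
def read_record(record: dict) -> dict:
    """Normalize v1 and v2 records to one internal shape during a rolling migration."""
    version = record.get("schema_version", 1)  # old data carries no flag → v1
    if version == 1:
        # v1 stored a decimal-string amount
        return {"amount_cents": round(float(record["amount"]) * 100)}
    if version == 2:
        # v2 stores integer cents natively
        return {"amount_cents": record["amount_cents"]}
    raise ValueError(f"Unsupported schema_version: {version}")
```

Once all producers emit v2 and replayable v1 data has been migrated or aged out, the v1 branch is deleted — that deletion is the actual "retire v1" step.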
“What do you do for security-driven breaking changes (e.g., encryption)?”
If old data must not remain, do a controlled migration: re-encrypt/reprocess into a v2 dataset/topic, cut consumers over, then producers, then delete/retire v1. If dual-support is acceptable short-term, use an encryption_version flag so consumers can decrypt both during rollout.
“When do you move from Python to Spark/Glue?”
When the input volume or transformation complexity requires distributed compute (large files, heavy joins, or sustained high throughput), migrate from Python/pandas to AWS Glue Spark jobs. The interface stays the same; only the execution engine changes (Glue Python Shell → Glue Spark).
"Why DuckDB for SQL tests instead of a real database?"
For the case study, DuckDB validates syntax and query logic in milliseconds without external dependencies. For production, we'd add integration tests against Aurora/RDS replicas to catch engine-specific behavior. See SQL Testing Guide for details.
"Why not Testcontainers for integration tests?"
Current tests use temp files + mocking (fast, simple, validates core logic). Testcontainers would be the next layer for production: testing S3 API edge cases, multipart uploads, real Glue workflows. Trade-off: speed vs realism—case study prioritizes the former, production needs both. See Testing Guide for comprehensive testing documentation.