
Handout: Ohpen Data Engineer Case Study (2026)

Core narrative (use this line everywhere)​

I assume an append-only raw layer and deterministic transformations, so curated outputs can be reproduced or recomputed for historical periods when necessary, while keeping raw inputs immutable.

High-level pipeline diagram​

What’s implemented (Task 1–5)​

How to demo quickly (local)​

From projects/ohpen-case-2026/:

  1. Set up AWS credentials and run ETL:
source .venv/bin/activate
export AWS_ACCESS_KEY_ID=<your-aws-access-key>
export AWS_SECRET_ACCESS_KEY=<your-aws-secret-key>
export AWS_REGION=us-east-1

python3 tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py \
--input-bucket ohpen-raw \
--input-key transactions/transactions.csv \
--output-bucket ohpen-processed \
--output-prefix transactions \
--quarantine-bucket ohpen-quarantine \
--quarantine-prefix transactions

Expected: valid rows written under processed/transactions/run_id=.../year=YYYY/month=MM/ and invalid rows under quarantine/.../run_id=.../.


Monitoring & Observability (per task)​

Task 1 (ETL job)​

  • Metrics to emit (structured logs + CloudWatch/Datadog):
      • run_id, input_rows, valid_rows, quarantined_rows
      • duration_seconds, bytes_read, bytes_written
      • quarantine_by_reason (e.g., Invalid Currency, Missing required fields, Invalid Timestamp)
  • Alerts:
      • Job failure (non-zero exit, no _SUCCESS)
      • Quarantine rate spike (e.g., quarantined_rows / input_rows > 1%)
      • Volume anomaly (too few or too many rows vs baseline)
  • Run completeness:
      • _SUCCESS marker with metrics JSON is the “commit” signal for consumers (see the sketch after this list).
      • Optional (describe-only): manifest.json listing files + counts + schema hash.
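
As an illustration, a minimal sketch of publishing the metrics JSON and the _SUCCESS marker at the end of a run. The bucket, prefix, and example numbers are placeholders, not the actual job's values.

# Sketch: write run metrics and an empty _SUCCESS marker once the ETL finishes.
# Assumes boto3 credentials are configured; bucket/prefix/values are illustrative.
import json
import boto3

def publish_run_metrics(bucket: str, prefix: str, run_id: str, metrics: dict) -> None:
    s3 = boto3.client("s3")
    # The metrics JSON is what dashboards and alerts read (counts, duration, reasons).
    s3.put_object(
        Bucket=bucket,
        Key=f"{prefix}/run_id={run_id}/metrics.json",
        Body=json.dumps(metrics).encode("utf-8"),
    )
    # The empty _SUCCESS object is the "commit" signal consumers wait for.
    s3.put_object(Bucket=bucket, Key=f"{prefix}/run_id={run_id}/_SUCCESS", Body=b"")

publish_run_metrics(
    "ohpen-processed",
    "transactions",
    "20260131T120000Z",
    {
        "run_id": "20260131T120000Z",
        "input_rows": 10000,
        "valid_rows": 9950,
        "quarantined_rows": 50,
        "duration_seconds": 42.0,
        "quarantine_by_reason": {"Invalid Currency": 30, "Missing required fields": 20},
    },
)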

Task 2 (Data Lake health)​

  • Storage & access:
      • Bucket versioning, lifecycle policies (raw retention vs processed retention).
      • S3 4xx/5xx error rates, request latency, throttling.
  • Data governance checks (automated; see the sketch after this list):
      • “No public access” enforcement.
      • Tagging compliance (env, owner, cost-center).
      • Glue Catalog drift (expected partitions present, table schema matches expectations).
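
A minimal sketch of one automated governance check, the “no public access” enforcement. The bucket name is illustrative, and the check assumes the account relies on S3 Block Public Access settings.

# Sketch: fail loudly if a lake bucket is not fully locked down.
# Assumes boto3 credentials; the bucket name is illustrative.
import boto3

def assert_no_public_access(bucket: str) -> None:
    s3 = boto3.client("s3")
    config = s3.get_public_access_block(Bucket=bucket)["PublicAccessBlockConfiguration"]
    # All four block-public-access flags must be enabled.
    disabled = [name for name, enabled in config.items() if not enabled]
    if disabled:
        raise RuntimeError(f"{bucket}: public access not fully blocked: {disabled}")

assert_no_public_access("ohpen-raw")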

Task 3 (SQL / Athena usage)​

  • Cost & performance (see the sketch after this list):
      • Track Athena “bytes scanned” per query (FinOps).
      • Track query runtime and failure rate (operational reliability).
  • Quality signals:
      • Monitor for sudden increase in NULL month-ends (may indicate missing ingestion for a period).
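
A minimal sketch of pulling the cost and performance signals for a finished Athena query; the execution ID is a placeholder.

# Sketch: read bytes scanned and runtime for one Athena query execution.
# Assumes boto3 credentials; the execution ID is a placeholder.
import boto3

def athena_query_stats(query_execution_id: str) -> dict:
    athena = boto3.client("athena")
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)["QueryExecution"]
    stats = execution.get("Statistics", {})
    return {
        "state": execution["Status"]["State"],
        "data_scanned_bytes": stats.get("DataScannedInBytes"),  # FinOps signal
        "runtime_ms": stats.get("TotalExecutionTimeInMillis"),  # reliability signal
    }

print(athena_query_stats("<query-execution-id>"))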

Task 4 (CI/CD + deployments)​

  • CI health:
      • Test pass rate, time-to-green, lint failures.
      • PR lead time / deployment frequency (DORA-style).
  • Release safety:
      • Artifact versioning (git SHA tag) so backfills can run with a known version (see the sketch after this list).
      • Terraform plan/apply drift detection (fail pipeline on unexpected infra drift).
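
A minimal sketch of stamping the deployable artifact with the current git SHA so a backfill can pin an exact code version. The artifact path and bucket name are assumptions for illustration.

# Sketch: upload the ETL artifact under a key that embeds the git SHA.
# Run from the repository root; local path and bucket name are illustrative.
import subprocess
import boto3

def release_artifact(local_path: str, bucket: str) -> str:
    sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    key = f"artifacts/ingest_transactions/{sha}/ingest_transactions.zip"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key  # record this key with the run so a later backfill reuses the same version

print(release_artifact("dist/ingest_transactions.zip", "ohpen-artifacts"))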

Task 5 (Communication & documentation)​

  • Operational cadence:
      • Scheduled stakeholder updates for key pipelines (monthly/weekly).
      • Post-incident comms template for failures or data corrections.
  • Doc freshness:
      • Treat the one-pager as living documentation; update on schema changes or new consumers.

Interview Q&A (short, practical answers)​

“Why run_id partitions?”​

To make runs reproducible and rerunnable (including historical periods) without overwriting previous outputs; it also makes debugging and comparisons between runs straightforward.
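
A minimal sketch of how run_id-partitioned keys keep reruns side by side; the prefix and layout mirror the demo output above, and the file name is illustrative.

# Sketch: every run writes under its own run_id, so reruns never overwrite earlier output.
from datetime import datetime, timezone

def output_key(prefix: str, run_id: str, year: int, month: int, filename: str) -> str:
    return f"{prefix}/run_id={run_id}/year={year}/month={month:02d}/{filename}"

run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
# Rerunning March 2025 produces a new run_id; the previous run stays available for comparison.
print(output_key("transactions", run_id, 2025, 3, "part-0000.parquet"))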

“How do you do backfills safely?”​

Reprocess the relevant raw partition(s), write a new run_id, validate counts, then publish the corrected dataset (catalog pointer/view, or Iceberg snapshot).
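
A minimal sketch of the “validate counts” step before publishing; it assumes each run wrote a metrics.json as sketched in the monitoring section, and the bucket, prefix, and tolerance are illustrative choices.

# Sketch: compare a backfill run against the run it replaces before repointing consumers.
# Bucket, prefix, run IDs, and the 1% tolerance are illustrative assumptions.
import json
import boto3

def run_valid_rows(bucket: str, prefix: str, run_id: str) -> int:
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=f"{prefix}/run_id={run_id}/metrics.json")["Body"].read()
    return json.loads(body)["valid_rows"]

def validate_backfill(bucket: str, prefix: str, old_run: str, new_run: str, tolerance: float = 0.01) -> None:
    old_count = run_valid_rows(bucket, prefix, old_run)
    new_count = run_valid_rows(bucket, prefix, new_run)
    if old_count and abs(new_count - old_count) / old_count > tolerance:
        raise RuntimeError(f"backfill {new_run} deviates from {old_run}: {new_count} vs {old_count}")
    # Only after this passes is the catalog pointer/view (or Iceberg snapshot) repointed.

validate_backfill("ohpen-processed", "transactions", "20260131T120000Z", "20260214T090000Z")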

“How do you prevent breaking schema changes?”​

Treat schema as a contract: default to additive + optional changes, deprecate before removal, and avoid semantic drift (a field keeping the same name but changing meaning). For analytics, Iceberg + Glue Catalog gives controlled evolution; for integrations/events, enforce compatibility rules at the boundary (validation + versioning discipline).

“How do you do zero-downtime schema evolution with long retention/replay?”​

Assume old data will be replayed years later, so new fields must remain optional and consumers must tolerate missing fields. For breaking changes, use either:

  • Dual-read/dual-logic via a version flag (rolling migration; sketched after this list), or
  • Versioned datasets/streams (v2) + backfill/migrate + retire v1 when semantics/security require it.
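
A minimal sketch of the dual-read option, keyed on a version flag. The schema_version field and the v1/v2 record shapes are invented for illustration.

# Sketch: consumer-side dual-read during a rolling schema migration.
# schema_version and the concrete v1/v2 shapes are illustrative assumptions.
def read_transaction(record: dict) -> dict:
    version = record.get("schema_version", 1)  # old records carry no flag, so treat them as v1
    if version >= 2:
        # v2: amount already split into value + currency
        return {"value": record["amount_value"], "currency": record["amount_currency"]}
    # v1: single "amount" string such as "12.34 EUR"; keep supporting it until v1 is retired
    value, currency = record["amount"].split(" ")
    return {"value": float(value), "currency": currency}

print(read_transaction({"amount": "12.34 EUR"}))
print(read_transaction({"schema_version": 2, "amount_value": 12.34, "amount_currency": "EUR"}))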

“What do you do for security-driven breaking changes (e.g., encryption)?”​

If old data must not remain, do a controlled migration: re-encrypt/reprocess into a v2 dataset/topic, cut consumers over, then producers, then delete/retire v1. If dual-support is acceptable short-term, use an encryption_version flag so consumers can decrypt both during rollout.

“When do you move from Python to Spark/Glue?”​

When the input volume or transformation complexity requires distributed compute (large files, heavy joins, or sustained high throughput). At that point, migrate from Python/pandas to AWS Glue Spark jobs; the interface stays the same and only the execution engine changes (Glue Python Shell → Glue Spark).
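
A minimal sketch of the same ingest step expressed as a Glue Spark job. The argument names mirror the CLI flags above, the timestamp column name is an assumption, and the run_id partition plus validation/quarantine logic are omitted.

# Sketch: Spark version of the ingest, runnable as an AWS Glue Spark job.
# Argument names and the "timestamp" column are illustrative; validation is omitted.
import sys
from awsglue.utils import getResolvedOptions  # Glue passes job arguments this way
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["input_path", "output_path"])
spark = SparkSession.builder.appName("ingest_transactions").getOrCreate()

df = spark.read.option("header", "true").csv(args["input_path"])
(
    df.withColumn("year", F.year(F.to_timestamp("timestamp")))
    .withColumn("month", F.month(F.to_timestamp("timestamp")))
    .write.mode("append")
    .partitionBy("year", "month")
    .parquet(args["output_path"])
)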

"Why DuckDB for SQL tests instead of a real database?"​

For the case study, DuckDB validates syntax and query logic in milliseconds without external dependencies. For production, we'd add integration tests against Aurora/RDS replicas to catch engine-specific behavior. See SQL Testing Guide for details.
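
A minimal sketch of that style of test; the table, columns, and the month-end rule are invented for illustration.

# Sketch: validate query logic in-memory with DuckDB, no external database needed.
import duckdb

def test_month_end_balance_is_latest_row_per_month():
    con = duckdb.connect()  # in-memory database
    con.execute("CREATE TABLE balances (account_id VARCHAR, as_of DATE, balance DOUBLE)")
    con.execute("INSERT INTO balances VALUES ('A', DATE '2025-01-15', 10.0), ('A', DATE '2025-01-31', 25.0)")
    result = con.execute(
        "SELECT account_id, arg_max(balance, as_of) FROM balances GROUP BY account_id"
    ).fetchall()
    assert result == [("A", 25.0)]  # the latest row in the month wins

test_month_end_balance_is_latest_row_per_month()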

"Why not Testcontainers for integration tests?"​

Current tests use temp files + mocking (fast, simple, validates core logic). Testcontainers would be the next layer for production: testing S3 API edge cases, multipart uploads, real Glue workflows. The trade-off is speed vs. realism; the case study prioritizes the former, while production needs both. See Testing Guide for comprehensive testing documentation.
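
A minimal sketch of the current temp-file + mocking style; the function under test is a toy stand-in, not the actual module API.

# Sketch: fast unit test with a temp CSV for input and a mocked S3 client.
# count_valid_rows is an illustrative stand-in for the real ETL core.
import csv
import tempfile
from unittest.mock import MagicMock

def count_valid_rows(path: str, s3_client) -> int:
    with open(path, newline="") as f:
        valid = sum(1 for row in csv.DictReader(f) if row.get("amount"))
    s3_client.put_object(Bucket="ohpen-processed", Key="summary.json", Body=str(valid).encode())
    return valid

def test_count_valid_rows():
    with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as f:
        writer = csv.DictWriter(f, fieldnames=["id", "amount"])
        writer.writeheader()
        writer.writerows([{"id": "1", "amount": "10"}, {"id": "2", "amount": ""}])
        path = f.name
    fake_s3 = MagicMock()                    # no network calls, no credentials needed
    assert count_valid_rows(path, fake_s3) == 1
    fake_s3.put_object.assert_called_once()  # the "upload" happened exactly once

test_count_valid_rows()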

© 2026 Stephen Adei • CC BY 4.0