Implementation Plan: AWS-Optimized Architecture
Executive Summary
This document summarizes all implementation decisions, scope clarifications, and AWS service optimizations based on the Ohpen Data Engineer job posting and case study requirements.
1. Scope Clarification
Task 1 (ETL Pipeline): Bronze → Silver Only
- ✅ Implemented: CSV from S3 → Validate → Parquet (Silver)
- ✅ Implemented: Quarantine and Condemned layers
- ❌ NOT in Scope: Gold layer creation (that's analytics/aggregation)
Task 2 (Architecture): Complete Data Lake Architecture Design
- ✅ Architecture Design: Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned)
- ✅ Gold Layer: Gold layer structure, governance, and ownership model are defined in Task 2 (folder organization, promotion workflows, business metric definitions)
- 📋 Status: Design only, not implemented in Task 1
Task 3 (SQL): Silver → Gold Aggregation Pattern
- ✅ Demonstrates: How to aggregate Silver data to create Gold-like reports
- ✅ Shows: Month-end balance aggregation pattern
- 📋 Note: Shows the aggregation pattern only, not Gold layer creation (see the sketch below)
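To make the pattern concrete, here is a minimal PySpark sketch of the same Silver → Gold aggregation (the actual Task 3 deliverable is Athena SQL); the bucket path and the column names `account_id`, `balance`, and `event_date` are assumptions, not the project's real schema.

```python
# Minimal sketch of the month-end balance aggregation pattern.
# Assumptions: Silver lives at the path below and carries
# account_id, balance, and event_date columns.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("month-end-balances").getOrCreate()

silver = spark.read.parquet("s3://example-bucket/silver/transactions/")

# The latest transaction per account per calendar month carries
# the month-end balance.
w = (Window
     .partitionBy("account_id", F.trunc("event_date", "month"))
     .orderBy(F.col("event_date").desc()))

month_end = (silver
             .withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .select("account_id",
                     F.trunc("event_date", "month").alias("month"),
                     F.col("balance").alias("month_end_balance")))
```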
2. Silver Bucket/Prefix: TransactionID Deduplication
Purpose: Prevents duplicate transaction processing across ETL runs
How It Works:
- Before processing, scan the existing Silver layer Parquet files
- Extract all `TransactionID` + `event_date` combinations already processed
- If an incoming transaction matches, auto-condemn it to prevent a duplicate (see the sketch at the end of this section)
Why Optional:
- First run: No Silver layer exists → Feature disabled
- Subsequent runs: Enable to prevent reprocessing same transactions
- Cost: Minimal (only reads TransactionID column, partition pruning)
Decision: ✅ KEEP - Valuable production feature
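A minimal sketch of the check, assuming the Silver layer is partitioned by `event_date` (which is what makes partition pruning effective); the paths and the Condemned handoff are illustrative:

```python
# Sketch of the TransactionID + event_date deduplication check.
# Paths are placeholders; Parquet's columnar layout means only the
# two key columns are actually read from Silver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-dedup").getOrCreate()

incoming = (spark.read.option("header", True)
            .csv("s3://example-bucket/bronze/transactions.csv"))

seen = (spark.read.parquet("s3://example-bucket/silver/transactions/")
        .select("TransactionID", "event_date"))

keys = ["TransactionID", "event_date"]
fresh = incoming.join(seen, on=keys, how="left_anti")  # continue to Silver
dupes = incoming.join(seen, on=keys, how="left_semi")  # route to Condemned
```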
3. What Events Are Scheduled?
Scheduled ETL Runs:
- Daily: Process new transaction CSV files (default: 2 AM UTC)
- Monthly: Alternative schedule (1st day of month)
- On-demand: Backfills, reprocessing
EventBridge Rule:
schedule_expression = "cron(0 2 * * ? *)" # Daily at 2 AM UTC
What Triggers:
- EventBridge → Step Functions → Glue Job
- Processes new CSV files in Bronze layer
- Each run processes one or more CSV files (rule wiring sketched below)
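For illustration, the same rule-to-state-machine wiring expressed with boto3 (the project itself provisions this via Terraform); the rule name, region, and ARNs are placeholders:

```python
# Illustrative boto3 equivalent of the Terraform EventBridge rule.
# All names and ARNs below are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-etl-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",  # daily at 2 AM UTC
    State="ENABLED",
)

# The role must allow events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule="daily-etl-trigger",
    Targets=[{
        "Id": "etl-state-machine",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:etl-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-invoke",
    }],
)
```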
4. What Does Step Functions Do?
Step Functions orchestrates the ETL pipeline (a definition sketch follows the benefits list):
1. RunETL State
└─> Invokes AWS Glue Spark job synchronously
└─> Automatic retry on transient failures (3 attempts, exponential backoff)
└─> Handles Glue throttling and service exceptions
2. ValidateOutput State
└─> Verifies ETL job completed successfully
└─> Checks for the `_SUCCESS` marker (handled by consumers)
3. Success/Failure States
└─> Publishes CloudWatch metrics
└─> Logs execution details
Benefits:
- Automatic retry on failures
- Visual workflow monitoring
- Error handling and recovery
- Cost: ~$0.01/month for daily runs
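A sketch of what that state machine definition could look like in Amazon States Language, built here as a Python dict; the Glue job name, the retry tuning, and the `Pass`-state placeholder for validation are assumptions:

```python
# Amazon States Language sketch of the three states above.
# glue:startJobRun.sync blocks until the Glue job finishes.
import json

definition = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "bronze-to-silver-etl"},  # placeholder
            "Retry": [{
                "ErrorEquals": ["Glue.AWSGlueException", "States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,  # exponential backoff
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failure"}],
            "Next": "ValidateOutput",
        },
        # Placeholder: the real state verifies the ETL output
        # (e.g. the _SUCCESS marker).
        "ValidateOutput": {"Type": "Pass", "Next": "Success"},
        "Success": {"Type": "Succeed"},
        "Failure": {"Type": "Fail"},
    },
}

print(json.dumps(definition, indent=2))
```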
5. Implementation Decisions Summary
| Feature | Decision | Status |
|---|---|---|
| Gold Layer | Keep in Task 2 only | ✅ Removed from Task 1, kept in Task 2 architecture |
| Task 3 Scope | SQL aggregation pattern | ✅ Clarified as Silver → Gold aggregation demo |
| Silver Bucket/Prefix | ✅ Keep | TransactionID deduplication - valuable feature |
| Iceberg | Future only | ✅ Updated: "Currently Parquet-only. Iceberg is future." |
| Step Functions | ✅ Implement | ✅ Added to Terraform |
| EventBridge | ✅ Implement | ✅ Added to Terraform |
| Lambda | ❌ Skip | ✅ Removed from implementation |
| Condemned Retention | 7 years | ✅ Updated: 3650 → 2555 days |
6. AWS Architecture: Cost-Optimized
Services Used (Implemented)
| Service | Usage | Cost/Month | Utility |
|---|---|---|---|
| S3 | Storage (Bronze/Silver/Quarantine) | ~$1.15 | Essential |
| Glue | ETL execution (Spark) | ~$26.40 | Essential |
| Step Functions | Orchestration | ~$0.01 | High value |
| EventBridge | Scheduling | Free | High value |
| Athena | SQL queries | ~$0.25 | High value |
| Glue Catalog | Metadata | Free | Required |
| CloudWatch | Monitoring | Free tier | Essential |
| IAM | Access control | Free | Essential |
Total: ~$27.81/month for 1.5M transactions
Services NOT Used (Cost/Utility Decision)
| Service | Decision | Rationale |
|---|---|---|
| Lambda | ❌ Skip | Step Functions better for orchestration |
| DynamoDB | ❌ Skip | Silver layer scan sufficient |
| Aurora | ❌ Skip | No relational/OLTP workload in scope |
| Iceberg | ❌ Skip (future) | Parquet-only sufficient for current scale |
7. Documentation Updates Completed
✅ Scope Clarifications
- Removed Gold layer from Task 1 deliverables
- Clarified Task 2 = Architecture design (includes Gold structure)
- Clarified Task 3 = SQL aggregation pattern (Silver → Gold)
✅ Iceberg References
- Updated all docs: "Currently Parquet-only. Iceberg is a future enhancement."
- Removed from "current implementation" sections
- Kept in "future considerations" only
✅ Condemned Retention
- Updated Terraform: 3650 → 2555 days (7 years)
- Updated cleanup script default: 10 → 7 years
- Updated all documentation
- Human approval required before deletion (see `HUMAN_VALIDATION_POLICY.md`)
✅ Step Functions & EventBridge
- Added to Terraform
- Documented in CI/CD workflow
- Explained scheduling and orchestration
✅ Human Validation Policy
- Created comprehensive `HUMAN_VALIDATION_POLICY.md`
- Added approval checkpoints for Silver layer writes
- Added approval workflow for condemned data deletion
- Updated all diagrams and docs to reference policy
8. Final Architecture
EventBridge (Daily 2 AM UTC)
↓
Step Functions (Orchestration)
↓
Glue Spark Job (ETL Execution)
↓
S3 Bronze → Silver + Quarantine + Condemned
↓
Glue Catalog → Athena (SQL Queries)
Services: S3, Glue, Step Functions, EventBridge, Athena, Glue Catalog, CloudWatch, IAM
9. Next Steps (Deployment)
- Review Terraform: `tasks/04_devops_cicd/infra/terraform/main.tf`
- Test cleanup script: run with `--dry-run` first
- Apply Terraform: `terraform apply` to create the infrastructure
- Test Athena: query the Silver layer with SQL (smoke test sketched below)
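A minimal smoke test for the last step via boto3; the database, table, and results location are placeholders for this project's actual values:

```python
# Illustrative Athena smoke test for the Silver layer.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM silver_transactions",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print("QueryExecutionId:", resp["QueryExecutionId"])
```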
Last Updated: 2026-01-21
Related Documentation
Implementation Documentation
- Terraform Configuration - Infrastructure as code
- Human Validation Policy - Approval workflows
- ETL Assumptions - ETL design decisions
- Data Lake Architecture - Architecture design
- CI/CD Workflow - Deployment and orchestration
Task Documentation
- Task 1: ETL Pipeline - Data ingestion and transformation
- Task 2: Architecture - Data lake architecture design
- Task 3: SQL - SQL analytics query
- Task 4: DevOps - CI/CD and infrastructure
- Task 5: Communication - Stakeholder communication
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- AWS Services Analysis - Service selection rationale
- PySpark Optimization - Performance considerations
Submission Documentation
- Executive Summary - High-level overview
- Submission Checklist - Submission verification checklist