Implementation Plan: AWS-Optimized Architecture
Executive Summary
This document summarizes all implementation decisions, scope clarifications, and AWS service optimizations based on the Ohpen Data Engineer job posting and case study requirements.
1. Scope Clarification
Task 1 (ETL Pipeline): Bronze → Silver Only
- ✅ Implemented: CSV from S3 → Validate → Parquet (Silver)
- ✅ Implemented: Quarantine and Condemned layers
- ❌ NOT in Scope: Gold layer creation (that's analytics/aggregation)
Task 2 (Architecture): Complete Data Lake Architecture Design
- ✅ Architecture Design: Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned)
- ✅ Gold Layer: Gold layer structure, governance, and ownership model are defined in Task 2 (folder organization, promotion workflows, business metric definitions)
- 📋 Status: Design only, not implemented in Task 1
Task 3 (SQL): Silver → Gold Aggregation Pattern
- ✅ Demonstrates: How to aggregate Silver data to create Gold-like reports
- ✅ Shows: Month-end balance aggregation pattern
- 📋 Note: Shows the aggregation pattern only, not Gold layer creation (see the sketch below)
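To make the pattern concrete, here is a minimal PySpark sketch of the same Silver → Gold aggregation (the actual Task 3 deliverable is Athena SQL); the bucket path and the column names `account_id`, `balance`, and `event_date` are assumptions, not the project's real schema.

```python
# Minimal sketch of the month-end balance aggregation pattern.
# Assumptions: Silver lives at the path below and carries
# account_id, balance, and event_date columns.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("month-end-balances").getOrCreate()

silver = spark.read.parquet("s3://example-bucket/silver/transactions/")

# The latest transaction per account per calendar month carries
# the month-end balance.
w = (Window
     .partitionBy("account_id", F.trunc("event_date", "month"))
     .orderBy(F.col("event_date").desc()))

month_end = (silver
             .withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .select("account_id",
                     F.trunc("event_date", "month").alias("month"),
                     F.col("balance").alias("month_end_balance")))
```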
2. Silver Bucket/Prefix: TransactionID Deduplication
Purpose: Prevents duplicate transaction processing across ETL runs
How It Works:
- Before processing, scan the existing Silver layer Parquet files
- Extract all `TransactionID` + `event_date` combinations already processed
- If an incoming transaction matches, auto-condemn it to prevent a duplicate (see the sketch at the end of this section)
Why Optional:
- First run: No Silver layer exists → Feature disabled
- Subsequent runs: Enable to prevent reprocessing same transactions
- Cost: Minimal (only reads TransactionID column, partition pruning)
Decision: ✅ KEEP - Valuable production feature
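A minimal sketch of the check, assuming the Silver layer is partitioned by `event_date` (which is what makes partition pruning effective); the paths and the Condemned handoff are illustrative:

```python
# Sketch of the TransactionID + event_date deduplication check.
# Paths are placeholders; Parquet's columnar layout means only the
# two key columns are actually read from Silver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-dedup").getOrCreate()

incoming = (spark.read.option("header", True)
            .csv("s3://example-bucket/bronze/transactions.csv"))

seen = (spark.read.parquet("s3://example-bucket/silver/transactions/")
        .select("TransactionID", "event_date"))

keys = ["TransactionID", "event_date"]
fresh = incoming.join(seen, on=keys, how="left_anti")  # continue to Silver
dupes = incoming.join(seen, on=keys, how="left_semi")  # route to Condemned
```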
3. What Events Are Scheduled?
Scheduled ETL Runs:
- Daily: Process new transaction CSV files (default: 2 AM UTC)
- Monthly: Alternative schedule (1st day of month)
- On-demand: Backfills, reprocessing
EventBridge Rule:
schedule_expression = "cron(0 2 * * ? *)" # Daily at 2 AM UTC
What Triggers:
- EventBridge → Step Functions → Glue Job
- Processes new CSV files in Bronze layer
- Each run processes one or more CSV files (rule wiring sketched below)
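For illustration, the same rule-to-state-machine wiring expressed with boto3 (the project itself provisions this via Terraform); the rule name, region, and ARNs are placeholders:

```python
# Illustrative boto3 equivalent of the Terraform EventBridge rule.
# All names and ARNs below are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-etl-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",  # daily at 2 AM UTC
    State="ENABLED",
)

# The role must allow events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule="daily-etl-trigger",
    Targets=[{
        "Id": "etl-state-machine",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:etl-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-invoke",
    }],
)
```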
4. What Does Step Functions Do?
Step Functions orchestrates the ETL pipeline (a definition sketch follows the benefits list):
1. RunETL State
└─> Invokes AWS Glue Spark job synchronously
└─> Automatic retry on transient failures (3 attempts, exponential backoff)
└─> Handles Glue throttling and service exceptions
2. ValidateOutput State
└─> Verifies ETL job completed successfully
└─> Checks for the `_SUCCESS` marker (handled by consumers)
3. Success/Failure States
└─> Publishes CloudWatch metrics
└─> Logs execution details
Benefits:
- Automatic retry on failures
- Visual workflow monitoring
- Error handling and recovery
- Cost: ~$0.01/month for daily runs
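A sketch of what that state machine definition could look like in Amazon States Language, built here as a Python dict; the Glue job name, the retry tuning, and the `Pass`-state placeholder for validation are assumptions:

```python
# Amazon States Language sketch of the three states above.
# glue:startJobRun.sync blocks until the Glue job finishes.
import json

definition = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "bronze-to-silver-etl"},  # placeholder
            "Retry": [{
                "ErrorEquals": ["Glue.AWSGlueException", "States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,  # exponential backoff
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failure"}],
            "Next": "ValidateOutput",
        },
        # Placeholder: the real state verifies the ETL output
        # (e.g. the _SUCCESS marker).
        "ValidateOutput": {"Type": "Pass", "Next": "Success"},
        "Success": {"Type": "Succeed"},
        "Failure": {"Type": "Fail"},
    },
}

print(json.dumps(definition, indent=2))
```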
5. Implementation Decisions Summary
| Feature | Decision | Status |
|---|---|---|
| Gold Layer | Keep in Task 2 only | ✅ Removed from Task 1, kept in Task 2 architecture |
| Task 3 Scope | SQL aggregation pattern | ✅ Clarified as Silver → Gold aggregation demo |
| Silver Bucket/Prefix | ✅ Keep | TransactionID deduplication - valuable feature |
| Iceberg | Future only | ✅ Updated: "Currently Parquet-only. Iceberg is future." |
| Step Functions | ✅ Implement | ✅ Added to Terraform |
| EventBridge | ✅ Implement | ✅ Added to Terraform |
| Lambda | ❌ Skip | ✅ Removed from implementation |
| Condemned Retention | 7 years | ✅ Updated: 3650 → 2555 days |
6. AWS Architecture: Cost-Optimized
Services Used (Implemented)
| Service | Usage | Cost/Month | Utility |
|---|---|---|---|
| S3 | Storage (Bronze/Silver/Quarantine) | ~$1.15 | Essential |
| Glue | ETL execution (Spark) | ~$26.40 | Essential |
| Step Functions | Orchestration | ~$0.01 | High value |
| EventBridge | Scheduling | Free | High value |
| Athena | SQL queries | ~$0.25 | High value |
| Glue Catalog | Metadata | Free | Required |
| CloudWatch | Monitoring | Free tier | Essential |
| IAM | Access control | Free | Essential |
Total: ~$27.81/month for 1.5M transactions
Services NOT Used (Cost/Utility Decision)
| Service | Decision | Rationale |
|---|---|---|
| Lambda | ❌ Skip | Step Functions better for orchestration |
| DynamoDB | ❌ Skip | Silver layer scan sufficient |
| Aurora | ❌ Skip | No relational/OLTP workload in scope |
| Iceberg | ❌ Skip (future) | Parquet-only sufficient for current scale |
7. Documentation Updates Completed
✅ Scope Clarifications
- Removed Gold layer from Task 1 deliverables
- Clarified Task 2 = Architecture design (includes Gold structure)
- Clarified Task 3 = SQL aggregation pattern (Silver → Gold)
✅ Iceberg References
- Updated all docs: "Currently Parquet-only. Iceberg is a future enhancement."
- Removed from "current implementation" sections
- Kept in "future considerations" only
✅ Condemned Retention
- Updated Terraform: 3650 → 2555 days (7 years)
- Updated cleanup script default: 10 → 7 years
- Updated all documentation
- Human approval required before deletion (see `HUMAN_VALIDATION_POLICY.md`)
✅ Step Functions & EventBridge
- Added to Terraform
- Documented in CI/CD workflow
- Explained scheduling and orchestration
✅ Human Validation Policy
- Created comprehensive `HUMAN_VALIDATION_POLICY.md`
- Added approval checkpoints for Silver layer writes
- Added approval workflow for condemned data deletion
- Updated all diagrams and docs to reference policy
8. Final Architecture
EventBridge (Daily 2 AM UTC)
↓
Step Functions (Orchestration)
↓
Glue Spark Job (ETL Execution)
↓
S3 Bronze → Silver + Quarantine + Condemned
↓
Glue Catalog → Athena (SQL Queries)
Services: S3, Glue, Step Functions, EventBridge, Athena, Glue Catalog, CloudWatch, IAM
9. Next Steps (Deployment)
- Review Terraform: `tasks/04_devops_cicd/infra/terraform/main.tf`
- Test cleanup script: run with `--dry-run` first
- Apply Terraform: `terraform apply` to create the infrastructure
- Test Athena: query the Silver layer with SQL (smoke test sketched below)
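A minimal smoke test for the last step via boto3; the database, table, and results location are placeholders for this project's actual values:

```python
# Illustrative Athena smoke test for the Silver layer.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM silver_transactions",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print("QueryExecutionId:", resp["QueryExecutionId"])
```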
Last Updated: 2026-01-21
Related Documentation
Implementation Documentation
- Terraform Configuration - Infrastructure as code
- Human Validation Policy - Approval workflows
- ETL Assumptions - ETL design decisions
- Data Lake Architecture - Architecture design
- CI/CD Workflow - Deployment and orchestration
Task Documentation
- Task 1: ETL Pipeline - Data ingestion and transformation
- Task 2: Architecture - Data lake architecture design
- Task 3: SQL - SQL analytics query
- Task 4: DevOps - CI/CD and infrastructure
- Task 5: Communication - Stakeholder communication
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- AWS Services Analysis - Service selection rationale
- PySpark Optimization - Performance considerations
Submission Documentation
- Executive Summary - High-level overview
- Submission Checklist - Submission verification checklist