Skip to main content

Implementation Plan: AWS-Optimized Architecture

Executive Summary

This document summarizes all implementation decisions, scope clarifications, and AWS service optimizations based on the Ohpen Data Engineer job posting and case study requirements.


1. Scope Clarification

Task 1 (ETL Pipeline): Bronze → Silver Only

  • Implemented: CSV from S3 → Validate → Parquet (Silver)
  • Implemented: Quarantine and Condemned layers
  • NOT in Scope: Gold layer creation (that's analytics/aggregation)

Task 2 (Architecture): Complete Data Lake Architecture Design

  • Architecture Design: Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned)
  • Gold Layer: The Gold layer structure, governance, and ownership model are best described in Task 2 (folder organization, promotion workflows, business metric definitions)
  • 📋 Status: Design only, not implemented in Task 1

Task 3 (SQL): Silver → Gold Aggregation Pattern

  • Demonstrates: How to aggregate Silver data to create Gold-like reports
  • Shows: Month-end balance aggregation pattern
  • 📋 Note: Shows aggregation pattern, not Gold layer creation

2. Silver Bucket/Prefix: TransactionID Deduplication

Purpose: Prevents duplicate transaction processing across ETL runs

How It Works:

  1. Before processing, scan existing Silver layer Parquet files
  2. Extract all TransactionID + event_date combinations already processed
  3. If incoming transaction matches → Auto-condemn (prevent duplicate)

Why Optional:

  • First run: No Silver layer exists → Feature disabled
  • Subsequent runs: Enable to prevent reprocessing same transactions
  • Cost: Minimal (only reads TransactionID column, partition pruning)

Decision: ✅ KEEP - Valuable production feature


3. What Events Are Scheduled?

Scheduled ETL Runs:

  • Daily: Process new transaction CSV files (default: 2 AM UTC)
  • Monthly: Alternative schedule (1st day of month)
  • On-demand: Backfills, reprocessing

EventBridge Rule:

schedule_expression = "cron(0 2 * * ? *)"  # Daily at 2 AM UTC

What Triggers:

  • EventBridge → Step Functions → Glue Job
  • Processes new CSV files in Bronze layer
  • Each run processes one or more CSV files

4. What Does Step Functions Do?

Step Functions Orchestrates ETL Pipeline:


1. RunETL State

└─> Invokes AWS Glue Spark job synchronously
└─> Automatic retry on transient failures (3 attempts, exponential backoff)
└─> Handles Glue throttling and service exceptions

2. ValidateOutput State

└─> Verifies ETL job completed successfully
└─> Checks for _SUCCESS marker (handled by consumers)

3. Success/Failure States

└─> Publishes CloudWatch metrics
└─> Logs execution details

Benefits:

  • Automatic retry on failures
  • Visual workflow monitoring
  • Error handling and recovery
  • Cost: ~$0.01/month for daily runs

5. Implementation Decisions Summary

FeatureDecisionStatus
Gold LayerKeep in Task 2 only✅ Removed from Task 1, kept in Task 2 architecture
Task 3 ScopeSQL aggregation pattern✅ Clarified as Silver → Gold aggregation demo
Silver Bucket/Prefix✅ KeepTransactionID deduplication - valuable feature
IcebergFuture only✅ Updated: "Currently Parquet-only. Iceberg is future."
Step Functions✅ Implement✅ Added to Terraform
EventBridge✅ Implement✅ Added to Terraform
Lambda❌ Skip✅ Removed from implementation
Condemned Retention7 years✅ Updated: 3650 → 2555 days

6. AWS Architecture: Cost-Optimized

Services Used (Implemented)

ServiceUsageCost/MonthUtility
S3Storage (Bronze/Silver/Quarantine)~$1.15Essential
GlueETL execution (Spark)~$26.40Essential
Step FunctionsOrchestration~$0.01High value
EventBridgeSchedulingFreeHigh value
AthenaSQL queries~$0.25High value
Glue CatalogMetadataFreeRequired
CloudWatchMonitoringFree tierEssential
IAMAccess controlFreeEssential

Total: ~$27.81/month for 1.5M transactions

Services NOT Used (Cost/Utility Decision)

ServiceDecisionRationale
Lambda❌ SkipStep Functions better for orchestration
DynamoDB❌ SkipSilver layer scan sufficient
Aurora❌ SkipNot needed
Iceberg❌ Skip (future)Parquet-only sufficient for current scale

7. Documentation Updates Completed

✅ Scope Clarifications

  • Removed Gold layer from Task 1 deliverables
  • Clarified Task 2 = Architecture design (includes Gold structure)
  • Clarified Task 3 = SQL aggregation pattern (Silver → Gold)

✅ Iceberg References

  • Updated all docs: "Currently Parquet-only. Iceberg is a future enhancement."
  • Removed from "current implementation" sections
  • Kept in "future considerations" only

✅ Condemned Retention

  • Updated Terraform: 3650 → 2555 days (7 years)
  • Updated cleanup script default: 10 → 7 years
  • Updated all documentation
  • Human approval required before deletion (see HUMAN_VALIDATION_POLICY.md)

✅ Step Functions & EventBridge

  • Added to Terraform
  • Documented in CI/CD workflow
  • Explained scheduling and orchestration

✅ Human Validation Policy

  • Created comprehensive HUMAN_VALIDATION_POLICY.md
  • Added approval checkpoints for Silver layer writes
  • Added approval workflow for condemned data deletion
  • Updated all diagrams and docs to reference policy

8. Final Architecture

EventBridge (Daily 2 AM UTC)

Step Functions (Orchestration)

Glue Spark Job (ETL Execution)

S3 Bronze → Silver + Quarantine + Condemned

Glue Catalog → Athena (SQL Queries)

Services: S3, Glue, Step Functions, EventBridge, Athena, Glue Catalog, CloudWatch, IAM


9. Next Steps (Deployment)

  1. Review Terraform: tasks/04_devops_cicd/infra/terraform/main.tf
  2. Test Cleanup Script: Run with --dry-run first
  3. Apply Terraform: terraform apply to create infrastructure
  4. Test Athena: Query Silver layer with SQL

Last Updated: 2026-01-21


Implementation Documentation

Task Documentation

Technical Documentation

Submission Documentation