# AWS Services Analysis & Opportunities

Based on the Ohpen Data Engineer job posting requirements:

> "Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization."
## AWS Services Usage Overview

### Current AWS Service Usage

#### ✅ Currently Implemented

| Service | Usage | Location |
| --------- | ------- | ---------- |
| S3 | ✅ Primary storage for Bronze/Silver/Gold layers | `infra/terraform/main.tf` - 4 buckets (raw, processed, quarantine, artifacts) |
| Glue | ✅ ETL job definition (Python Shell) | `infra/terraform/main.tf` - `aws_glue_job.transaction_etl` |
| CloudWatch | ✅ Monitoring & alarms | `infra/terraform/main.tf` - `aws_cloudwatch_metric_alarm` |
| Athena | 📝 Mentioned in architecture/docs | Referenced in `architecture.md`, `HANDOUT.md` but not in Terraform |
#### ❌ Not Currently Implemented

| Service | Status | Opportunity |
| --------- | -------- | ------------- |
| Lambda | ❌ Not used | High-value opportunities (see below) |
| DynamoDB | ❌ Not used | Metadata & state management |
| Aurora | ❌ Not used | Operational data & reference data |
## Opportunities to Leverage More AWS Services

### 1. **AWS Lambda** - Event-Driven Processing & Automation

#### Current Gap

- ETL runs are scheduled/manual (no event-driven triggers)
- No serverless automation for common tasks

#### Opportunities

## A. S3 Event-Driven Ingestion

```terraform
# Trigger ETL when new CSV files arrive in the Bronze layer
resource "aws_lambda_function" "s3_trigger_etl" {
  function_name = "ohpen-s3-trigger-etl"
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.11"
  role          = aws_iam_role.lambda_trigger_role.arn
  filename      = "build/s3_trigger_etl.zip" # illustrative deployment package path

  environment {
    variables = {
      GLUE_JOB_NAME = aws_glue_job.transaction_etl.name
    }
  }
}

# S3 must be granted permission to invoke the function, or the
# bucket notification below fails to create
resource "aws_lambda_permission" "allow_s3_invoke" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.s3_trigger_etl.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.data_lake_buckets["ohpen-raw"].arn
}

resource "aws_s3_bucket_notification" "bronze_ingestion" {
  bucket = aws_s3_bucket.data_lake_buckets["ohpen-raw"].id

  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_trigger_etl.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "bronze/mortgages/transactions/"
    filter_suffix       = ".csv"
  }

  depends_on = [aws_lambda_permission.allow_s3_invoke]
}
```
## Use Cases
- **Auto-trigger ETL** when new CSV files land in S3 (replaces scheduled runs)
- **File validation** before triggering ETL (check file size, format)
- **Metadata extraction** (extract file metadata, update catalog)
- **Quarantine notification** (send alerts when quarantine rate exceeds threshold)
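The handler behind `lambda_function.lambda_handler` could look like the sketch below. It is illustrative, not the project's actual code: the `--input_path` Glue argument and helper names are assumptions, and only `GLUE_JOB_NAME` comes from the Terraform above. boto3 is imported lazily so the pure helpers stay unit-testable without AWS.

```python
import json
import os
import urllib.parse


def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((s3["bucket"]["name"], key))
    return pairs


def should_process(key, prefix="bronze/mortgages/transactions/", suffix=".csv"):
    """Defensive re-check of the S3 notification filter inside the function."""
    return key.startswith(prefix) and key.endswith(suffix)


def lambda_handler(event, context):
    import boto3  # lazy import: helpers above run without AWS credentials

    glue = boto3.client("glue")
    started = []
    for bucket, key in parse_s3_event(event):
        if not should_process(key):
            continue
        response = glue.start_job_run(
            JobName=os.environ["GLUE_JOB_NAME"],
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # assumed job argument
        )
        started.append(response["JobRunId"])
    return {"statusCode": 200, "body": json.dumps({"job_runs": started})}
```

Re-checking the prefix/suffix in code is deliberate: the S3 notification filter is the primary gate, but it keeps the function safe if it is ever wired to a broader event source.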
## B. Post-ETL Validation & Promotion
```terraform
# Lambda to validate ETL output and promote to current/
resource "aws_lambda_function" "validate_and_promote" {
  function_name = "ohpen-validate-promote"
  # Validates _SUCCESS marker, row counts, schema compliance
  # Updates _LATEST.json and copies to current/ prefix
}
```
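The validation logic that stub describes could be sketched as follows. This is an assumption about shape, not the project's implementation: the helper name, the key-listing convention, and the thresholds are all illustrative.

```python
def validate_run(keys, row_count, expected_min_rows):
    """Check an ETL output prefix before promotion to current/.

    `keys` is the list of object keys under the run prefix (e.g. collected
    from a paginated S3 list call); checks mirror the Terraform stub above.
    Returns a list of failure descriptions; an empty list means promotable.
    """
    failures = []
    if not any(k.endswith("_SUCCESS") for k in keys):
        failures.append("missing _SUCCESS marker")
    if not any(k.endswith(".parquet") for k in keys):
        failures.append("no Parquet output files found")
    if row_count < expected_min_rows:
        failures.append(f"row count {row_count} below minimum {expected_min_rows}")
    return failures
```

Only when the returned list is empty would the Lambda update `_LATEST.json` and copy the run's files to the `current/` prefix.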
## C. Data Quality Monitoring
```terraform
# Lambda to analyze quarantine patterns and generate reports
resource "aws_lambda_function" "quarantine_analyzer" {
  function_name = "ohpen-quarantine-analyzer"
  # Runs after each ETL, analyzes quarantine reasons
  # Updates DynamoDB with quality metrics
}
```
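The core of such an analyzer is a small aggregation; a sketch under the assumption that quarantined records carry a `quarantine_reason` field (that field name and the function name are illustrative):

```python
from collections import Counter


def quarantine_summary(quarantined_records, total_rows):
    """Aggregate quarantine reasons and compute the quarantine rate for a run.

    The resulting dict is what the Lambda would write to DynamoDB as the
    run's quality metrics.
    """
    reasons = Counter(r["quarantine_reason"] for r in quarantined_records)
    rate = len(quarantined_records) / total_rows if total_rows else 0.0
    return {"quarantine_rate": rate, "by_reason": dict(reasons)}
```

Keeping the per-reason breakdown (not just the rate) is what makes threshold alerting and trend analysis possible later.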
## Benefits
- **Cost-effective**: Pay per invocation, no idle costs
- **Scalable**: Auto-scales with S3 event volume
- **Event-driven**: Real-time processing instead of polling
- **Integration**: Easy integration with S3, Glue, Step Functions
---
### 2. **AWS Glue** - Enhanced Usage Beyond Basic ETL
#### Current Usage
- ✅ Basic Glue Job defined (Python Shell)
- ❌ Missing: Glue Data Catalog tables, Glue Crawlers, Glue Workflows
#### Opportunities
## A. Glue Data Catalog Tables
```terraform
resource "aws_glue_catalog_database" "ohpen_data_lake" {
  name = "ohpen_data_lake"
}

resource "aws_glue_catalog_table" "silver_transactions" {
  name          = "silver_transactions"
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  table_type    = "EXTERNAL_TABLE" # required for Athena to query the table

  storage_descriptor {
    location      = "s3://ohpen-processed/silver/mortgages/transactions/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "transaction_id"
      type = "string"
    }
    columns {
      name = "customer_id"
      type = "string"
    }
    columns {
      name = "transaction_amount"
      type = "decimal(18,2)"
    }
    columns {
      name = "currency"
      type = "string"
    }
    columns {
      name = "transaction_timestamp"
      type = "timestamp"
    }
  }

  partition_keys {
    name = "event_year"
    type = "int"
  }
  partition_keys {
    name = "event_month"
    type = "string"
  }
}
```
## B. Glue Crawlers
```terraform
resource "aws_glue_crawler" "silver_crawler" {
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  name          = "ohpen-silver-crawler"
  role          = aws_iam_role.glue_crawler_role.arn

  s3_target {
    path = "s3://ohpen-processed/silver/mortgages/transactions/"
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }
}
```
## C. Glue Workflows
```terraform
resource "aws_glue_workflow" "transaction_pipeline" {
  name = "ohpen-transaction-pipeline"
  # Orchestrates: Crawl → ETL → Validate → Promote
}
```
## Benefits
- **Athena Integration**: Glue Catalog is required for Athena queries
- **Schema Discovery**: Crawlers auto-detect partitions and schema changes
- **Workflow Orchestration**: Coordinate multiple Glue jobs
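As an alternative to waiting for a crawler run, the ETL (or a Lambda) can register new partitions directly after each write. A sketch using boto3's Glue `create_partition`; the helper names are illustrative, and a production call would typically copy the table's full StorageDescriptor (formats, serde) rather than only the location:

```python
def partition_input(base_location, event_year, event_month):
    """Build the PartitionInput for the event_year/event_month scheme above."""
    return {
        "Values": [str(event_year), event_month],
        "StorageDescriptor": {
            # minimal sketch: only the partition location is set here
            "Location": f"{base_location}event_year={event_year}/event_month={event_month}/",
        },
    }


def register_partition(database, table, base_location, event_year, event_month):
    import boto3  # lazy import keeps partition_input unit-testable without AWS

    glue = boto3.client("glue")
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput=partition_input(base_location, event_year, event_month),
    )
```

Direct registration makes new partitions queryable in Athena immediately, while the crawler remains useful as a safety net for schema drift.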
---
### 3. **Amazon Athena** - SQL Analytics (Currently Only Referenced)
#### Current Gap
- ✅ Mentioned in architecture/docs as target for SQL queries
- ❌ **No Terraform resources** (no workgroups, no query result buckets)
- ❌ **No Glue Catalog tables** (Athena can't query without catalog)
#### Opportunities
## A. Athena Workgroup & Configuration
```terraform
resource "aws_athena_workgroup" "ohpen_analytics" {
  name = "ohpen-analytics"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://ohpen-artifacts/athena-results/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }

    engine_version {
      selected_engine_version = "Athena engine version 3"
    }
  }
}
```
## B. Query Result Bucket
```terraform
# Already have ohpen-artifacts bucket, but should add explicit Athena results path
# s3://ohpen-artifacts/athena-results/
C. Integration with Task 3 SQL
- The SQL query in
tasks/03_sql/balance_history_2024_q1.sqlis designed for Athena - Need Glue Catalog tables (see Glue section above) to make it work
- Add Athena workgroup configuration for cost optimization
## Benefits

- **Serverless SQL**: No infrastructure to manage
- **Cost-effective**: Pay per query, partition pruning reduces costs
- **Integration**: Works seamlessly with Glue Catalog and S3
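Pinning queries to the workgroup is what makes the result location and encryption settings above actually apply. A sketch using boto3's Athena `start_query_execution` (the helper names are illustrative; the workgroup and database names match the Terraform above):

```python
def query_params(sql, workgroup="ohpen-analytics", database="ohpen_data_lake"):
    """Build StartQueryExecution parameters pinned to the workgroup, so its
    enforced result location and encryption settings apply to the query."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
    }


def run_query(sql):
    import boto3  # lazy import keeps query_params unit-testable without AWS

    athena = boto3.client("athena")
    return athena.start_query_execution(**query_params(sql))["QueryExecutionId"]
```

Filtering on `event_year`/`event_month` in the SQL (e.g. `WHERE event_year = 2024 AND event_month = '01'`) is what lets Athena prune partitions and keep per-query cost down.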
---

### 4. **Amazon DynamoDB** - Metadata & State Management

#### Current Gap

- No metadata store for run tracking, schema versions, or operational state
- Currently relies on S3 `_LATEST.json` files (not queryable)

#### Opportunities

## A. ETL Run Metadata Table

```terraform
resource "aws_dynamodb_table" "etl_runs" {
  name         = "ohpen-etl-runs"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "run_id"
  range_key    = "dataset_name"

  attribute {
    name = "run_id"
    type = "S"
  }

  attribute {
    name = "dataset_name"
    type = "S"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }
}
```
## Use Cases
- **Run Tracking**: Store run_id, status, metrics, timestamps
- **Schema Version Registry**: Track schema versions per dataset
- **Quarantine Metrics**: Store quarantine statistics per run
- **Lineage**: Track which raw files produced which processed outputs
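Writing a run record to the `ohpen-etl-runs` table could look like this sketch (the item fields beyond the two key attributes, and the 90-day retention, are assumptions). Note that DynamoDB rejects Python floats, so numeric metric values should be `int` or `Decimal`:

```python
import time


def run_item(run_id, dataset_name, status, metrics, retention_days=90):
    """Build a DynamoDB item for the ohpen-etl-runs table.

    The `ttl` attribute is an epoch timestamp matching the table's TTL
    configuration, so old run records expire automatically.
    """
    return {
        "run_id": run_id,
        "dataset_name": dataset_name,
        "status": status,
        "metrics": metrics,  # values must be int/Decimal, not float
        "ttl": int(time.time()) + retention_days * 86400,
    }


def record_run(item, table_name="ohpen-etl-runs"):
    import boto3  # lazy import keeps run_item unit-testable without AWS

    boto3.resource("dynamodb").Table(table_name).put_item(Item=item)
```

With `run_id` as hash key and `dataset_name` as range key, one `put_item` per dataset per run gives cheap lookups of "what did run X do to dataset Y".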
## B. Data Quality Metrics Table
```terraform
resource "aws_dynamodb_table" "data_quality_metrics" {
  name         = "ohpen-data-quality"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "metric_date"

  attribute {
    name = "dataset_name"
    type = "S"
  }

  attribute {
    name = "metric_date"
    type = "S"
  }
}
```
## Use Cases
- **Quality Dashboards**: Query recent quality metrics
- **Trend Analysis**: Track quality over time
- **Alerting**: Lambda can query DynamoDB for threshold violations
## C. Schema Registry
```terraform
resource "aws_dynamodb_table" "schema_registry" {
  name         = "ohpen-schema-registry"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "schema_version"

  attribute {
    name = "dataset_name"
    type = "S"
  }

  attribute {
    name = "schema_version"
    type = "S"
  }
}
```
## Benefits
- **Fast Queries**: Sub-millisecond latency for metadata lookups
- **Scalable**: Auto-scales with usage
- **Integration**: Easy to query from Lambda, Glue, or application code
---
### 5. **Amazon Aurora** - Operational & Reference Data
#### Current Gap
- No relational database for operational data, reference data, or transactional workloads
#### Opportunities
## A. Reference Data Management
```terraform
resource "aws_rds_cluster" "ohpen_reference_data" {
cluster_identifier = "ohpen-reference-data"
engine = "aurora-postgresql"
engine_version = "15.3"
database_name = "ohpen_ref"
master_username = "ohpen_admin"
master_password = var.db_password
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
db_subnet_group_name = aws_db_subnet_group.ohpen.name
vpc_security_group_ids = [aws_security_group.rds.id]
skip_final_snapshot = true
}
Use Cases (4)
- Currency Exchange Rates: Store FX rates for multi-currency reporting
- Customer Master Data: Reference data for customer IDs
- Product/Account Catalogs: Mortgage products, account types
- Data Quality Rules: Store validation rules, allowlists, thresholds
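How FX reference data would be used in multi-currency reporting can be sketched as follows. The rates, pair keys, and function name are hypothetical stand-ins for rows read from the `ohpen_ref` database; `Decimal` with banker's rounding is used because float arithmetic is unsuitable for financial amounts:

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Hypothetical FX reference rows, as they might be loaded from ohpen_ref
FX_RATES = {
    ("EUR", "USD"): Decimal("1.0850"),
    ("GBP", "EUR"): Decimal("1.1720"),
}


def convert(amount, from_ccy, to_ccy, rates=FX_RATES):
    """Convert a transaction amount via reference FX rates, to 2 decimals."""
    if from_ccy == to_ccy:
        return amount
    rate = rates[(from_ccy, to_ccy)]
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
```

The same lookup works whether the rates come from Aurora (operational path) or a Silver-layer reference table queried through Athena (analytics path).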
## B. Operational Metadata

- **ETL Job Status**: Track job execution history (alternative to DynamoDB for complex queries)
- **Data Lineage**: Store relationships between datasets
- **Audit Logs**: Long-term storage of operational events

## C. Integration with Task 3 SQL

- Currently the Task 3 SQL is designed for Athena (S3-based)
- Could also support Aurora for operational queries (faster for small datasets)
- Hybrid approach: S3/Athena for analytics, Aurora for operational queries

## Benefits

- **ACID Compliance**: Critical for financial data integrity
- **Performance**: Fast queries for reference data lookups
- **Familiar SQL**: PostgreSQL-compatible, easy for analysts
- **High Availability**: Multi-AZ deployment for production
## Recommended Implementation Priority

### Phase 1: Critical for Athena (High Priority)

- ✅ **Glue Data Catalog Tables** - Required for Athena to work
- ✅ **Athena Workgroup** - Enable SQL queries
- ✅ **Glue Crawlers** - Auto-discover partitions and schema

### Phase 2: Event-Driven Automation (Medium Priority)

- ✅ **Lambda for S3 Triggers** - Auto-trigger ETL on file arrival
- ✅ **Lambda for Validation/Promotion** - Automate post-ETL steps

### Phase 3: Enhanced Metadata & Operations (Medium Priority)

- ✅ **DynamoDB for Run Metadata** - Track ETL runs, quality metrics
- ✅ **DynamoDB for Schema Registry** - Version control for schemas

### Phase 4: Reference Data & Operational DB (Lower Priority)

- ✅ **Aurora for Reference Data** - FX rates, customer master data
- ✅ **Aurora for Operational Queries** - Fast queries for small datasets
## Summary: Current vs. Recommended State

| Service | Current | Recommended | Priority |
| --------- | --------- | ------------- | ---------- |
| S3 | ✅ 4 buckets | ✅ Enhanced with lifecycle policies, versioning | ✅ Done |
| Glue | ✅ Basic job | ✅ + Catalog tables, Crawlers, Workflows | 🔴 High |
| Athena | 📝 Mentioned only | ✅ Workgroup, result bucket, catalog integration | 🔴 High |
| Lambda | ❌ Not used | ✅ S3 triggers, validation, automation | 🟡 Medium |
| DynamoDB | ❌ Not used | ✅ Run metadata, quality metrics, schema registry | 🟡 Medium |
| Aurora | ❌ Not used | ✅ Reference data, operational queries | 🟢 Low |
| CloudWatch | ✅ Alarms | ✅ Enhanced with custom metrics, dashboards | ✅ Done |
## Next Steps

1. **Update Terraform** (`infra/terraform/main.tf`):
   - Add Glue Catalog database and tables
   - Add Athena workgroup
   - Add Lambda functions for S3 triggers
   - Add DynamoDB tables for metadata
2. **Update ETL Code**:
   - Add Lambda handler for S3 event processing
   - Add DynamoDB writes for run metadata
   - Add Glue Catalog table updates after ETL
3. **Update Documentation**:
   - Document new AWS services in `architecture.md`
   - Update diagrams to show Lambda, DynamoDB, Aurora
   - Add operational runbooks for new services