# AWS Services Analysis & Opportunities

Based on the Ohpen Data Engineer job posting requirements:

"Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization."

## AWS Services Usage Overview

### Current AWS Service Usage

#### ✅ Currently Implemented

| Service | Usage | Location |
| --- | --- | --- |
| S3 | ✅ Primary storage for Bronze/Silver/Gold layers | `infra/terraform/main.tf` - 4 buckets (raw, processed, quarantine, artifacts) |
| Glue | ✅ ETL job definition (Python Shell) | `infra/terraform/main.tf` - `aws_glue_job.transaction_etl` |
| CloudWatch | ✅ Monitoring & alarms | `infra/terraform/main.tf` - `aws_cloudwatch_metric_alarm` |
| Athena | 📝 Mentioned in architecture/docs | Referenced in architecture.md, HANDOUT.md but not in Terraform |

#### ❌ Not Currently Implemented

| Service | Status | Opportunity |
| --- | --- | --- |
| Lambda | ❌ Not used | High-value opportunities (see below) |
| DynamoDB | ❌ Not used | Metadata & state management |
| Aurora | ❌ Not used | Operational data & reference data |


## Opportunities to Leverage More AWS Services

### 1. **AWS Lambda** - Event-Driven Processing & Automation

#### Current Gap

- ETL runs are scheduled/manual (no event-driven triggers)
- No serverless automation for common tasks

#### Opportunities

##### A. S3 Event-Driven Ingestion


```terraform
# Trigger ETL when new CSV files arrive in the Bronze layer
resource "aws_lambda_function" "s3_trigger_etl" {
  function_name = "ohpen-s3-trigger-etl"
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.11"
  role          = aws_iam_role.lambda_trigger_role.arn

  environment {
    variables = {
      GLUE_JOB_NAME = aws_glue_job.transaction_etl.name
    }
  }
}

resource "aws_s3_bucket_notification" "bronze_ingestion" {
  bucket = aws_s3_bucket.data_lake_buckets["ohpen-raw"].id

  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_trigger_etl.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "bronze/mortgages/transactions/"
    filter_suffix       = ".csv"
  }
}
```

##### Use Cases

- **Auto-trigger ETL** when new CSV files land in S3, replacing scheduled runs (see the handler sketch below)
- **File validation** before triggering ETL (check file size, format)
- **Metadata extraction** (extract file metadata, update catalog)
- **Quarantine notification** (send alerts when quarantine rate exceeds threshold)
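
For the handler itself, a minimal sketch is shown below. It assumes the `GLUE_JOB_NAME` environment variable from the Terraform above; the `--input_path` job argument and the size threshold are illustrative, not part of the existing Glue job:

```python
import os
import urllib.parse

import boto3

glue = boto3.client("glue")

MIN_FILE_SIZE_BYTES = 1  # illustrative pre-flight threshold


def lambda_handler(event, context):
    """Start the Glue ETL job for each new CSV object in the Bronze prefix."""
    started_runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # Cheap validation before launching the comparatively expensive ETL
        if not key.endswith(".csv") or size < MIN_FILE_SIZE_BYTES:
            print(f"Skipping {key}: failed pre-flight validation")
            continue

        response = glue.start_job_run(
            JobName=os.environ["GLUE_JOB_NAME"],
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # hypothetical job argument
        )
        started_runs.append(response["JobRunId"])

    return {"started_job_runs": started_runs}
```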

##### B. Post-ETL Validation & Promotion

```terraform
# Lambda to validate ETL output and promote to current/
resource "aws_lambda_function" "validate_and_promote" {
  function_name = "ohpen-validate-promote"
  # Validates _SUCCESS marker, row counts, schema compliance
  # Updates _LATEST.json and copies to current/ prefix
}
```
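
One possible shape for that handler, as a sketch only: it assumes run output lands under a run-scoped prefix with a `_SUCCESS` marker and is promoted to a `current/` prefix tracked by `_LATEST.json`; the event field names are illustrative, and the row-count/schema checks are omitted for brevity:

```python
import json

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Validate a finished ETL run and promote its output to current/."""
    bucket = event["bucket"]
    run_prefix = event["run_prefix"]          # e.g. "silver/mortgages/transactions/run_id=.../"
    current_prefix = event["current_prefix"]  # e.g. "silver/mortgages/transactions/current/"

    # 1. Require the _SUCCESS marker written by the ETL job (raises if missing)
    s3.head_object(Bucket=bucket, Key=f"{run_prefix}_SUCCESS")

    # 2. Copy the run output under the current/ prefix
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=run_prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                Key=current_prefix + obj["Key"][len(run_prefix):],
            )

    # 3. Record which run is now current
    s3.put_object(
        Bucket=bucket,
        Key=f"{current_prefix}_LATEST.json",
        Body=json.dumps({"promoted_prefix": run_prefix}),
        ContentType="application/json",
    )
    return {"promoted": run_prefix}
```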

##### C. Data Quality Monitoring

```terraform
# Lambda to analyze quarantine patterns and generate reports
resource "aws_lambda_function" "quarantine_analyzer" {
  function_name = "ohpen-quarantine-analyzer"
  # Runs after each ETL, analyzes quarantine reasons
  # Updates DynamoDB with quality metrics
}
```
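
A rough sketch of the analyzer, assuming quarantine output lands under a run-scoped prefix and that metrics go to the `ohpen-data-quality` DynamoDB table proposed in section 4 below; the event fields and the object-count proxy for quarantine volume are illustrative:

```python
import boto3

s3 = boto3.client("s3")
quality_table = boto3.resource("dynamodb").Table("ohpen-data-quality")


def lambda_handler(event, context):
    """Summarise quarantine output for one run and persist the metric."""
    bucket = event["quarantine_bucket"]
    prefix = event["run_prefix"]  # e.g. "quarantine/mortgages/transactions/run_id=.../"

    quarantined_objects = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        quarantined_objects += len(page.get("Contents", []))

    quality_table.put_item(
        Item={
            "dataset_name": event["dataset_name"],  # hash key
            "metric_date": event["run_date"],       # range key, e.g. "2024-03-31"
            "quarantined_objects": quarantined_objects,
        }
    )
    return {"quarantined_objects": quarantined_objects}
```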

#### Benefits

- **Cost-effective**: Pay per invocation, no idle costs
- **Scalable**: Auto-scales with S3 event volume
- **Event-driven**: Real-time processing instead of polling
- **Integration**: Easy integration with S3, Glue, Step Functions

---

### 2. **AWS Glue** - Enhanced Usage Beyond Basic ETL

#### Current Usage

- ✅ Basic Glue Job defined (Python Shell)
- ❌ Missing: Glue Data Catalog tables, Glue Crawlers, Glue Workflows

#### Opportunities

##### A. Glue Data Catalog Tables

```terraform
resource "aws_glue_catalog_database" "ohpen_data_lake" {
  name = "ohpen_data_lake"
}

resource "aws_glue_catalog_table" "silver_transactions" {
  name          = "silver_transactions"
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  table_type    = "EXTERNAL_TABLE" # required for Athena to query the table

  storage_descriptor {
    location      = "s3://ohpen-processed/silver/mortgages/transactions/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "transaction_id"
      type = "string"
    }
    columns {
      name = "customer_id"
      type = "string"
    }
    columns {
      name = "transaction_amount"
      type = "decimal(18,2)"
    }
    columns {
      name = "currency"
      type = "string"
    }
    columns {
      name = "transaction_timestamp"
      type = "timestamp"
    }
  }

  partition_keys {
    name = "event_year"
    type = "int"
  }
  partition_keys {
    name = "event_month"
    type = "string"
  }
}
```

##### B. Glue Crawlers

```terraform
resource "aws_glue_crawler" "silver_crawler" {
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  name          = "ohpen-silver-crawler"
  role          = aws_iam_role.glue_crawler_role.arn

  s3_target {
    path = "s3://ohpen-processed/silver/mortgages/transactions/"
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }
}
```

##### C. Glue Workflows

```terraform
resource "aws_glue_workflow" "transaction_pipeline" {
  name = "ohpen-transaction-pipeline"
  # Orchestrates: Crawl → ETL → Validate → Promote
}
```
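
In practice the workflow steps would be connected with `aws_glue_trigger` resources. As a hedged sketch of how automation code (for example the S3-trigger Lambda above) could start and monitor such a workflow via boto3, with an arbitrary polling interval:

```python
import time

import boto3

glue = boto3.client("glue")


def run_transaction_pipeline(workflow_name: str = "ohpen-transaction-pipeline") -> str:
    """Start the Glue workflow and poll until it reaches a terminal state."""
    run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

    while True:
        run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        if run["Status"] in ("COMPLETED", "STOPPED", "ERROR"):
            return run["Status"]
        time.sleep(30)  # arbitrary polling interval
```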

#### Benefits

- **Athena Integration**: Glue Catalog is required for Athena queries
- **Schema Discovery**: Crawlers auto-detect partitions and schema changes
- **Workflow Orchestration**: Coordinate multiple Glue jobs

---

### 3. **Amazon Athena** - SQL Analytics (Currently Only Referenced)

#### Current Gap

- ✅ Mentioned in architecture/docs as target for SQL queries
- ❌ **No Terraform resources** (no workgroups, no query result buckets)
- ❌ **No Glue Catalog tables** (Athena can't query without catalog)

#### Opportunities

##### A. Athena Workgroup & Configuration

```terraform
resource "aws_athena_workgroup" "ohpen_analytics" {
  name = "ohpen-analytics"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://ohpen-artifacts/athena-results/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }

    engine_version {
      selected_engine_version = "Athena engine version 3"
    }
  }
}
```

##### B. Query Result Bucket

```terraform
# Already have ohpen-artifacts bucket, but should add explicit Athena results path
# s3://ohpen-artifacts/athena-results/
```

##### C. Integration with Task 3 SQL

- The SQL query in `tasks/03_sql/balance_history_2024_q1.sql` is designed for Athena
- Needs Glue Catalog tables (see the Glue section above) to work
- Add Athena workgroup configuration for cost optimization (a boto3 sketch of running such a query through the workgroup follows below)
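
As a sketch of how a query would run through the proposed workgroup and catalog via boto3; the query shown is illustrative (it is not the contents of `balance_history_2024_q1.sql`), but it demonstrates partition pruning on `event_year`/`event_month`:

```python
import time

import boto3

athena = boto3.client("athena")

# Illustrative query: filtering on the partition keys keeps scanned data small
QUERY = """
SELECT customer_id, SUM(transaction_amount) AS total_amount
FROM ohpen_data_lake.silver_transactions
WHERE event_year = 2024 AND event_month IN ('01', '02', '03')
GROUP BY customer_id
"""


def run_query() -> list:
    query_id = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "ohpen_data_lake"},
        WorkGroup="ohpen-analytics",
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```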

#### Benefits

- **Serverless SQL**: No infrastructure to manage
- **Cost-effective**: Pay per query, partition pruning reduces costs
- **Integration**: Works seamlessly with Glue Catalog and S3

---

### 4. **Amazon DynamoDB** - Metadata & State Management

#### Current Gap

- No metadata store for run tracking, schema versions, or operational state
- Currently relies on S3 `_LATEST.json` files (not queryable)

#### Opportunities

##### A. ETL Run Metadata Table

resource "aws_dynamodb_table" "etl_runs" {
name = "ohpen-etl-runs"
billing_mode = "PAY_PER_REQUEST"
hash_key = "run_id"
range_key = "dataset_name"

attribute {
name = "run_id"
type = "S"
}
attribute {
name = "dataset_name"
type = "S"
}

ttl {
attribute_name = "ttl"
enabled = true
}
}

```text

##### Use Cases

- **Run Tracking**: Store run_id, status, metrics, timestamps (see the sketch below)
- **Schema Version Registry**: Track schema versions per dataset
- **Quarantine Metrics**: Store quarantine statistics per run
- **Lineage**: Track which raw files produced which processed outputs
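
A minimal sketch of recording a run from the ETL job or a Lambda, assuming the table above; all non-key attributes (status, counters, timestamps, the 90-day TTL) are illustrative:

```python
import time

import boto3

runs_table = boto3.resource("dynamodb").Table("ohpen-etl-runs")


def record_run(run_id: str, dataset_name: str, status: str,
               rows_written: int, rows_quarantined: int) -> None:
    """Write one ETL run record to the run-metadata table."""
    now = int(time.time())
    runs_table.put_item(
        Item={
            "run_id": run_id,              # hash key
            "dataset_name": dataset_name,  # range key
            "status": status,              # e.g. "SUCCEEDED" / "FAILED"
            "rows_written": rows_written,
            "rows_quarantined": rows_quarantined,
            "completed_at": now,
            "ttl": now + 90 * 24 * 3600,   # expire after ~90 days via the TTL attribute
        }
    )
```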

##### B. Data Quality Metrics Table

```terraform
resource "aws_dynamodb_table" "data_quality_metrics" {
  name         = "ohpen-data-quality"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "metric_date"

  attribute {
    name = "dataset_name"
    type = "S"
  }
  attribute {
    name = "metric_date"
    type = "S"
  }
}
```

##### Use Cases

- **Quality Dashboards**: Query recent quality metrics
- **Trend Analysis**: Track quality over time
- **Alerting**: A Lambda can query DynamoDB for threshold violations (sketched below)
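
For the alerting case, a hedged sketch of the threshold check; the `quarantine_rate` attribute and the 5% threshold are illustrative, not part of the table definition above:

```python
import boto3
from boto3.dynamodb.conditions import Key

quality_table = boto3.resource("dynamodb").Table("ohpen-data-quality")


def quarantine_breaches(dataset_name: str, since_date: str, threshold: float = 0.05) -> list:
    """Return recent metric items whose quarantine rate exceeds the threshold."""
    response = quality_table.query(
        KeyConditionExpression=Key("dataset_name").eq(dataset_name)
        & Key("metric_date").gte(since_date)
    )
    return [item for item in response["Items"]
            if float(item.get("quarantine_rate", 0)) > threshold]
```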

##### C. Schema Registry

```terraform
resource "aws_dynamodb_table" "schema_registry" {
  name         = "ohpen-schema-registry"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "schema_version"

  attribute {
    name = "dataset_name"
    type = "S"
  }
  attribute {
    name = "schema_version"
    type = "S"
  }
}
```
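
A sketch of how the ETL could look up the latest registered schema, assuming `schema_version` values sort lexicographically (e.g. zero-padded "0001", "0002"):

```python
import boto3
from boto3.dynamodb.conditions import Key

schema_table = boto3.resource("dynamodb").Table("ohpen-schema-registry")


def latest_schema(dataset_name: str) -> dict | None:
    """Fetch the most recent schema version registered for a dataset."""
    response = schema_table.query(
        KeyConditionExpression=Key("dataset_name").eq(dataset_name),
        ScanIndexForward=False,  # descending by range key
        Limit=1,
    )
    items = response["Items"]
    return items[0] if items else None
```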

#### Benefits

- **Fast Queries**: Sub-millisecond latency for metadata lookups
- **Scalable**: Auto-scales with usage
- **Integration**: Easy to query from Lambda, Glue, or application code

---

### 5. **Amazon Aurora** - Operational & Reference Data

#### Current Gap

- No relational database for operational data, reference data, or transactional workloads

#### Opportunities

##### A. Reference Data Management

```terraform
resource "aws_rds_cluster" "ohpen_reference_data" {
  cluster_identifier      = "ohpen-reference-data"
  engine                  = "aurora-postgresql"
  engine_version          = "15.3"
  database_name           = "ohpen_ref"
  master_username         = "ohpen_admin"
  master_password         = var.db_password
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"
  db_subnet_group_name    = aws_db_subnet_group.ohpen.name
  vpc_security_group_ids  = [aws_security_group.rds.id]
  skip_final_snapshot     = true
}
```

##### Use Cases

- **Currency Exchange Rates**: Store FX rates for multi-currency reporting (see the lookup sketch below)
- **Customer Master Data**: Reference data for customer IDs
- **Product/Account Catalogs**: Mortgage products, account types
- **Data Quality Rules**: Store validation rules, allowlists, thresholds
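
For the FX-rate case, a sketch of how the Glue Python Shell job could read reference data from the Aurora cluster, assuming credentials live in Secrets Manager and a PostgreSQL driver (psycopg2 here) is bundled with the job; the secret name, table, and column names are illustrative:

```python
import json

import boto3
import psycopg2  # assumes a PostgreSQL driver is packaged with the Glue job


def fx_rates_for(rate_date: str) -> dict:
    """Return {currency: rate} from the Aurora reference database for one date."""
    secret = json.loads(
        boto3.client("secretsmanager")
        .get_secret_value(SecretId="ohpen/reference-db")["SecretString"]
    )
    conn = psycopg2.connect(
        host=secret["host"],
        dbname="ohpen_ref",
        user=secret["username"],
        password=secret["password"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT currency, rate_to_eur FROM fx_rates WHERE rate_date = %s",
                (rate_date,),
            )
            return {currency: rate for currency, rate in cur.fetchall()}
    finally:
        conn.close()
```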

##### B. Operational Metadata

- **ETL Job Status**: Track job execution history (alternative to DynamoDB for complex queries)
- **Data Lineage**: Store relationships between datasets
- **Audit Logs**: Long-term storage of operational events

##### C. Integration with Task 3 SQL

- Currently the Task 3 SQL is designed for Athena (S3-based)
- Aurora could also serve operational queries (faster for small datasets)
- Hybrid approach: S3/Athena for analytics, Aurora for operational queries

#### Benefits

- **ACID Compliance**: Critical for financial data integrity
- **Performance**: Fast queries for reference data lookups
- **Familiar SQL**: PostgreSQL-compatible, easy for analysts
- **High Availability**: Multi-AZ deployment for production

---

## Implementation Roadmap

### Phase 1: Critical for Athena (High Priority)

1. **Glue Data Catalog Tables** - Required for Athena to work
2. **Athena Workgroup** - Enable SQL queries
3. **Glue Crawlers** - Auto-discover partitions and schema

### Phase 2: Event-Driven Automation (Medium Priority)

1. **Lambda for S3 Triggers** - Auto-trigger ETL on file arrival
2. **Lambda for Validation/Promotion** - Automate post-ETL steps

### Phase 3: Enhanced Metadata & Operations (Medium Priority)

1. **DynamoDB for Run Metadata** - Track ETL runs, quality metrics
2. **DynamoDB for Schema Registry** - Version control for schemas

### Phase 4: Reference Data & Operational DB (Lower Priority)

1. **Aurora for Reference Data** - FX rates, customer master data
2. **Aurora for Operational Queries** - Fast queries for small datasets

## Service Summary

| Service | Current | Recommended | Priority |
| --- | --- | --- | --- |
| S3 | ✅ 4 buckets | ✅ Enhanced with lifecycle policies, versioning | ✅ Done |
| Glue | ✅ Basic job | ✅ + Catalog tables, Crawlers, Workflows | 🔴 High |
| Athena | 📝 Mentioned only | ✅ Workgroup, result bucket, catalog integration | 🔴 High |
| Lambda | ❌ Not used | ✅ S3 triggers, validation, automation | 🟡 Medium |
| DynamoDB | ❌ Not used | ✅ Run metadata, quality metrics, schema registry | 🟡 Medium |
| Aurora | ❌ Not used | ✅ Reference data, operational queries | 🟢 Low |
| CloudWatch | ✅ Alarms | ✅ Enhanced with custom metrics, dashboards | ✅ Done |


## Next Steps

1. **Update Terraform** (`infra/terraform/main.tf`):
   - Add Glue Catalog database and tables
   - Add Athena workgroup
   - Add Lambda functions for S3 triggers
   - Add DynamoDB tables for metadata
2. **Update ETL Code**:
   - Add Lambda handler for S3 event processing
   - Add DynamoDB writes for run metadata
   - Add Glue Catalog table updates after ETL
3. **Update Documentation**:
   - Document new AWS services in architecture.md
   - Update diagrams to show Lambda, DynamoDB, Aurora
   - Add operational runbooks for new services