# AWS Services Analysis & Opportunities

Based on the Ohpen Data Engineer job posting requirements:

> "Leverage AWS services such as S3, Lambda, Glue, Athena, DynamoDB and Aurora in implementation and optimization."
## AWS Services Usage Overview

### Current AWS Service Usage

#### ✅ Currently Implemented

| Service | Usage | Location |
| --------- | ------- | ---------- |
| S3 | ✅ Primary storage for Bronze/Silver/Gold layers | `infra/terraform/main.tf` - 4 buckets (raw, processed, quarantine, artifacts) |
| Glue | ✅ ETL job definition (Python Shell) | `infra/terraform/main.tf` - `aws_glue_job.transaction_etl` |
| CloudWatch | ✅ Monitoring & alarms | `infra/terraform/main.tf` - `aws_cloudwatch_metric_alarm` |
| Athena | 📝 Mentioned in architecture/docs | Referenced in `architecture.md`, `HANDOUT.md` but not in Terraform |
#### ❌ Not Currently Implemented

| Service | Status | Opportunity |
| --------- | -------- | ------------- |
| Lambda | ❌ Not used | High-value opportunities (see below) |
| DynamoDB | ❌ Not used | Metadata & state management |
| Aurora | ❌ Not used | Operational data & reference data |
## Opportunities to Leverage More AWS Services

### 1. **AWS Lambda** - Event-Driven Processing & Automation

#### Current Gap

- ETL runs are scheduled/manual (no event-driven triggers)
- No serverless automation for common tasks

#### Opportunities

## A. S3 Event-Driven Ingestion

```terraform
# Trigger ETL when new CSV files arrive in the Bronze layer
resource "aws_lambda_function" "s3_trigger_etl" {
  function_name = "ohpen-s3-trigger-etl"
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.11"
  role          = aws_iam_role.lambda_trigger_role.arn
  filename      = "build/s3_trigger_etl.zip" # illustrative deployment package path

  environment {
    variables = {
      GLUE_JOB_NAME = aws_glue_job.transaction_etl.name
    }
  }
}

# S3 must be granted permission to invoke the function, or the
# bucket notification below fails to create
resource "aws_lambda_permission" "allow_s3_invoke" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.s3_trigger_etl.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.data_lake_buckets["ohpen-raw"].arn
}

resource "aws_s3_bucket_notification" "bronze_ingestion" {
  bucket = aws_s3_bucket.data_lake_buckets["ohpen-raw"].id

  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_trigger_etl.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "bronze/mortgages/transactions/"
    filter_suffix       = ".csv"
  }

  depends_on = [aws_lambda_permission.allow_s3_invoke]
}
```
## Use Cases
- **Auto-trigger ETL** when new CSV files land in S3 (replaces scheduled runs)
- **File validation** before triggering ETL (check file size, format)
- **Metadata extraction** (extract file metadata, update catalog)
- **Quarantine notification** (send alerts when quarantine rate exceeds threshold)
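The handler behind `lambda_function.lambda_handler` could look like the sketch below. It is illustrative, not the project's actual code: the `--input_path` Glue argument and helper names are assumptions, and only `GLUE_JOB_NAME` comes from the Terraform above. boto3 is imported lazily so the pure helpers stay unit-testable without AWS.

```python
import json
import os
import urllib.parse


def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((s3["bucket"]["name"], key))
    return pairs


def should_process(key, prefix="bronze/mortgages/transactions/", suffix=".csv"):
    """Defensive re-check of the S3 notification filter inside the function."""
    return key.startswith(prefix) and key.endswith(suffix)


def lambda_handler(event, context):
    import boto3  # lazy import: helpers above run without AWS credentials

    glue = boto3.client("glue")
    started = []
    for bucket, key in parse_s3_event(event):
        if not should_process(key):
            continue
        response = glue.start_job_run(
            JobName=os.environ["GLUE_JOB_NAME"],
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # assumed job argument
        )
        started.append(response["JobRunId"])
    return {"statusCode": 200, "body": json.dumps({"job_runs": started})}
```

Re-checking the prefix/suffix in code is deliberate: the S3 notification filter is the primary gate, but it keeps the function safe if it is ever wired to a broader event source.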
## B. Post-ETL Validation & Promotion
```terraform
# Lambda to validate ETL output and promote to current/
resource "aws_lambda_function" "validate_and_promote" {
  function_name = "ohpen-validate-promote"
  # Validates _SUCCESS marker, row counts, schema compliance
  # Updates _LATEST.json and copies to current/ prefix
}
```
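The validation logic that stub describes could be sketched as follows. This is an assumption about shape, not the project's implementation: the helper name, the key-listing convention, and the thresholds are all illustrative.

```python
def validate_run(keys, row_count, expected_min_rows):
    """Check an ETL output prefix before promotion to current/.

    `keys` is the list of object keys under the run prefix (e.g. collected
    from a paginated S3 list call); checks mirror the Terraform stub above.
    Returns a list of failure descriptions; an empty list means promotable.
    """
    failures = []
    if not any(k.endswith("_SUCCESS") for k in keys):
        failures.append("missing _SUCCESS marker")
    if not any(k.endswith(".parquet") for k in keys):
        failures.append("no Parquet output files found")
    if row_count < expected_min_rows:
        failures.append(f"row count {row_count} below minimum {expected_min_rows}")
    return failures
```

Only when the returned list is empty would the Lambda update `_LATEST.json` and copy the run's files to the `current/` prefix.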
## C. Data Quality Monitoring
```terraform
# Lambda to analyze quarantine patterns and generate reports
resource "aws_lambda_function" "quarantine_analyzer" {
  function_name = "ohpen-quarantine-analyzer"
  # Runs after each ETL, analyzes quarantine reasons
  # Updates DynamoDB with quality metrics
}
```
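The core of such an analyzer is a small aggregation; a sketch under the assumption that quarantined records carry a `quarantine_reason` field (that field name and the function name are illustrative):

```python
from collections import Counter


def quarantine_summary(quarantined_records, total_rows):
    """Aggregate quarantine reasons and compute the quarantine rate for a run.

    The resulting dict is what the Lambda would write to DynamoDB as the
    run's quality metrics.
    """
    reasons = Counter(r["quarantine_reason"] for r in quarantined_records)
    rate = len(quarantined_records) / total_rows if total_rows else 0.0
    return {"quarantine_rate": rate, "by_reason": dict(reasons)}
```

Keeping the per-reason breakdown (not just the rate) is what makes threshold alerting and trend analysis possible later.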
## Benefits
- **Cost-effective**: Pay per invocation, no idle costs
- **Scalable**: Auto-scales with S3 event volume
- **Event-driven**: Real-time processing instead of polling
- **Integration**: Easy integration with S3, Glue, Step Functions
---
### 2. **AWS Glue** - Enhanced Usage Beyond Basic ETL
#### Current Usage
- ✅ Basic Glue Job defined (Python Shell)
- ❌ Missing: Glue Data Catalog tables, Glue Crawlers, Glue Workflows
#### Opportunities
## A. Glue Data Catalog Tables
```terraform
resource "aws_glue_catalog_database" "ohpen_data_lake" {
  name = "ohpen_data_lake"
}

resource "aws_glue_catalog_table" "silver_transactions" {
  name          = "silver_transactions"
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  table_type    = "EXTERNAL_TABLE" # required for Athena to query the table

  storage_descriptor {
    location      = "s3://ohpen-processed/silver/mortgages/transactions/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "transaction_id"
      type = "string"
    }
    columns {
      name = "customer_id"
      type = "string"
    }
    columns {
      name = "transaction_amount"
      type = "decimal(18,2)"
    }
    columns {
      name = "currency"
      type = "string"
    }
    columns {
      name = "transaction_timestamp"
      type = "timestamp"
    }
  }

  partition_keys {
    name = "event_year"
    type = "int"
  }
  partition_keys {
    name = "event_month"
    type = "string"
  }
}
```
## B. Glue Crawlers
```terraform
resource "aws_glue_crawler" "silver_crawler" {
  database_name = aws_glue_catalog_database.ohpen_data_lake.name
  name          = "ohpen-silver-crawler"
  role          = aws_iam_role.glue_crawler_role.arn

  s3_target {
    path = "s3://ohpen-processed/silver/mortgages/transactions/"
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }
}
```
## C. Glue Workflows
```terraform
resource "aws_glue_workflow" "transaction_pipeline" {
  name = "ohpen-transaction-pipeline"
  # Orchestrates: Crawl → ETL → Validate → Promote
}
```
## Benefits
- **Athena Integration**: Glue Catalog is required for Athena queries
- **Schema Discovery**: Crawlers auto-detect partitions and schema changes
- **Workflow Orchestration**: Coordinate multiple Glue jobs
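As an alternative to waiting for a crawler run, the ETL (or a Lambda) can register new partitions directly after each write. A sketch using boto3's Glue `create_partition`; the helper names are illustrative, and a production call would typically copy the table's full StorageDescriptor (formats, serde) rather than only the location:

```python
def partition_input(base_location, event_year, event_month):
    """Build the PartitionInput for the event_year/event_month scheme above."""
    return {
        "Values": [str(event_year), event_month],
        "StorageDescriptor": {
            # minimal sketch: only the partition location is set here
            "Location": f"{base_location}event_year={event_year}/event_month={event_month}/",
        },
    }


def register_partition(database, table, base_location, event_year, event_month):
    import boto3  # lazy import keeps partition_input unit-testable without AWS

    glue = boto3.client("glue")
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput=partition_input(base_location, event_year, event_month),
    )
```

Direct registration makes new partitions queryable in Athena immediately, while the crawler remains useful as a safety net for schema drift.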
---
### 3. **Amazon Athena** - SQL Analytics (Currently Only Referenced)
#### Current Gap
- ✅ Mentioned in architecture/docs as target for SQL queries
- ❌ **No Terraform resources** (no workgroups, no query result buckets)
- ❌ **No Glue Catalog tables** (Athena can't query without catalog)
#### Opportunities
## A. Athena Workgroup & Configuration
```terraform
resource "aws_athena_workgroup" "ohpen_analytics" {
  name = "ohpen-analytics"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://ohpen-artifacts/athena-results/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }

    engine_version {
      selected_engine_version = "Athena engine version 3"
    }
  }
}
```
## B. Query Result Bucket
```terraform
# Already have ohpen-artifacts bucket, but should add explicit Athena results path
# s3://ohpen-artifacts/athena-results/
C. Integration with Task 3 SQL
- The SQL query in
tasks/03_sql/balance_history_2024_q1.sqlis designed for Athena - Need Glue Catalog tables (see Glue section above) to make it work
- Add Athena workgroup configuration for cost optimization
## Benefits

- **Serverless SQL**: No infrastructure to manage
- **Cost-effective**: Pay per query, partition pruning reduces costs
- **Integration**: Works seamlessly with Glue Catalog and S3
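Pinning queries to the workgroup is what makes the result location and encryption settings above actually apply. A sketch using boto3's Athena `start_query_execution` (the helper names are illustrative; the workgroup and database names match the Terraform above):

```python
def query_params(sql, workgroup="ohpen-analytics", database="ohpen_data_lake"):
    """Build StartQueryExecution parameters pinned to the workgroup, so its
    enforced result location and encryption settings apply to the query."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
    }


def run_query(sql):
    import boto3  # lazy import keeps query_params unit-testable without AWS

    athena = boto3.client("athena")
    return athena.start_query_execution(**query_params(sql))["QueryExecutionId"]
```

Filtering on `event_year`/`event_month` in the SQL (e.g. `WHERE event_year = 2024 AND event_month = '01'`) is what lets Athena prune partitions and keep per-query cost down.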
---

### 4. **Amazon DynamoDB** - Metadata & State Management

#### Current Gap

- No metadata store for run tracking, schema versions, or operational state
- Currently relies on S3 `_LATEST.json` files (not queryable)

#### Opportunities

## A. ETL Run Metadata Table

```terraform
resource "aws_dynamodb_table" "etl_runs" {
  name         = "ohpen-etl-runs"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "run_id"
  range_key    = "dataset_name"

  attribute {
    name = "run_id"
    type = "S"
  }

  attribute {
    name = "dataset_name"
    type = "S"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }
}
```
## Use Cases
- **Run Tracking**: Store run_id, status, metrics, timestamps
- **Schema Version Registry**: Track schema versions per dataset
- **Quarantine Metrics**: Store quarantine statistics per run
- **Lineage**: Track which raw files produced which processed outputs
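Writing a run record to the `ohpen-etl-runs` table could look like this sketch (the item fields beyond the two key attributes, and the 90-day retention, are assumptions). Note that DynamoDB rejects Python floats, so numeric metric values should be `int` or `Decimal`:

```python
import time


def run_item(run_id, dataset_name, status, metrics, retention_days=90):
    """Build a DynamoDB item for the ohpen-etl-runs table.

    The `ttl` attribute is an epoch timestamp matching the table's TTL
    configuration, so old run records expire automatically.
    """
    return {
        "run_id": run_id,
        "dataset_name": dataset_name,
        "status": status,
        "metrics": metrics,  # values must be int/Decimal, not float
        "ttl": int(time.time()) + retention_days * 86400,
    }


def record_run(item, table_name="ohpen-etl-runs"):
    import boto3  # lazy import keeps run_item unit-testable without AWS

    boto3.resource("dynamodb").Table(table_name).put_item(Item=item)
```

With `run_id` as hash key and `dataset_name` as range key, one `put_item` per dataset per run gives cheap lookups of "what did run X do to dataset Y".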
## B. Data Quality Metrics Table
```terraform
resource "aws_dynamodb_table" "data_quality_metrics" {
  name         = "ohpen-data-quality"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "metric_date"

  attribute {
    name = "dataset_name"
    type = "S"
  }

  attribute {
    name = "metric_date"
    type = "S"
  }
}
```
## Use Cases
- **Quality Dashboards**: Query recent quality metrics
- **Trend Analysis**: Track quality over time
- **Alerting**: Lambda can query DynamoDB for threshold violations
## C. Schema Registry
```terraform
resource "aws_dynamodb_table" "schema_registry" {
  name         = "ohpen-schema-registry"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "dataset_name"
  range_key    = "schema_version"

  attribute {
    name = "dataset_name"
    type = "S"
  }

  attribute {
    name = "schema_version"
    type = "S"
  }
}
```
## Benefits
- **Fast Queries**: Sub-millisecond latency for metadata lookups
- **Scalable**: Auto-scales with usage
- **Integration**: Easy to query from Lambda, Glue, or application code
---
### 5. **Amazon Aurora** - Operational & Reference Data
#### Current Gap
- No relational database for operational data, reference data, or transactional workloads
#### Opportunities
## A. Reference Data Management
```terraform
resource "aws_rds_cluster" "ohpen_reference_data" {
cluster_identifier = "ohpen-reference-data"
engine = "aurora-postgresql"
engine_version = "15.3"
database_name = "ohpen_ref"
master_username = "ohpen_admin"
master_password = var.db_password
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
db_subnet_group_name = aws_db_subnet_group.ohpen.name
vpc_security_group_ids = [aws_security_group.rds.id]
skip_final_snapshot = true
}
Use Cases (4)
- Currency Exchange Rates: Store FX rates for multi-currency reporting
- Customer Master Data: Reference data for customer IDs
- Product/Account Catalogs: Mortgage products, account types
- Data Quality Rules: Store validation rules, allowlists, thresholds
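How FX reference data would be used in multi-currency reporting can be sketched as follows. The rates, pair keys, and function name are hypothetical stand-ins for rows read from the `ohpen_ref` database; `Decimal` with banker's rounding is used because float arithmetic is unsuitable for financial amounts:

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Hypothetical FX reference rows, as they might be loaded from ohpen_ref
FX_RATES = {
    ("EUR", "USD"): Decimal("1.0850"),
    ("GBP", "EUR"): Decimal("1.1720"),
}


def convert(amount, from_ccy, to_ccy, rates=FX_RATES):
    """Convert a transaction amount via reference FX rates, to 2 decimals."""
    if from_ccy == to_ccy:
        return amount
    rate = rates[(from_ccy, to_ccy)]
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
```

The same lookup works whether the rates come from Aurora (operational path) or a Silver-layer reference table queried through Athena (analytics path).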
## B. Operational Metadata

- **ETL Job Status**: Track job execution history (alternative to DynamoDB for complex queries)
- **Data Lineage**: Store relationships between datasets
- **Audit Logs**: Long-term storage of operational events

## C. Integration with Task 3 SQL

- Currently the Task 3 SQL is designed for Athena (S3-based)
- Could also support Aurora for operational queries (faster for small datasets)
- Hybrid approach: S3/Athena for analytics, Aurora for operational queries

## Benefits

- **ACID Compliance**: Critical for financial data integrity
- **Performance**: Fast queries for reference data lookups
- **Familiar SQL**: PostgreSQL-compatible, easy for analysts
- **High Availability**: Multi-AZ deployment for production
## Recommended Implementation Priority

### Phase 1: Critical for Athena (High Priority)

- ✅ **Glue Data Catalog Tables** - Required for Athena to work
- ✅ **Athena Workgroup** - Enable SQL queries
- ✅ **Glue Crawlers** - Auto-discover partitions and schema

### Phase 2: Event-Driven Automation (Medium Priority)

- ✅ **Lambda for S3 Triggers** - Auto-trigger ETL on file arrival
- ✅ **Lambda for Validation/Promotion** - Automate post-ETL steps

### Phase 3: Enhanced Metadata & Operations (Medium Priority)

- ✅ **DynamoDB for Run Metadata** - Track ETL runs, quality metrics
- ✅ **DynamoDB for Schema Registry** - Version control for schemas

### Phase 4: Reference Data & Operational DB (Lower Priority)

- ✅ **Aurora for Reference Data** - FX rates, customer master data
- ✅ **Aurora for Operational Queries** - Fast queries for small datasets
## Summary: Current vs. Recommended State

| Service | Current | Recommended | Priority |
| --------- | --------- | ------------- | ---------- |
| S3 | ✅ 4 buckets | ✅ Enhanced with lifecycle policies, versioning | ✅ Done |
| Glue | ✅ Basic job | ✅ + Catalog tables, Crawlers, Workflows | 🔴 High |
| Athena | 📝 Mentioned only | ✅ Workgroup, result bucket, catalog integration | 🔴 High |
| Lambda | ❌ Not used | ✅ S3 triggers, validation, automation | 🟡 Medium |
| DynamoDB | ❌ Not used | ✅ Run metadata, quality metrics, schema registry | 🟡 Medium |
| Aurora | ❌ Not used | ✅ Reference data, operational queries | 🟢 Low |
| CloudWatch | ✅ Alarms | ✅ Enhanced with custom metrics, dashboards | ✅ Done |
## Next Steps

1. **Update Terraform** (`infra/terraform/main.tf`):
   - Add Glue Catalog database and tables
   - Add Athena workgroup
   - Add Lambda functions for S3 triggers
   - Add DynamoDB tables for metadata
2. **Update ETL Code**:
   - Add Lambda handler for S3 event processing
   - Add DynamoDB writes for run metadata
   - Add Glue Catalog table updates after ETL
3. **Update Documentation**:
   - Document new AWS services in `architecture.md`
   - Update diagrams to show Lambda, DynamoDB, Aurora
   - Add operational runbooks for new services