
Task 2: Data Lake Architecture Design

Important Note on Scope​

⚠️ Scope Disclaimer: This architecture design is based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture presented here should be viewed as a starting point that would require:

  • Validation against actual data volumes and growth patterns
  • Alignment with existing infrastructure and operational practices
  • Refinement based on real-world compliance and regulatory requirements
  • Collaboration with the Ohpen team to understand production constraints

This design demonstrates architectural thinking and best practices, but would need significant refinement for production deployment.


1. Lake Architecture Overview​

The lake follows a Bronze / Silver / Gold medallion architecture model.

| Layer | Purpose | Mutability | Primary Consumers | Implementation Status |
| --- | --- | --- | --- | --- |
| Bronze | Raw audit trail | Immutable | Platform engineers, audit | ✅ Implemented (Task 1) |
| Silver | Validated analytics | Append-only | Analysts, data scientists | ✅ Implemented (Task 1) |
| Gold | Business contracts | Versioned | Finance, BI, stakeholders | Task 2: Structure Design, Task 3: SQL Aggregation |

Note:

  • Task 1 (ETL Pipeline): Implements the Bronze → Silver transformation only (does not write Gold layer files)
  • Task 2 (Architecture): Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned). The Gold layer structure, governance, and ownership model are described here (folder organization, promotion workflows, business metric definitions)
  • Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how Gold data is created)

Data Lifecycle State Diagram​

This state diagram visualizes the data lifecycle and state transitions through the medallion architecture:


2. Folder Structure (S3)​

We adopt a structured Data Lake organization to support scalability, reproducibility, and data quality.

Data Lake Structure Diagram​

πŸ“ STORAGE PERSPECTIVE: This diagram shows the static storage organizationβ€”how data is physically organized in S3 folders, what partitioning schemes are used, and what file formats are stored. Think of this as answering: "Where is the data stored and how is it organized?"

This simplified diagram shows the S3 folder structure with partitioning keys on the arrows:

Detailed folder structure examples: See Complete Folder Structure Examples in Appendix D.

Key Distinction: Storage vs. Flow​

  • Structure Diagram (above): Shows WHERE data lives: the S3 folder hierarchy, partition keys, file formats, and storage organization
  • Architecture Diagram (Section 7): Shows HOW data moves: the process flow from CSV ingestion through ETL transformations to final consumption

These two perspectives are complementary: the structure diagram answers storage questions (e.g., "Where do I find Bronze data from January 2024?"), while the architecture diagram answers process questions (e.g., "What happens to data after it's ingested?").

Key Path Patterns​

  • Bronze: bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz
  • Silver: silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet
  • Quarantine: quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
  • Condemned: quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet
  • Gold: gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet
  • Gold (current): gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
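
To make these patterns concrete, the sketch below assembles Bronze and Silver keys from run metadata. The helper names and the example domain, dataset, and run_id values are illustrative assumptions, not part of the Task 1 code.

```python
from datetime import date

# Hypothetical helpers that assemble S3 keys following the path patterns above.
# Names and example values are illustrative only.

def build_bronze_key(domain: str, dataset: str, ingest_date: date,
                     run_id: str, filename: str) -> str:
    """Bronze: partitioned by ingest_date (arrival time)."""
    return (
        f"bronze/{domain}/{dataset}/"
        f"ingest_date={ingest_date:%Y-%m-%d}/run_id={run_id}/{filename}"
    )

def build_silver_key(domain: str, dataset: str, year: int, month: int,
                     schema_version: int, run_id: str, part: str) -> str:
    """Silver: partitioned by business time (year/month from TransactionTimestamp)."""
    return (
        f"silver/{domain}/{dataset}/"
        f"year={year}/month={month:02d}/schema_v=v{schema_version}/run_id={run_id}/{part}"
    )

# Example usage (values are made up):
print(build_bronze_key("payments", "transactions", date(2024, 1, 15),
                       "20240115T0200Z", "transactions.csv.gz"))
print(build_silver_key("payments", "transactions", 2024, 1, 2,
                       "20240115T0200Z", "part-00000.parquet"))
```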

Key Design Choices​

  1. run_id Isolation: Every ETL run writes to a unique path. This prevents partial overwrites and allows safe backfills.
  2. Quarantine Layer: We never silently drop data. Bad rows are preserved for audit and debugging with metadata enrichment.
  3. Condemned Layer: Rows that exceed the maximum retry attempts, or that are exact duplicates, are moved to the condemned layer (no automatic retries). Human review and approval are required before reprocessing or deletion.
  4. Loop Prevention: TransactionID deduplication, metadata enrichment, duplicate detection, and attempt limit enforcement prevent infinite retry loops.
  5. Circuit Breaker: The pipeline halts automatically when more than 100 identical errors occur within a one-hour window (see the sketch after this list).
  6. Partitioning:
    • Bronze: Partitioned by ingest_date (when it arrived)
    • Silver: Partitioned by year/month (business-time, derived from TransactionTimestamp)
    • Gold: Partitioned by business dimensions (e.g., as_of_month)
  7. Schema Versioning: All Silver and Gold paths include schema_v={version} to support schema evolution.
  8. Object Store Safety: Use write-then-publish pattern with _SUCCESS markers; treat partition discovery as eventually consistent.
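
As referenced in design choice 5, the snippet below is a minimal in-process circuit-breaker sketch: it counts occurrences of the same error signature in a sliding one-hour window and signals a halt once the threshold is exceeded. The class name, the halt mechanism, and the simulated error stream are assumptions for illustration; a production pipeline would more likely emit CloudWatch metrics and let an alarm stop the job.

```python
import time
from collections import defaultdict, deque

# Illustrative circuit breaker (assumption: real pipelines would typically
# use CloudWatch metrics/alarms rather than in-process counting).
class CircuitBreaker:
    def __init__(self, threshold: int = 100, window_seconds: int = 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record_error(self, error_signature: str, now: float | None = None) -> bool:
        """Record one error; return True if the pipeline should halt."""
        now = time.time() if now is None else now
        events = self._events[error_signature]
        events.append(now)
        # Drop events that fell outside the one-hour window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold

# Usage: halt the run when the same validation error repeats too often.
breaker = CircuitBreaker()
for row_error in ["NULL_TRANSACTION_ID"] * 150:   # simulated error stream
    if breaker.record_error(row_error):
        raise RuntimeError("Circuit breaker tripped: halting pipeline run")
```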

3. Safe Publishing Pattern & Authority​

Write-Then-Publish Pattern​

Each dataset run follows this pattern:

  1. Writes data files to a run_id-scoped prefix
  2. Writes a _SUCCESS marker once complete
  3. Consumers only read runs with _SUCCESS markers present
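
A minimal sketch of this pattern using boto3 is shown below. The bucket name, prefixes, and helper names are assumptions; in practice the data files themselves would be written by the Glue job, with only the marker publication and the consumer check handled as shown.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # assumption: bucket name is illustrative

def publish_success_marker(run_prefix: str) -> None:
    """Step 2: write an empty _SUCCESS object once all data files are written."""
    s3.put_object(Bucket=BUCKET, Key=f"{run_prefix}/_SUCCESS", Body=b"")

def run_is_complete(run_prefix: str) -> bool:
    """Step 3: consumers only read runs whose _SUCCESS marker exists."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{run_prefix}/_SUCCESS", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

run_prefix = "silver/payments/transactions/year=2024/month=01/schema_v=v2/run_id=20240115T0200Z"
# ... Glue job writes part-*.parquet files under run_prefix first (step 1) ...
publish_success_marker(run_prefix)
assert run_is_complete(run_prefix)
```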

Authority Mechanisms​

Two mechanisms manage authoritative datasets:

  • _LATEST.json: Control-plane object pointing to the authoritative run (used for promotion/rollback)
  • current/ Prefix: Stable location for Amazon Athena queries via AWS Glue Data Catalog tables
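
To illustrate the control-plane mechanism, the sketch below shows what _LATEST.json might contain and how promotion could update it. The field names, the pointer's location, and the promote_run helper are assumptions, not a documented contract.

```python
import json
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # illustrative bucket name

def promote_run(dataset_prefix: str, run_id: str, schema_version: str) -> None:
    """Point _LATEST.json at the validated run (control-plane promotion)."""
    pointer = {
        "run_id": run_id,                 # authoritative run to read
        "schema_v": schema_version,       # schema version of that run
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    }
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{dataset_prefix}/_LATEST.json",
        Body=json.dumps(pointer).encode("utf-8"),
        ContentType="application/json",
    )

# Example: promote a Gold run after validation passes.
promote_run("gold/finance/monthly_transaction_summary", "20240115T0200Z", "v2")
```

Rollback would be the same operation pointed at an earlier run_id; the current/ prefix would be refreshed separately (for example, by copying the promoted run's files into it).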

Detailed publishing patterns: See Safe Publishing Pattern & Authority in Appendix F.


4. Schema Evolution Strategy​

As business requirements change, our architecture must adapt without downtime. We manage schema evolution with:

  • Backward-compatible reads: Updated consumers can read both old and new data
  • Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
  • Schema versioning: All changes use new schema_v versions
  • Parquet-only (current): Additive changes only, with Glue Data Catalog as source of truth
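
As a sketch of backward-compatible reads under the additive-only policy, the PySpark snippet below reads two schema versions of a Silver dataset and aligns them by column name, filling columns absent from the older version with nulls. The paths and dataset names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative paths: two schema versions of the same Silver dataset.
v1 = spark.read.parquet(
    "s3://example-data-lake/silver/payments/transactions/year=2024/month=01/schema_v=v1/")
v2 = spark.read.parquet(
    "s3://example-data-lake/silver/payments/transactions/year=2024/month=02/schema_v=v2/")

# Additive-only evolution: v2 carries extra columns; align by name and
# fill columns missing from v1 with nulls so updated consumers read both.
combined = v1.unionByName(v2, allowMissingColumns=True)
combined.printSchema()
```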

Detailed schema evolution strategy: See Schema Evolution Strategy in Appendix D for compatibility policies, breaking change playbooks, and type change examples.


5. Backfills & Reprocessing​

  • Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03)
  • New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL)
  • Promotion occurs by updating _LATEST.json and current/ after validation
  • No historical data is overwritten; all run_id paths are preserved for audit
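
A minimal sketch of this flow under the rules above, reusing the hypothetical promote_run helper from Section 3; the dataset path, validation call, and run_id format are illustrative.

```python
# Illustrative backfill of a single affected partition (year=2024, month=03).
backfill_run_id = "20260128_BACKFILL"      # new run_id; nothing is overwritten
target_prefix = "silver/payments/transactions/year=2024/month=03"

# 1. Recompute only the affected partition into the new run_id path.
output_path = f"s3://example-data-lake/{target_prefix}/schema_v=v2/run_id={backfill_run_id}/"
# corrected_df.write.parquet(output_path)   # written by the Glue job

# 2. After validation passes, promote by updating _LATEST.json and current/.
# if validation_passes(output_path):
#     promote_run(f"{target_prefix}/schema_v=v2", backfill_run_id, "v2")
```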

Detailed backfill flow: See Backfills & Reprocessing in Appendix D.


6. Ownership & Governance​

| Layer | Owner | Steward | Responsibility | Implementation Status |
| --- | --- | --- | --- | --- |
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | ✅ Implemented (Task 1) |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | ✅ Implemented (Task 1) |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Task 2: Complete Architecture Design, Task 3: SQL Aggregation |

Rules​

  • Bronze is immutable and restricted: Only platform team can write; no direct business user access
  • Silver schema changes require domain + platform review: Domain team proposes, platform team implements
  • Gold changes require explicit approval and versioning: Finance controller approves schema changes; all changes are versioned via schema_v

Detailed governance workflow: See Ownership & Governance in Appendix D.


7. Architecture Diagram​

🔄 FLOW PERSPECTIVE: This diagram shows the dynamic data flow: how data moves and transforms through the system from ingestion to consumption. Think of this as answering: "How does data flow through the pipeline and what processes transform it?"

This simplified diagram shows the high-level data flow through the lake:

Detailed architecture diagram: See Complete Architecture Diagram in Appendix D for the full diagram with all components, AWS services, and implementation status.

Key Components​

  • Bronze Layer: Raw, immutable CSV files (S3)
  • ETL Pipeline: AWS Glue PySpark job validates and transforms data
  • Silver Layer: Validated Parquet files partitioned by year/month (S3)
  • Quarantine: Invalid rows preserved for audit (S3)
  • Gold Layer: Business aggregates with _LATEST.json and current/ prefix (S3)
  • Athena: SQL query engine reading via Glue Data Catalog
  • Glue Data Catalog: Table metadata and partition registration

Athena & Glue Separation: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines to use the same Glue Catalog metadata.
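
To illustrate this separation, the boto3 sketch below submits a query to Athena that resolves the table definition through the Glue Data Catalog. The database, table, column names, and results location are assumptions for illustration.

```python
import boto3

athena = boto3.client("athena")

# Athena executes the SQL; the table's schema, location, and partitions are
# resolved from the Glue Data Catalog database named in the execution context.
response = athena.start_query_execution(
    QueryString="""
        SELECT as_of_month, SUM(total_amount) AS monthly_total
        FROM monthly_transaction_summary   -- Gold table over the current/ prefix
        GROUP BY as_of_month
        ORDER BY as_of_month
    """,
    QueryExecutionContext={"Database": "gold_finance"},           # illustrative Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```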

System Context Diagram (C4)​

This C4 diagram shows the system context and relationships between users and systems:


8. Failure Mode Analysis​

This section documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.

Key Failure Scenarios​

  • If _SUCCESS Marker is Missing: Consumers cannot identify complete runs, leading to incomplete queries
  • If _LATEST.json is Missing: No authoritative pointer to current dataset, making promotion/rollback impossible
  • If Glue Data Catalog is Missing: Athena cannot locate tables, all SQL queries fail
  • If Quarantine Layer is Removed: Invalid data is silently dropped, audit trail lost
  • If Bronze Layer is Overwritten: Historical audit trail lost; backfills become impossible
  • If run_id Isolation is Removed: Reruns overwrite previous outputs, causing data loss
  • If Schema Versioning is Removed: Schema changes break existing consumers; no backward compatibility

System Resilience Principles​

  1. Fail-Safe Defaults: System fails in a safe state (no partial data published)
  2. Defense in Depth: Multiple layers of validation prevent single points of failure
  3. Auditability: All operations are logged (CloudTrail, CloudWatch)
  4. Immutability: Critical layers (Bronze) are append-only
  5. Idempotency: Reruns are safe; each run writes to unique paths

Detailed failure mode analysis: See Failure Mode Analysis in Appendix D for complete impact, symptoms, root causes, detection, and mitigation strategies.

