Task 2: Data Lake Architecture Design
Important Note on Scope
⚠️ Scope Disclaimer: This architecture design is based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture presented here should be viewed as a starting point that would require:
- Validation against actual data volumes and growth patterns
- Alignment with existing infrastructure and operational practices
- Refinement based on real-world compliance and regulatory requirements
- Collaboration with the Ohpen team to understand production constraints
This design demonstrates architectural thinking and best practices, but would need significant refinement for production deployment.
1. Lake Architecture Overview
The lake follows a Bronze / Silver / Gold medallion architecture model.
| Layer | Purpose | Mutability | Primary Consumers | Implementation Status |
|---|---|---|---|---|
| Bronze | Raw audit trail | Immutable | Platform engineers, audit | ✅ Implemented (Task 1) |
| Silver | Validated analytics | Append-only | Analysts, data scientists | ✅ Implemented (Task 1) |
| Gold | Business contracts | Versioned | Finance, BI, stakeholders | Task 2: Structure Design, Task 3: SQL Aggregation |
Note:
- Task 1 (ETL Pipeline): Implements the Bronze → Silver transformation only (does not write Gold layer files)
- Task 2 (Architecture): Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned). The Gold layer structure, governance, and ownership model are described here (folder organization, promotion workflows, business metric definitions)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how to create Gold data)
Data Lifecycle State Diagram
This state diagram visualizes the data lifecycle and state transitions through the medallion architecture:
2. Folder Structure (S3)
We adopt a structured Data Lake organization to support scalability, reproducibility, and data quality.
Data Lake Structure Diagram
STORAGE PERSPECTIVE: This diagram shows the static storage organization: how data is physically organized in S3 folders, what partitioning schemes are used, and what file formats are stored. Think of this as answering: "Where is the data stored and how is it organized?"
This simplified diagram shows the S3 folder structure with partitioning keys on the arrows:
Detailed folder structure examples: See Complete Folder Structure Examples in Appendix D.
Key Distinction: Storage vs. Flow
- Structure Diagram (above): Shows WHERE data lives: the S3 folder hierarchy, partition keys, file formats, and storage organization
- Architecture Diagram (Section 7): Shows HOW data moves: the process flow from CSV ingestion through ETL transformations to final consumption
These two perspectives are complementary: the structure diagram answers storage questions (e.g., "Where do I find Bronze data from January 2024?"), while the architecture diagram answers process questions (e.g., "What happens to data after it's ingested?").
Key Path Patterns
- Bronze: `bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz`
- Silver: `silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet`
- Quarantine: `quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet`
- Condemned: `quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet`
- Gold: `gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet`
- Gold (current): `gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet`
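To make these patterns concrete, the following minimal Python sketch shows how path-builder helpers could encode the conventions above. The function names, signatures, and example values are illustrative assumptions, not part of the Task 1 code.

```python
from datetime import date

# Hypothetical helpers illustrating the path patterns above; names and
# signatures are assumptions made for this sketch.

def bronze_path(domain: str, dataset: str, ingest_date: date, run_id: str) -> str:
    """Raw landing path, partitioned by ingest date and isolated per run."""
    return (
        f"bronze/{domain}/{dataset}/"
        f"ingest_date={ingest_date:%Y-%m-%d}/run_id={run_id}/"
    )

def silver_path(domain: str, dataset: str, year: int, month: int,
                schema_version: str, run_id: str) -> str:
    """Validated Parquet path, partitioned by business time and schema version."""
    return (
        f"silver/{domain}/{dataset}/"
        f"year={year}/month={month:02d}/schema_v={schema_version}/run_id={run_id}/"
    )

def gold_current_path(domain: str, dataset: str, schema_version: str, partition: str) -> str:
    """Stable 'current' prefix that Gold consumers and Athena tables point at."""
    return f"gold/{domain}/{dataset}/schema_v={schema_version}/current/{partition}/"

# Example:
# bronze_path("payments", "transactions", date(2024, 1, 15), "20240115T0830Z")
# -> "bronze/payments/transactions/ingest_date=2024-01-15/run_id=20240115T0830Z/"
```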
Key Design Choices
- `run_id` Isolation: Every ETL run writes to a unique path. This prevents partial overwrites and allows safe backfills.
- Quarantine Layer: We never silently drop data. Bad rows are preserved for audit and debugging with metadata enrichment.
- Condemned Layer: Rows exceeding the maximum attempt count, or exact duplicates, are moved to the condemned layer (no automatic retries). Human review and approval are required before reprocessing or deletion.
- Loop Prevention: TransactionID deduplication, metadata enrichment, duplicate detection, and attempt limit enforcement prevent infinite retry loops.
- Circuit Breaker: The pipeline halts automatically when more than 100 occurrences of the same error are observed within a 1-hour window (see the sketch after this list).
- Partitioning:
  - Bronze: Partitioned by `ingest_date` (when the data arrived)
  - Silver: Partitioned by `year`/`month` (business time, derived from `TransactionTimestamp`)
  - Gold: Partitioned by business dimensions (e.g., `as_of_month`)
- Schema Versioning: All Silver and Gold paths include `schema_v={version}` to support schema evolution.
- Object Store Safety: Use the write-then-publish pattern with `_SUCCESS` markers; treat partition discovery as eventually consistent.
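The circuit-breaker rule above can be expressed as a small sliding-window counter. The sketch below is a minimal illustration of that rule only; the class name, threshold wiring, and error codes are assumptions and not the Task 1 implementation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Optional

class ErrorCircuitBreaker:
    """Halts the pipeline when one error type repeats too often in a window."""

    def __init__(self, max_errors: int = 100, window: timedelta = timedelta(hours=1)):
        self.max_errors = max_errors
        self.window = window
        self._events: Dict[str, Deque[datetime]] = defaultdict(deque)

    def record(self, error_code: str, now: Optional[datetime] = None) -> bool:
        """Record one error; return True if the breaker should trip."""
        now = now or datetime.utcnow()
        events = self._events[error_code]
        events.append(now)
        # Drop occurrences that have fallen out of the sliding window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) > self.max_errors

# Illustrative use inside the ETL error handler:
# breaker = ErrorCircuitBreaker()
# if breaker.record("SCHEMA_MISMATCH"):
#     raise RuntimeError("Circuit breaker tripped: halting pipeline run")
```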
3. Safe Publishing Pattern & Authority
Write-Then-Publish Pattern
Each dataset run follows this pattern:
- Writes data files to a `run_id`-scoped prefix
- Writes a `_SUCCESS` marker once complete
- Consumers only read runs whose `_SUCCESS` marker is present
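A minimal boto3 sketch of the pattern is shown below; the bucket name and run prefix are placeholder values, and in practice the marker would be written by the Glue job itself rather than a standalone script.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-data-lake"  # placeholder bucket name
RUN_PREFIX = (
    "silver/payments/transactions/"
    "year=2024/month=01/schema_v=v1/run_id=20240115T0830Z/"
)

def publish_run(bucket: str, run_prefix: str) -> None:
    """Write the zero-byte _SUCCESS marker only after all data files exist."""
    s3.put_object(Bucket=bucket, Key=f"{run_prefix}_SUCCESS", Body=b"")

def run_is_published(bucket: str, run_prefix: str) -> bool:
    """Consumers only read runs whose _SUCCESS marker is present."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{run_prefix}_SUCCESS", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0
```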
Authority Mechanisms
Two mechanisms manage authoritative datasets:
- `_LATEST.json`: Control-plane object pointing to the authoritative run (used for promotion and rollback)
- `current/` prefix: Stable location for Amazon Athena queries via AWS Glue Data Catalog tables
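Promotion and rollback around `_LATEST.json` could look roughly like the sketch below; the bucket name, key layout, and JSON fields are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def promote_run(bucket: str, dataset_prefix: str, run_id: str, run_prefix: str) -> None:
    """Point the control-plane object at a validated run."""
    pointer = {
        "run_id": run_id,
        "run_prefix": run_prefix,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    }
    s3.put_object(
        Bucket=bucket,
        Key=f"{dataset_prefix}_LATEST.json",
        Body=json.dumps(pointer).encode("utf-8"),
        ContentType="application/json",
    )

def resolve_latest(bucket: str, dataset_prefix: str) -> str:
    """Readers resolve the authoritative run by dereferencing _LATEST.json."""
    obj = s3.get_object(Bucket=bucket, Key=f"{dataset_prefix}_LATEST.json")
    return json.loads(obj["Body"].read())["run_prefix"]
```

Rollback is then just re-pointing `_LATEST.json` at a previously validated `run_id`; no data files are rewritten.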
Detailed publishing patterns: See Safe Publishing Pattern & Authority in Appendix F.
4. Schema Evolution Strategy
As business requirements change, our architecture must adapt without downtime. We manage schema evolution with:
- Backward-compatible reads: Updated consumers can read both old and new data
- Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
- Schema versioning: All changes use new `schema_v` versions
- Parquet-only (current): Additive changes only, with the Glue Data Catalog as the source of truth
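Under the additive-only policy, a backward-compatible read across two schema versions can be illustrated with a short PySpark sketch; the bucket, dataset, and paths below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-read").getOrCreate()

# Hypothetical Silver paths for two schema versions of the same dataset.
v1_path = "s3://example-data-lake/silver/payments/transactions/*/*/schema_v=v1/*/"
v2_path = "s3://example-data-lake/silver/payments/transactions/*/*/schema_v=v2/*/"

# mergeSchema unions the v1 and v2 columns; rows written under v1 simply
# surface null for columns that were added in v2.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet(v1_path, v2_path)
)
df.printSchema()
```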
Detailed schema evolution strategy: See Schema Evolution Strategy in Appendix D for compatibility policies, breaking change playbooks, and type change examples.
5. Backfills & Reprocessing
- Backfills recompute only affected partitions (e.g., reprocess `year=2024/month=03`)
- New outputs are written under a new `run_id` (e.g., `run_id=20260128_BACKFILL`)
- Promotion occurs by updating `_LATEST.json` and `current/` after validation
- No historical data is overwritten; all `run_id` paths are preserved for audit
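A hedged PySpark sketch of this flow is shown below; the paths, dataset names, and the placeholder transform stand in for the actual Task 1 validation logic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backfill-2024-03").getOrCreate()

backfill_run_id = "20260128_BACKFILL"
bronze_src = "s3://example-data-lake/bronze/payments/transactions/ingest_date=2024-03-*/"
silver_dst = (
    "s3://example-data-lake/silver/payments/transactions/"
    f"year=2024/month=03/schema_v=v1/run_id={backfill_run_id}/"
)

raw = spark.read.option("header", "true").csv(bronze_src)
validated = raw  # placeholder for the same validation/transformation as Task 1

# Existing run_id paths for 2024-03 are left untouched; the new output sits
# alongside them and only becomes authoritative once _LATEST.json and the
# current/ prefix are updated after validation.
validated.write.mode("overwrite").parquet(silver_dst)
```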
Detailed backfill flow: See Backfills & Reprocessing in Appendix D.
6. Ownership & Governance
| Layer | Owner | Steward | Responsibility | Implementation Status |
|---|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | ✅ Implemented (Task 1) |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | ✅ Implemented (Task 1) |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Task 2: Complete Architecture Design, Task 3: SQL Aggregation |
Rules
- Bronze is immutable and restricted: Only the platform team can write; no direct business user access (see the bucket-policy sketch after this list)
- Silver schema changes require domain + platform review: Domain team proposes, platform team implements
- Gold changes require explicit approval and versioning: The Finance Controller approves schema changes; all changes are versioned via `schema_v`
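The first rule (Bronze write restriction) could also be enforced at the storage layer itself. The sketch below applies one possible S3 bucket policy via boto3; the bucket name and role ARN are placeholders, and in practice the policy would be managed through the Terraform configuration rather than an ad-hoc script.

```python
import json

import boto3

BUCKET = "example-data-lake"                                             # placeholder
PLATFORM_ROLE_ARN = "arn:aws:iam::123456789012:role/data-platform-etl"  # placeholder

# Deny writes and deletes under bronze/ for every principal except the
# platform ETL role, approximating "only the platform team can write".
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyBronzeWritesExceptPlatform",
            "Effect": "Deny",
            "NotPrincipal": {"AWS": PLATFORM_ROLE_ARN},
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/bronze/*",
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```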
Detailed governance workflow: See Ownership & Governance in Appendix D.
7. Architecture Diagram
FLOW PERSPECTIVE: This diagram shows the dynamic data flow: how data moves and transforms through the system from ingestion to consumption. Think of this as answering: "How does data flow through the pipeline and what processes transform it?"
This simplified diagram shows the high-level data flow through the lake:
Detailed architecture diagram: See Complete Architecture Diagram in Appendix D for the full diagram with all components, AWS services, and implementation status.
Key Components
- Bronze Layer: Raw, immutable CSV files (S3)
- ETL Pipeline: AWS Glue PySpark job validates and transforms data
- Silver Layer: Validated Parquet files partitioned by year/month (S3)
- Quarantine: Invalid rows preserved for audit (S3)
- Gold Layer: Business aggregates with `_LATEST.json` and a `current/` prefix (S3)
- Athena: SQL query engine reading via the Glue Data Catalog
- Glue Data Catalog: Table metadata and partition registration
Athena & Glue Separation: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines to use the same Glue Catalog metadata.
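As a concrete illustration of that separation, the sketch below submits SQL to Athena while table metadata is resolved from the Glue Data Catalog; the database, table, and output location are placeholder values.

```python
import time

import boto3

athena = boto3.client("athena")

# Submit a query against a table that is defined in the Glue Data Catalog.
resp = athena.start_query_execution(
    QueryString="SELECT year, month, COUNT(*) AS txn_count "
                "FROM silver_transactions GROUP BY year, month",
    QueryExecutionContext={"Database": "example_lake_db"},          # Glue database (placeholder)
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Athena executes asynchronously, so poll until the query reaches a final state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```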
System Context Diagram (C4)
This C4 diagram shows the system context and relationships between users and systems:
8. Failure Mode Analysis
This section documents what breaks if critical components are removed or fail, making the system's resilience properties explicit.
Key Failure Scenarios
- If `_SUCCESS` Marker is Missing: Consumers cannot identify complete runs, leading to incomplete queries
- If `_LATEST.json` is Missing: No authoritative pointer to the current dataset, making promotion/rollback impossible
- If Glue Data Catalog is Missing: Athena cannot locate tables; all SQL queries fail
- If Quarantine Layer is Removed: Invalid data is silently dropped; the audit trail is lost
- If Bronze Layer is Overwritten: Historical audit trail is lost; backfills become impossible
- If `run_id` Isolation is Removed: Reruns overwrite previous outputs, causing data loss
- If Schema Versioning is Removed: Schema changes break existing consumers; no backward compatibility
System Resilience Principles
- Fail-Safe Defaults: System fails in a safe state (no partial data published)
- Defense in Depth: Multiple layers of validation prevent single points of failure
- Auditability: All operations are logged (CloudTrail, CloudWatch)
- Immutability: Critical layers (Bronze) are append-only
- Idempotency: Reruns are safe; each run writes to unique paths
Detailed failure mode analysis: See Failure Mode Analysis in Appendix D for complete impact, symptoms, root causes, detection, and mitigation strategies.
Related Documentation
Input Sources
- ETL Pipeline Assumptions - What writes to this architecture
- ETL Flow & Pseudocode - Visual representation and pseudocode of data flow
- Human Validation Policy - Approval workflows for data operations
Query Layer
- SQL Query - Example query on this architecture
- SQL Breakdown - Query design rationale and structure
Infrastructure
- CI/CD Workflow - How this architecture is deployed
- Terraform Configuration - Infrastructure as code
Technical Analysis
- AWS Services Analysis - Service selection rationale
- PySpark Optimization - Performance considerations
- Complete Architecture Details - Detailed folder structures, schema evolution, governance workflows, and failure mode analysis