
Task 2: Data Lake Architecture Design

Important Note on Scope​

⚠️ Scope Disclaimer: This architecture design is based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture presented here should be viewed as a starting point that would require:

  • Validation against actual data volumes and growth patterns
  • Alignment with existing infrastructure and operational practices
  • Refinement based on real-world compliance and regulatory requirements
  • Collaboration with the Ohpen team to understand production constraints

This design demonstrates architectural thinking and best practices, but would need significant refinement for production deployment.


1. Lake Architecture Overview​

The lake follows a Bronze / Silver / Gold medallion architecture model.

| Layer | Purpose | Mutability | Primary Consumers | Implementation Status |
| --- | --- | --- | --- | --- |
| Bronze | Raw audit trail | Immutable | Platform engineers, audit | ✅ Implemented (Task 1) |
| Silver | Validated analytics | Append-only | Analysts, data scientists | ✅ Implemented (Task 1) |
| Gold | Business contracts | Versioned | Finance, BI, stakeholders | Task 2: Structure Design, Task 3: SQL Aggregation |

Note:

  • Task 1 (ETL Pipeline): Implements the Bronze → Silver transformation only (does not write Gold layer files)
  • Task 2 (Architecture): Complete data lake architecture design covering all layers (Bronze/Silver/Gold + Quarantine + Condemned). The Gold layer structure, governance, and ownership model are described here (folder organization, promotion workflows, business metric definitions)
  • Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how Gold data is created)

Data Lifecycle State Diagram​

This state diagram visualizes the data lifecycle and state transitions through the medallion architecture:


2. Folder Structure (S3)​

We adopt a structured Data Lake organization to support scalability, reproducibility, and data quality.

Data Lake Structure Diagram​

πŸ“ STORAGE PERSPECTIVE: This diagram shows the static storage organizationβ€”how data is physically organized in S3 folders, what partitioning schemes are used, and what file formats are stored. Think of this as answering: "Where is the data stored and how is it organized?"

This simplified diagram shows the S3 folder structure with partitioning keys on the arrows:

Detailed folder structure examples: See Complete Folder Structure Examples in Appendix D.

Key Distinction: Storage vs. Flow​

  • Structure Diagram (above): Shows WHERE data lives: the S3 folder hierarchy, partition keys, file formats, and storage organization
  • Architecture Diagram (Section 7): Shows HOW data moves: the process flow from CSV ingestion through ETL transformations to final consumption

These two perspectives are complementary: the structure diagram answers storage questions (e.g., "Where do I find Bronze data from January 2024?"), while the architecture diagram answers process questions (e.g., "What happens to data after it's ingested?").

Key Path Patterns​

  • Bronze: bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz
  • Silver: silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet
  • Quarantine: quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
  • Condemned: quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet
  • Gold: gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet
  • Gold (current): gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
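
To make these patterns concrete, the sketch below assembles Bronze and Silver keys from run metadata. The helper names and the example domain, dataset, and run_id values are illustrative assumptions, not part of the Task 1 code.

```python
from datetime import date

# Hypothetical helpers that assemble S3 keys following the path patterns above.
# Names and example values are illustrative only.

def build_bronze_key(domain: str, dataset: str, ingest_date: date,
                     run_id: str, filename: str) -> str:
    """Bronze: partitioned by ingest_date (arrival time)."""
    return (
        f"bronze/{domain}/{dataset}/"
        f"ingest_date={ingest_date:%Y-%m-%d}/run_id={run_id}/{filename}"
    )

def build_silver_key(domain: str, dataset: str, year: int, month: int,
                     schema_version: int, run_id: str, part: str) -> str:
    """Silver: partitioned by business time (year/month from TransactionTimestamp)."""
    return (
        f"silver/{domain}/{dataset}/"
        f"year={year}/month={month:02d}/schema_v=v{schema_version}/run_id={run_id}/{part}"
    )

# Example usage (values are made up):
print(build_bronze_key("payments", "transactions", date(2024, 1, 15),
                       "20240115T0200Z", "transactions.csv.gz"))
print(build_silver_key("payments", "transactions", 2024, 1, 2,
                       "20240115T0200Z", "part-00000.parquet"))
```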

Key Design Choices​

  1. run_id Isolation: Every ETL run writes to a unique path. This prevents partial overwrites and allows safe backfills.
  2. Quarantine Layer: We never silently drop data. Bad rows are preserved for audit and debugging with metadata enrichment.
  3. Condemned Layer: Rows that exceed the maximum retry attempts, or that are exact duplicates, are moved to the condemned layer (no automatic retries). Human review and approval are required before reprocessing or deletion.
  4. Loop Prevention: TransactionID deduplication, metadata enrichment, duplicate detection, and attempt limit enforcement prevent infinite retry loops.
  5. Circuit Breaker: The pipeline halts automatically when more than 100 identical errors occur within a one-hour window (see the sketch after this list).
  6. Partitioning:
    • Bronze: Partitioned by ingest_date (when it arrived)
    • Silver: Partitioned by year/month (business-time, derived from TransactionTimestamp)
    • Gold: Partitioned by business dimensions (e.g., as_of_month)
  7. Schema Versioning: All Silver and Gold paths include schema_v={version} to support schema evolution.
  8. Object Store Safety: Use write-then-publish pattern with _SUCCESS markers; treat partition discovery as eventually consistent.
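
As referenced in design choice 5, the snippet below is a minimal in-process circuit-breaker sketch: it counts occurrences of the same error signature in a sliding one-hour window and signals a halt once the threshold is exceeded. The class name, the halt mechanism, and the simulated error stream are assumptions for illustration; a production pipeline would more likely emit CloudWatch metrics and let an alarm stop the job.

```python
import time
from collections import defaultdict, deque

# Illustrative circuit breaker (assumption: real pipelines would typically
# use CloudWatch metrics/alarms rather than in-process counting).
class CircuitBreaker:
    def __init__(self, threshold: int = 100, window_seconds: int = 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record_error(self, error_signature: str, now: float | None = None) -> bool:
        """Record one error; return True if the pipeline should halt."""
        now = time.time() if now is None else now
        events = self._events[error_signature]
        events.append(now)
        # Drop events that fell outside the one-hour window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold

# Usage: halt the run when the same validation error repeats too often.
breaker = CircuitBreaker()
for row_error in ["NULL_TRANSACTION_ID"] * 150:   # simulated error stream
    if breaker.record_error(row_error):
        raise RuntimeError("Circuit breaker tripped: halting pipeline run")
```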

3. Safe Publishing Pattern & Authority​

Write-Then-Publish Pattern​

Each dataset run follows this pattern:

  1. Writes data files to a run_id-scoped prefix
  2. Writes a _SUCCESS marker once complete
  3. Consumers only read runs with _SUCCESS markers present
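
A minimal sketch of this pattern using boto3 is shown below. The bucket name, prefixes, and helper names are assumptions; in practice the data files themselves would be written by the Glue job, with only the marker publication and the consumer check handled as shown.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # assumption: bucket name is illustrative

def publish_success_marker(run_prefix: str) -> None:
    """Step 2: write an empty _SUCCESS object once all data files are written."""
    s3.put_object(Bucket=BUCKET, Key=f"{run_prefix}/_SUCCESS", Body=b"")

def run_is_complete(run_prefix: str) -> bool:
    """Step 3: consumers only read runs whose _SUCCESS marker exists."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{run_prefix}/_SUCCESS", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

run_prefix = "silver/payments/transactions/year=2024/month=01/schema_v=v2/run_id=20240115T0200Z"
# ... Glue job writes part-*.parquet files under run_prefix first (step 1) ...
publish_success_marker(run_prefix)
assert run_is_complete(run_prefix)
```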

Authority Mechanisms​

Two mechanisms manage authoritative datasets:

  • _LATEST.json: Control-plane object pointing to the authoritative run (used for promotion/rollback)
  • current/ Prefix: Stable location for Amazon Athena queries via AWS Glue Data Catalog tables
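
To illustrate the control-plane mechanism, the sketch below shows what _LATEST.json might contain and how promotion could update it. The field names, the pointer's location, and the promote_run helper are assumptions, not a documented contract.

```python
import json
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # illustrative bucket name

def promote_run(dataset_prefix: str, run_id: str, schema_version: str) -> None:
    """Point _LATEST.json at the validated run (control-plane promotion)."""
    pointer = {
        "run_id": run_id,                 # authoritative run to read
        "schema_v": schema_version,       # schema version of that run
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    }
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{dataset_prefix}/_LATEST.json",
        Body=json.dumps(pointer).encode("utf-8"),
        ContentType="application/json",
    )

# Example: promote a Gold run after validation passes.
promote_run("gold/finance/monthly_transaction_summary", "20240115T0200Z", "v2")
```

Rollback would be the same operation pointed at an earlier run_id; the current/ prefix would be refreshed separately (for example, by copying the promoted run's files into it).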

Detailed publishing patterns: See Safe Publishing Pattern & Authority in Appendix F.


4. Schema Evolution Strategy​

As business requirements change, our architecture must adapt without downtime. We manage schema evolution with:

  • Backward-compatible reads: Updated consumers can read both old and new data
  • Forward-compatible writes: Updated producers may emit additional fields without breaking older consumers
  • Schema versioning: All changes use new schema_v versions
  • Parquet-only (current): Additive changes only, with Glue Data Catalog as source of truth
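
As a sketch of backward-compatible reads under the additive-only policy, the PySpark snippet below reads two schema versions of a Silver dataset and aligns them by column name, filling columns absent from the older version with nulls. The paths and dataset names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative paths: two schema versions of the same Silver dataset.
v1 = spark.read.parquet(
    "s3://example-data-lake/silver/payments/transactions/year=2024/month=01/schema_v=v1/")
v2 = spark.read.parquet(
    "s3://example-data-lake/silver/payments/transactions/year=2024/month=02/schema_v=v2/")

# Additive-only evolution: v2 carries extra columns; align by name and
# fill columns missing from v1 with nulls so updated consumers read both.
combined = v1.unionByName(v2, allowMissingColumns=True)
combined.printSchema()
```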

Detailed schema evolution strategy: See Schema Evolution Strategy in Appendix D for compatibility policies, breaking change playbooks, and type change examples.


5. Backfills & Reprocessing​

  • Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03)
  • New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL)
  • Promotion occurs by updating _LATEST.json and current/ after validation
  • No historical data is overwritten; all run_id paths are preserved for audit
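
A minimal sketch of this flow under the rules above, reusing the hypothetical promote_run helper from Section 3; the dataset path, validation call, and run_id format are illustrative.

```python
# Illustrative backfill of a single affected partition (year=2024, month=03).
backfill_run_id = "20260128_BACKFILL"      # new run_id; nothing is overwritten
target_prefix = "silver/payments/transactions/year=2024/month=03"

# 1. Recompute only the affected partition into the new run_id path.
output_path = f"s3://example-data-lake/{target_prefix}/schema_v=v2/run_id={backfill_run_id}/"
# corrected_df.write.parquet(output_path)   # written by the Glue job

# 2. After validation passes, promote by updating _LATEST.json and current/.
# if validation_passes(output_path):
#     promote_run(f"{target_prefix}/schema_v=v2", backfill_run_id, "v2")
```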

Detailed backfill flow: See Backfills & Reprocessing in Appendix D.


6. Ownership & Governance​

| Layer | Owner | Steward | Responsibility | Implementation Status |
| --- | --- | --- | --- | --- |
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | ✅ Implemented (Task 1) |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | ✅ Implemented (Task 1) |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Task 2: Complete Architecture Design, Task 3: SQL Aggregation |

Rules​

  • Bronze is immutable and restricted: Only platform team can write; no direct business user access
  • Silver schema changes require domain + platform review: Domain team proposes, platform team implements
  • Gold changes require explicit approval and versioning: Finance controller approves schema changes; all changes are versioned via schema_v

Detailed governance workflow: See Ownership & Governance in Appendix D.


7. Architecture Diagram​

🔄 FLOW PERSPECTIVE: This diagram shows the dynamic data flow: how data moves and transforms through the system from ingestion to consumption. Think of this as answering: "How does data flow through the pipeline and what processes transform it?"

This simplified diagram shows the high-level data flow through the lake:

Detailed architecture diagram: See Complete Architecture Diagram in Appendix D for the full diagram with all components, AWS services, and implementation status.

Key Components​

  • Bronze Layer: Raw, immutable CSV files (S3)
  • ETL Pipeline: AWS Glue PySpark job validates and transforms data
  • Silver Layer: Validated Parquet files partitioned by year/month (S3)
  • Quarantine: Invalid rows preserved for audit (S3)
  • Gold Layer: Business aggregates with _LATEST.json and current/ prefix (S3)
  • Athena: SQL query engine reading via Glue Data Catalog
  • Glue Data Catalog: Table metadata and partition registration

Athena & Glue Separation: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines to use the same Glue Catalog metadata.
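
To illustrate this separation, the boto3 sketch below submits a query to Athena that resolves the table definition through the Glue Data Catalog. The database, table, column names, and results location are assumptions for illustration.

```python
import boto3

athena = boto3.client("athena")

# Athena executes the SQL; the table's schema, location, and partitions are
# resolved from the Glue Data Catalog database named in the execution context.
response = athena.start_query_execution(
    QueryString="""
        SELECT as_of_month, SUM(total_amount) AS monthly_total
        FROM monthly_transaction_summary   -- Gold table over the current/ prefix
        GROUP BY as_of_month
        ORDER BY as_of_month
    """,
    QueryExecutionContext={"Database": "gold_finance"},           # illustrative Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```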

System Context Diagram (C4)​

This C4 diagram shows the system context and relationships between users and systems:


8. Failure Mode Analysis​

This section documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.

Key Failure Scenarios​

  • If _SUCCESS Marker is Missing: Consumers cannot identify complete runs, leading to incomplete queries
  • If _LATEST.json is Missing: No authoritative pointer to current dataset, making promotion/rollback impossible
  • If Glue Data Catalog is Missing: Athena cannot locate tables, all SQL queries fail
  • If Quarantine Layer is Removed: Invalid data is silently dropped, audit trail lost
  • If Bronze Layer is Overwritten: Historical audit trail lost; backfills become impossible
  • If run_id Isolation is Removed: Reruns overwrite previous outputs, causing data loss
  • If Schema Versioning is Removed: Schema changes break existing consumers; no backward compatibility

System Resilience Principles​

  1. Fail-Safe Defaults: System fails in a safe state (no partial data published)
  2. Defense in Depth: Multiple layers of validation prevent single points of failure
  3. Auditability: All operations are logged (CloudTrail, CloudWatch)
  4. Immutability: Critical layers (Bronze) are append-only
  5. Idempotency: Reruns are safe; each run writes to unique paths

Detailed failure mode analysis: See Failure Mode Analysis in Appendix D for complete impact, symptoms, root causes, detection, and mitigation strategies.

