
Task 1: Assumptions & Edge Cases

This document captures the explicit assumptions and edge cases considered for the ETL pipeline.

⚠️ Scope Disclaimer: These assumptions are based on the case study requirements and may need validation with the team to ensure they align with actual data volumes, error rates, production compliance requirements, and operational constraints.


Input Assumptions

Schema / Columns

  • Required columns are exactly: TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp
  • If any required column is missing, the job fails fast with a clear error
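
A minimal sketch of the fail-fast check, assuming a pandas DataFrame as input; the function name is illustrative, the column list comes from this document:

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "TransactionID", "CustomerID", "TransactionAmount",
    "Currency", "TransactionTimestamp",
]

def assert_required_columns(df: pd.DataFrame) -> None:
    """Fail fast with a clear error if any required column is absent."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")
```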

Currency Handling

  • Currency codes are validated against an ISO-4217 allowlist subset for the case study
  • Assumption: codes are provided as 3-letter uppercase strings in the input
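
A sketch of the allowlist check; the specific codes shown are placeholders for the case-study subset:

```python
# Placeholder ISO-4217 subset; the real allowlist is a pipeline configuration detail.
CURRENCY_ALLOWLIST = {"EUR", "GBP", "USD"}

def invalid_currency_mask(df):
    """True for rows whose Currency is outside the allowlist (later quarantined as CURRENCY_ERROR)."""
    return ~df["Currency"].isin(CURRENCY_ALLOWLIST)
```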

Timestamp Handling

  • TransactionTimestamp is parsed using pandas with utc=True
  • Assumption: timestamp is either ISO-like (e.g., 2024-01-01T10:00:00Z) or parseable by pandas
  • Partitioning uses the parsed timestamp's year and month (business-time partition)
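
A sketch of the parsing and partition-derivation step, assuming the parsed column is kept as event_ts and the partition columns are named year and month:

```python
import pandas as pd

def add_business_time_partitions(df: pd.DataFrame) -> pd.DataFrame:
    """Parse TransactionTimestamp as UTC and derive business-time partition columns."""
    parsed = pd.to_datetime(df["TransactionTimestamp"], utc=True, errors="coerce")
    out = df.copy()
    out["event_ts"] = parsed        # NaT where the timestamp is unparseable
    out["year"] = parsed.dt.year    # partition columns from the parsed business time
    out["month"] = parsed.dt.month
    return out
```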

Amount Handling

  • Negative amounts are allowed (e.g., withdrawals/refunds)
  • Detecting suspicious values is a separate outlier/fraud control, not a schema validation rule

Operational Assumptions

Idempotency & Reruns

  • Each run writes under a unique run_id (run isolation)
  • Reruns/backfills do not overwrite previous run's outputs directly
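
A sketch of how run isolation could look with a path-per-run layout; the run_id format and prefix layout are assumptions, not the pipeline's exact convention:

```python
import uuid
from datetime import datetime, timezone

def build_run_prefix(base_path: str) -> str:
    """Every run writes under its own run_id, so reruns/backfills never overwrite earlier outputs."""
    run_id = f"{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}-{uuid.uuid4().hex[:8]}"
    return f"{base_path}/run_id={run_id}"
```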

Quarantine as Audit Trail

  • We never silently drop data. Invalid rows remain accessible for audit/debug
  • All quarantine data includes full metadata for traceability
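
A sketch of what a quarantine record could carry; validation_error, attempt_count, row_hash, and run_id appear elsewhere in this document, while quarantined_at is an assumed field name:

```python
from datetime import datetime, timezone

def quarantine_record(row: dict, error: str, run_id: str, attempt_count: int, row_hash: str) -> dict:
    """Keep the original row intact and attach metadata for audit/debug traceability."""
    return {
        **row,                       # original row, never silently dropped
        "validation_error": error,   # e.g. "CURRENCY_ERROR"
        "attempt_count": attempt_count,
        "row_hash": row_hash,
        "run_id": run_id,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
```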

Data Retention

  • Condemned data: 7-year retention for compliance/audit
  • Lifecycle: data transitions to Glacier after 5 years; deletion after 7 years requires human approval
  • Rationale: mortgages can span decades, but condemned data older than 7 years with no ongoing relevance can be safely deleted after human review
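
If the condemned layer lives on S3, the transition part of this policy could be expressed as a lifecycle rule along these lines (the bucket prefix is an assumption); the 7-year deletion is deliberately left out of the automation, since it requires human approval:

```python
# Assumed S3 lifecycle rule for the condemned layer: move to Glacier after ~5 years.
# Expiration is intentionally omitted; deletion after 7 years stays a reviewed, manual step.
CONDEMNED_LIFECYCLE_RULE = {
    "ID": "condemned-retention",
    "Filter": {"Prefix": "condemned/"},
    "Status": "Enabled",
    "Transitions": [{"Days": 5 * 365, "StorageClass": "GLACIER"}],
}
```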

Edge Cases Handled

Schema Validation Errors

  • Missing required columns or type mismatches are quarantined with validation_error = "SCHEMA_ERROR"

Missing Required Fields

  • Any row missing one or more required fields is quarantined with validation_error = "NULL_VALUE_ERROR"
  • attempt_count is incremented
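
A sketch of the row-level null check; the helper name is illustrative and required_columns is the same list used for the schema check:

```python
def null_value_mask(df, required_columns):
    """True for rows missing one or more required fields; those rows are quarantined as NULL_VALUE_ERROR."""
    return df[required_columns].isnull().any(axis=1)
```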

Invalid Currency Codes

  • Rows with currencies outside the allowlist are quarantined with validation_error = "CURRENCY_ERROR"
  • attempt_count is incremented

Invalid Amount Types

  • Rows where TransactionAmount is not parseable as a number are quarantined with validation_error = "TYPE_ERROR"
  • attempt_count is incremented
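
A sketch of the parseability check; null amounts are excluded here because they are already covered by the NULL_VALUE_ERROR rule:

```python
import pandas as pd

def type_error_mask(df):
    """True for rows whose TransactionAmount is present but not parseable as a number (TYPE_ERROR)."""
    amount = pd.to_numeric(df["TransactionAmount"], errors="coerce")
    return amount.isna() & df["TransactionAmount"].notna()
```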

Malformed/Unparseable Timestamps

  • Rows where TransactionTimestamp cannot be parsed are quarantined with validation_error = "TIMESTAMP_ERROR"
  • attempt_count is incremented

Duplicate Account/Date Combinations

  • Rows with duplicate CustomerID + TransactionTimestamp (date part) combinations are flagged and quarantined
  • validation_error = "Duplicate account/date combination" (or appended to existing errors)
  • attempt_count is incremented
  • Note: This is business logic duplicate detection (separate from exact hash duplicate detection)
  • Note: Duplicates are preserved (not dropped) for audit and business review
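
A sketch of the business-logic duplicate flag, assuming pandas; every member of a duplicate group is flagged, and none are dropped:

```python
import pandas as pd

def duplicate_account_date_mask(df):
    """Flag every row sharing a CustomerID + transaction date (date part) with another row."""
    txn_date = pd.to_datetime(df["TransactionTimestamp"], utc=True, errors="coerce").dt.date
    return df.assign(_txn_date=txn_date).duplicated(subset=["CustomerID", "_txn_date"], keep=False)
```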

Exact Duplicate Detection (Loop Prevention)

  • Rows whose row_hash exactly matches an entry in the quarantine history are auto-condemned with validation_error = "DUPLICATE_FAILURE"
  • Row is moved to condemned layer (no automatic retries)
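
A sketch of one way the row_hash could be computed; the exact canonicalisation used by the pipeline may differ:

```python
import hashlib

def row_hash(row: dict) -> str:
    """Deterministic hash of the raw field values, used to recognise exact repeats in quarantine history."""
    canonical = "|".join(f"{col}={row.get(col)}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```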

Max Attempts Exceeded (Loop Prevention)

  • Rows with attempt_count >= 3 (i.e., after the third failure, once the retry limit is reached) are auto-condemned with validation_error = "MAX_ATTEMPTS"
  • Row is moved to condemned layer (no automatic retries)
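
The rule itself is small; a sketch, with the threshold taken from this document:

```python
MAX_ATTEMPTS = 3

def should_condemn(attempt_count: int) -> bool:
    """Once the attempt limit is reached, the row is condemned instead of retried."""
    return attempt_count >= MAX_ATTEMPTS
```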

Circuit Breaker Triggered

  • When more than 100 occurrences of the same error are observed within 1 hour:
    • Pipeline halts with RuntimeError
    • No data is written (valid or quarantine)
    • Requires human intervention to investigate and restart
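
A sketch of the halting check, assuming the caller supplies the error codes observed in the last hour:

```python
from collections import Counter

CIRCUIT_BREAKER_THRESHOLD = 100  # occurrences of the same error within a 1-hour window

def check_circuit_breaker(recent_error_codes: list[str]) -> None:
    """Raise RuntimeError (halting the pipeline before any output is written) if one error code dominates."""
    counts = Counter(recent_error_codes)
    if not counts:
        return
    code, count = counts.most_common(1)[0]
    if count > CIRCUIT_BREAKER_THRESHOLD:
        raise RuntimeError(
            f"Circuit breaker: {count} '{code}' errors within 1 hour; "
            "halting for human investigation, no data written"
        )
```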

Multiple Issues in One Row

  • If a row has multiple validation issues, the error message is preserved and extended so the full set of failures remains explainable

Empty Input

  • If the DataFrame is empty, the job produces no output files and logs a warning (no crash)

All Rows Invalid

  • If every row is invalid, processed output is empty and all rows land in quarantine (still a valid outcome)

Intentionally Out of Scope

These are realistic production requirements but were intentionally not implemented to keep the solution aligned with the case study scope:

  • Cross-run deduplication (e.g., repeated TransactionID across different ingestion runs/files). Within-run duplicates are flagged
  • Outlier detection (e.g., unusually large values, abnormal patterns) and fraud/risk rules
  • Currency conversion (multi-currency reporting requires FX rates, valuation time, and consistent rounding policy)
  • Late-arriving corrections beyond rerunning a period (in production you'd define correction events + reconciliation rules)

Where This Is Validated

The edge cases above are validated in:

  • tests/test_etl.py (unit tests): nulls, invalid currency, malformed timestamps, empty input, all-invalid, determinism, partition columns
  • tests/test_integration.py (integration tests): end-to-end flow, partitioned output structure, _SUCCESS marker, quarantine output

High-level test documentation: Testing Guide

