Task 1: Assumptions & Edge Cases
This document captures the explicit assumptions and edge cases considered for the ETL pipeline.
⚠️ Scope Disclaimer: These assumptions are based on the case study requirements and may need validation with the team to ensure they align with actual data volumes, error rates, production compliance requirements, and operational constraints.
Input Assumptions
Schema / Columns
- Required columns are exactly: `TransactionID`, `CustomerID`, `TransactionAmount`, `Currency`, `TransactionTimestamp`
- If any required column is missing, the job fails fast with a clear error (see the sketch below)
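A minimal sketch of the fail-fast column check, assuming the input arrives as a pandas DataFrame; the constant name and error message are illustrative, not the exact implementation:

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "TransactionID", "CustomerID", "TransactionAmount",
    "Currency", "TransactionTimestamp",
]

def assert_required_columns(df: pd.DataFrame) -> None:
    """Fail fast with a clear error if any required column is absent."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")
```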
Currency Handling
- Currency codes are validated against an ISO-4217 allowlist subset for the case study
- Assumption: codes are provided as 3-letter uppercase strings in the input
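A sketch of the allowlist check; the specific ISO-4217 subset shown here is an assumption, not the actual supported set:

```python
# Illustrative ISO-4217 subset; the real allowlist may differ.
CURRENCY_ALLOWLIST = {"EUR", "USD", "GBP", "CHF"}

def is_valid_currency(code: object) -> bool:
    """True only for 3-letter uppercase codes present in the allowlist."""
    return (
        isinstance(code, str)
        and len(code) == 3
        and code.isupper()
        and code in CURRENCY_ALLOWLIST
    )
```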
Timestamp Handling
- `TransactionTimestamp` is parsed using pandas with `utc=True`
- Assumption: the timestamp is either ISO-like (e.g., `2024-01-01T10:00:00Z`) or otherwise parseable by pandas
- Partitioning uses the parsed timestamp's `year` and `month` (business-time partition); see the sketch below
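A sketch of the parsing and partition-column derivation, assuming `errors="coerce"` is used so unparseable values become `NaT` and are caught by validation downstream:

```python
import pandas as pd

def add_partition_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Parse TransactionTimestamp as UTC and derive business-time partition columns."""
    out = df.copy()
    parsed = pd.to_datetime(out["TransactionTimestamp"], utc=True, errors="coerce")
    out["TransactionTimestamp"] = parsed
    out["year"] = parsed.dt.year    # NaT rows are handled by validation, not here
    out["month"] = parsed.dt.month
    return out
```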
Amount Handling
- Negative amounts are allowed (e.g., withdrawals/refunds)
- Detecting suspicious values is a separate outlier/fraud control, not a schema validation rule
Operational Assumptions
Idempotency & Reruns
- Each run writes under a unique `run_id` (run isolation); see the sketch below
- Reruns/backfills do not overwrite a previous run's outputs directly
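A sketch of run isolation via per-run output prefixes; the path layout and the UUID-based `run_id` are assumptions:

```python
import uuid
from pathlib import Path

def make_run_output_dir(base_dir: str) -> Path:
    """Each run writes under its own run_id so reruns never overwrite earlier outputs."""
    run_id = uuid.uuid4().hex
    run_dir = Path(base_dir) / f"run_id={run_id}"
    run_dir.mkdir(parents=True, exist_ok=False)  # fail rather than reuse an existing run folder
    return run_dir
```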
Quarantine as Audit Trail
- We never silently drop data. Invalid rows remain accessible for audit/debug
- All quarantine data includes full metadata for traceability
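A sketch of the metadata attached to quarantined rows; beyond `validation_error` and `attempt_count`, the field names shown (`run_id`, `source_file`, `quarantined_at`) are assumptions for illustration:

```python
from datetime import datetime, timezone
import pandas as pd

def to_quarantine(rows: pd.DataFrame, error: str, run_id: str, source_file: str) -> pd.DataFrame:
    """Tag invalid rows with traceability metadata instead of dropping them."""
    q = rows.copy()
    q["validation_error"] = error
    q["attempt_count"] = q.get("attempt_count", 0) + 1
    q["run_id"] = run_id
    q["source_file"] = source_file                              # assumed lineage field
    q["quarantined_at"] = datetime.now(timezone.utc).isoformat()
    return q
```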
Data Retention
- Condemned data: 7-year retention for compliance/audit
- Lifecycle: transition to Glacier after 5 years; deletion after 7 years requires human approval
- Rationale: Mortgages can span decades, but condemned data >7 years old with no relevance can be safely deleted after human review
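A sketch of how the Glacier transition could be expressed as an S3 lifecycle rule via boto3, assuming the condemned layer lives under a `condemned/` prefix in an S3 bucket (bucket name and prefix are assumptions). Deletion after 7 years is deliberately not automated here, since it requires human approval:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake-bucket",                     # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "condemned-to-glacier",
                "Filter": {"Prefix": "condemned/"},  # assumed prefix for the condemned layer
                "Status": "Enabled",
                "Transitions": [{"Days": 5 * 365, "StorageClass": "GLACIER"}],
                # No Expiration rule: deletion after 7 years requires human approval.
            }
        ]
    },
)
```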
Edge Cases Handled
Schema Validation Errors
- Missing required columns or type mismatches are quarantined with `validation_error = "SCHEMA_ERROR"`
Missing Required Fields
- Any row missing one or more required fields is quarantined with `validation_error = "NULL_VALUE_ERROR"`
- `attempt_count` is incremented
Invalid Currency Codes
- Rows with currencies outside the allowlist are quarantined with `validation_error = "CURRENCY_ERROR"`
- `attempt_count` is incremented
Invalid Amount Types
- Rows where `TransactionAmount` is not parseable as a number are quarantined with `validation_error = "TYPE_ERROR"`
- `attempt_count` is incremented
Malformed/Unparseable Timestamps
- Rows where `TransactionTimestamp` cannot be parsed are quarantined with `validation_error = "TIMESTAMP_ERROR"`
- `attempt_count` is incremented (a combined validation sketch for these error codes follows below)
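A combined sketch of the row-level checks that produce the error codes above. It collects multiple errors per row (so a row with several issues stays explainable) and increments `attempt_count`; the helper names, the reuse of `is_valid_currency` from the currency sketch above, and joining errors with `"; "` are assumptions:

```python
import pandas as pd

REQUIRED = ["TransactionID", "CustomerID", "TransactionAmount", "Currency", "TransactionTimestamp"]

def validate_row(row: pd.Series) -> list[str]:
    """Return every validation error code that applies to a single row."""
    errors = []
    if any(pd.isna(row.get(col)) for col in REQUIRED):
        errors.append("NULL_VALUE_ERROR")
    if not is_valid_currency(row.get("Currency")):
        errors.append("CURRENCY_ERROR")
    if pd.isna(pd.to_numeric(row.get("TransactionAmount"), errors="coerce")):
        errors.append("TYPE_ERROR")
    if pd.isna(pd.to_datetime(row.get("TransactionTimestamp"), utc=True, errors="coerce")):
        errors.append("TIMESTAMP_ERROR")
    return errors

def quarantine_with_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Select rows with one or more validation errors and tag them for quarantine."""
    errors = df.apply(validate_row, axis=1)
    bad = df[errors.str.len() > 0].copy()
    bad["validation_error"] = errors[errors.str.len() > 0].str.join("; ")  # multiple issues stay visible
    bad["attempt_count"] = bad.get("attempt_count", 0) + 1
    return bad
```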
Duplicate Account/Date Combinations
- Rows with duplicate `CustomerID` + `TransactionTimestamp` (date part) combinations are flagged and quarantined
- `validation_error = "Duplicate account/date combination"` (or appended to existing errors)
- `attempt_count` is incremented
- Note: This is business-logic duplicate detection (separate from exact hash duplicate detection)
- Note: Duplicates are preserved (not dropped) for audit and business review
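A sketch of the business-logic duplicate flag on `CustomerID` plus calendar date; the derived date column name is an assumption:

```python
import pandas as pd

def flag_account_date_duplicates(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows sharing a CustomerID + calendar-date combination."""
    ts = pd.to_datetime(df["TransactionTimestamp"], utc=True, errors="coerce")
    key = pd.DataFrame({"CustomerID": df["CustomerID"], "txn_date": ts.dt.date})
    return key.duplicated(keep=False)  # keep=False marks every member of a duplicate group
```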
Exact Duplicate Detection (Loop Prevention)
- Rows with an exact `row_hash` match in quarantine history are auto-condemned with `validation_error = "DUPLICATE_FAILURE"`
- The row is moved to the condemned layer (no automatic retries)
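A sketch of exact-duplicate detection via a deterministic row hash compared against quarantine history; the hashing scheme (SHA-256 over the sorted column values) is an assumption:

```python
import hashlib
import pandas as pd

def row_hash(row: pd.Series) -> str:
    """Deterministic hash of a row's raw values, used to spot exact repeats in quarantine history."""
    canonical = "|".join(f"{col}={row[col]}" for col in sorted(row.index))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_known_failure(row: pd.Series, quarantine_hashes: set[str]) -> bool:
    """True if this exact row already failed before -> auto-condemn as DUPLICATE_FAILURE."""
    return row_hash(row) in quarantine_hashes
```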
Max Attempts Exceeded (Loop Prevention)
- Rows with `attempt_count >= 3` (after the 3rd failure; at most 3 retries are allowed) are auto-condemned with `validation_error = "MAX_ATTEMPTS"`
- The row is moved to the condemned layer (no automatic retries)
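A minimal sketch of the retry cap; rows that have already failed three times are condemned instead of being re-queued:

```python
MAX_ATTEMPTS = 3

def should_condemn(attempt_count: int) -> bool:
    """After the 3rd failed attempt, stop retrying and move the row to the condemned layer."""
    return attempt_count >= MAX_ATTEMPTS
```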
Circuit Breaker Triggered
- When more than 100 occurrences of the same error are seen within 1 hour (see the sketch below):
  - The pipeline halts with a `RuntimeError`
  - No data is written (valid or quarantine)
  - Human intervention is required to investigate and restart
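A sketch of the circuit breaker, assuming error occurrences are tracked in memory with timestamps during a run; the threshold (100) and window (1 hour) come from the rule above, everything else is illustrative:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

ERROR_THRESHOLD = 100
WINDOW = timedelta(hours=1)

class CircuitBreaker:
    """Halt the pipeline when the same error code repeats too often within the window."""

    def __init__(self) -> None:
        self._events: dict[str, list[datetime]] = defaultdict(list)

    def record(self, error_code: str) -> None:
        now = datetime.now(timezone.utc)
        self._events[error_code].append(now)
        # Keep only events inside the sliding 1-hour window.
        recent = [t for t in self._events[error_code] if now - t <= WINDOW]
        self._events[error_code] = recent
        if len(recent) > ERROR_THRESHOLD:
            raise RuntimeError(
                f"Circuit breaker tripped: >{ERROR_THRESHOLD} '{error_code}' errors within 1 hour"
            )
```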
Multiple Issues in One Row
- If multiple validation issues occur, the error message is preserved/extended so it remains explainable
Empty Input
- If the DataFrame is empty, the job produces no output files and logs a warning (no crash)
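A minimal sketch of the empty-input guard (the logger name and entry-point name are assumptions):

```python
import logging
import pandas as pd

logger = logging.getLogger("etl")  # assumed logger name

def run_etl(df: pd.DataFrame) -> None:
    if df.empty:
        logger.warning("Input DataFrame is empty; no output files will be written.")
        return  # no crash, no output
    # ... continue with validation and writes
```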
All Rows Invalid
- If every row is invalid, processed output is empty and all rows land in quarantine (still a valid outcome)
Intentionally Out of Scope
These are realistic production requirements but were intentionally not implemented to keep the solution aligned with the case study scope:
- Cross-run deduplication (e.g., repeated `TransactionID` across different ingestion runs/files); within-run duplicates are flagged
- Outlier detection (e.g., unusually large values, abnormal patterns) and fraud/risk rules
- Currency conversion (multi-currency reporting requires FX rates, valuation time, and consistent rounding policy)
- Late-arriving corrections beyond rerunning a period (in production you'd define correction events + reconciliation rules)
Where This Is Validated
The edge cases above are validated in:
- `tests/test_etl.py` (unit tests): nulls, invalid currency, malformed timestamps, empty input, all-invalid, determinism, partition columns (an illustrative test shape follows below)
- `tests/test_integration.py` (integration tests): end-to-end flow, partitioned output structure, `_SUCCESS` marker, quarantine output
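A sketch of how one of these edge cases is typically exercised in `tests/test_etl.py`; the `validate` entry point and its return shape are assumptions rather than the actual test code:

```python
import pandas as pd

def test_invalid_currency_is_quarantined():
    df = pd.DataFrame([{
        "TransactionID": "t1", "CustomerID": "c1", "TransactionAmount": 10.0,
        "Currency": "ZZZ", "TransactionTimestamp": "2024-01-01T10:00:00Z",
    }])
    valid, quarantined = validate(df)  # hypothetical entry point returning (valid, quarantined)
    assert valid.empty
    assert quarantined.loc[0, "validation_error"] == "CURRENCY_ERROR"
```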
High-level test documentation: Testing Guide
Related Documentation
Task 1 Documentation
- ETL Flow & Pseudocode - Visual representation and pseudocode of the ETL pipeline
- Full ETL Code - Complete implementation code
- Human Validation Policy - Approval workflows for data operations
- Testing Quick Start - How to run and validate tests
Related Tasks
- Data Lake Architecture - Where this ETL writes validated data
- SQL Query - Example query on processed data
- CI/CD Workflow - How this ETL is deployed and orchestrated
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- Unified Testing Convention - Testing standards across all tasks
- Test Results Overview - Current test execution results