
Testing & Validation Guide

This document describes the comprehensive test suite for validating the Python, SQL, and script code in the Ohpen case study implementation.

Overview

Python Tests (Task 1 - ETL)

Location

  • tasks/01_data_ingestion_transformation/tests/

Test Files

  1. test_etl.py - Unit tests for validation logic

    • Valid data handling
    • Invalid currency detection
    • Null value handling
    • Malformed timestamp detection
    • Partition column generation
    • Run ID in quarantine
    • Empty DataFrame edge case
    • All rows quarantined edge case
    • Deterministic behavior
  2. test_integration.py - End-to-end integration tests

    • Full ETL workflow with temp files
    • Parquet partition structure validation
    • Success marker validation
    • Quarantine data validation
    • Deterministic output verification
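A minimal sketch of what one such unit test can look like (the `validate_transactions` function and its column names are illustrative assumptions, not the actual module API):

```python
import pandas as pd

# Hypothetical validator mirroring the checks listed above: an unknown
# currency, a null amount, or an unparseable timestamp sends a row to quarantine.
VALID_CURRENCIES = {"EUR", "USD", "GBP"}

def validate_transactions(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a frame into (valid, quarantined) row sets."""
    ts = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
    bad = (
        ~df["currency"].isin(VALID_CURRENCIES)
        | df["amount"].isna()
        | ts.isna()
    )
    return df[~bad].copy(), df[bad].copy()

def test_invalid_currency_is_quarantined():
    df = pd.DataFrame({
        "currency": ["EUR", "XXX"],
        "amount": [10.0, 20.0],
        "timestamp": ["2024-01-05T10:00:00Z", "2024-01-06T10:00:00Z"],
    })
    valid, quarantined = validate_transactions(df)
    assert len(valid) == 1
    assert quarantined["currency"].tolist() == ["XXX"]
```

The real tests cover more paths (empty frames, all-invalid frames, run IDs), but they follow this same arrange/act/assert shape.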

Running Python Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/01_data_ingestion_transformation

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run linting
ruff check src/ tests/

# Run all tests
export PYTHONPATH=$PYTHONPATH:$(pwd)
pytest tests/ -v

# Run specific test file
pytest tests/test_etl.py -v
pytest tests/test_integration.py -v
```

Script Tests (Workspace & Task Scripts)

Location

  • tests/scripts/ - Tests for workspace-level scripts
  • tasks/05_communication_documentation/tests/ - Tests for task-level scripts

Test Files

  1. tests/scripts/test_create_submission_bundles.py - Tests for submission bundle creation script

    • Script existence and permissions
    • Bash syntax validation
    • Directory structure creation
    • File copying logic
    • ZIP file generation
    • Error handling
  2. tasks/05_communication_documentation/tests/test_compile_deliverables.py - Tests for PDF compilation script

    • Script syntax validation
    • File reference checks
    • Pandoc integration
    • Error handling

Running Script Tests

```bash
# From project root
cd /home/stephen/projects/ohpen-case-2026

# Run all script tests
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v

# Or use Makefile
make test-scripts

# Run specific test
pytest tests/scripts/test_create_submission_bundles.py -v
```

SQL Tests (Task 3 - Balance Query)

Location

  • tasks/03_sql/tests/

Test Files

  1. test_balance_query.py - DuckDB-based SQL tests

    • SQL syntax validation
    • Expected results verification
    • Missing months handling (NULL values)
    • Last transaction of month selection
  2. test_data.sql - Sample data from Appendix A

  3. expected_output.csv - Expected query results

Running SQL Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/03_sql

# Install dependencies
pip install -r requirements.txt

# Run SQL linting
sqlfluff lint balance_history_2024_q1.sql --dialect postgres

# Run SQL unit tests
pytest tests/ -v
```

CI/CD Integration

The GitHub Actions workflow (.github/workflows/ci.yml) automatically runs all tests:

Jobs

  1. python-validation

    • Ruff linting on source and test files
    • All pytest unit and integration tests
  2. sql-validation

    • SQLFluff linting on SQL queries
    • DuckDB-based SQL unit tests

Triggering CI

```bash
git add .
git commit -m "Add comprehensive test suite"
git push origin master
```

## Test Coverage

### Python ETL Coverage

- ✅ Validation logic (nulls, currency, timestamp)
- ✅ Partition generation (year/month extraction)
- ✅ Quarantine mechanism
- ✅ Success marker creation
- ✅ Edge cases (empty, all invalid)
- ✅ Deterministic behavior (backfill safety)
- ✅ End-to-end workflow
- ✅ Metadata enrichment (row_hash, attempt_count)
- ✅ Loop prevention (duplicate detection, attempt limits)
- ✅ Circuit breaker logic
- ✅ Condemned layer handling
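The determinism and duplicate-detection items above hinge on a stable per-row hash. A sketch of the idea (the actual `row_hash` definition in the ETL may differ):

```python
import hashlib

def row_hash(record: dict) -> str:
    """Stable content hash: sorting the keys first means the same
    logical row always hashes identically, which is what makes
    duplicate detection and backfills reproducible."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"account_id": 1, "amount": "100.00", "currency": "EUR"}
b = {"currency": "EUR", "account_id": 1, "amount": "100.00"}  # same row, shuffled keys
assert row_hash(a) == row_hash(b)
assert row_hash(a) != row_hash({**a, "amount": "100.01"})  # any value change is detected
```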

### Script Coverage

- ✅ Bash script syntax validation
- ✅ Script existence and permissions
- ✅ Directory creation logic
- ✅ File copying operations
- ✅ Error handling
- ✅ Output validation

### SQL Query Coverage

- ✅ Syntax validation (PostgreSQL/ANSI dialect)
- ✅ Expected output verification
- ✅ NULL handling for missing months
- ✅ Last transaction selection logic
- ✅ Cross join (account × month spine)

## Testing Strategy Decisions

### Why DuckDB instead of real PostgreSQL/Aurora

- **Speed**: DuckDB runs in-memory, tests complete in milliseconds
- **Portability**: No external dependencies, works in any CI environment
- **Sufficient**: Validates SQL syntax and logic (the case study requirement)
- **Pragmatic**: Avoids over-engineering for a case study submission

### Why Temp Files instead of Testcontainers for Python

For this case study, we use temporary directories plus mocking instead of Testcontainers (which would spin up Docker containers running an S3-compatible store).

#### Rationale

- **Fast**: Tests run in <2 seconds vs 10+ seconds with container startup
- **Simple**: Fewer dependencies, easier to debug
- **Validates core logic**: Partition generation, validation, quarantine handling
- **Interview scope**: Demonstrates testing best practices without over-engineering

### When to use Testcontainers (production context)

- Testing S3-specific features (multipart uploads, versioning, lifecycle policies)
- Integration tests requiring exact AWS SDK behavior
- End-to-end workflows with Glue, Lambda, Step Functions
- Testing against multiple database engines simultaneously

**Trade-off acknowledged**: Our tests validate transformation logic but not S3 API edge cases. For production deployment at Ohpen, we would add Testcontainers-based integration tests as a second layer.

## Benefits for Interview

This test suite demonstrates:

1. **Production Readiness**: Comprehensive validation before deployment
2. **Data Quality**: Automated checks prevent data corruption
3. **Backfill Safety**: Determinism tests ensure reproducible outputs
4. **Best Practices**: Linting, unit tests, integration tests, CI/CD
5. **Documentation**: Tests serve as executable examples
6. **Pragmatic Engineering**: Chose the right tool for the scope (DuckDB, temp files) while knowing when to upgrade (Testcontainers for production)

## Quick Validation

### Option 1: Docker (Recommended)

Run all tests in an isolated Docker environment:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Run all tests (Python + SQL)

make test

# Or run specific tests

make test-python # Python ETL tests only
make test-sql # SQL query tests only
make test-scripts # Script tests only (runs locally, not in Docker)

```

### Option 2: Local Environment

If you prefer to run tests locally:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Python tests

cd tasks/01_data_ingestion_transformation
pip install -r requirements.txt -r requirements-dev.txt
ruff check src/ tests/
pytest tests/ -v

# SQL tests

cd ../03_sql
pip install -r requirements.txt
sqlfluff lint balance_history_2024_q1.sql --dialect postgres
pytest tests/ -v

# Script tests

cd ../..
pip install pytest
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v
```

Expected output: All tests passing ✅

© 2026 Stephen Adei · CC BY 4.0