
Testing & Validation Guide

This document describes the comprehensive test suite for validating the Python, SQL, and script code in the Ohpen case study implementation.

Overview

Python Tests (Task 1 - ETL)

Location

  • tasks/01_data_ingestion_transformation/tests/

Test Files

  1. test_etl.py - Unit tests for validation logic

    • Valid data handling
    • Invalid currency detection
    • Null value handling
    • Malformed timestamp detection
    • Partition column generation
    • Run ID in quarantine
    • Empty DataFrame edge case
    • All rows quarantined edge case
    • Deterministic behavior
  2. test_integration.py - End-to-end integration tests

    • Full ETL workflow with temp files
    • Parquet partition structure validation
    • Success marker validation
    • Quarantine data validation
    • Deterministic output verification
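A minimal sketch of what one such unit test can look like (the `validate_transactions` function and its column names are illustrative assumptions, not the actual module API):

```python
import pandas as pd

# Hypothetical validator mirroring the checks listed above: an unknown
# currency, a null amount, or an unparseable timestamp sends a row to quarantine.
VALID_CURRENCIES = {"EUR", "USD", "GBP"}

def validate_transactions(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a frame into (valid, quarantined) row sets."""
    ts = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
    bad = (
        ~df["currency"].isin(VALID_CURRENCIES)
        | df["amount"].isna()
        | ts.isna()
    )
    return df[~bad].copy(), df[bad].copy()

def test_invalid_currency_is_quarantined():
    df = pd.DataFrame({
        "currency": ["EUR", "XXX"],
        "amount": [10.0, 20.0],
        "timestamp": ["2024-01-05T10:00:00Z", "2024-01-06T10:00:00Z"],
    })
    valid, quarantined = validate_transactions(df)
    assert len(valid) == 1
    assert quarantined["currency"].tolist() == ["XXX"]
```

The real tests cover more paths (empty frames, all-invalid frames, run IDs), but they follow this same arrange/act/assert shape.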

Running Python Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/01_data_ingestion_transformation

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run linting
ruff check src/ tests/

# Run all tests
export PYTHONPATH=$PYTHONPATH:$(pwd)
pytest tests/ -v

# Run specific test file
pytest tests/test_etl.py -v
pytest tests/test_integration.py -v
```

Script Tests (Workspace & Task Scripts)

Location

  • tests/scripts/ - Tests for workspace-level scripts
  • tasks/05_communication_documentation/tests/ - Tests for task-level scripts

Test Files

  1. tests/scripts/test_create_submission_bundles.py - Tests for submission bundle creation script

    • Script existence and permissions
    • Bash syntax validation
    • Directory structure creation
    • File copying logic
    • ZIP file generation
    • Error handling
  2. tasks/05_communication_documentation/tests/test_compile_deliverables.py - Tests for PDF compilation script

    • Script syntax validation
    • File reference checks
    • Pandoc integration
    • Error handling

Running Script Tests

```bash
# From project root
cd /home/stephen/projects/ohpen-case-2026

# Run all script tests
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v

# Or use Makefile
make test-scripts

# Run specific test
pytest tests/scripts/test_create_submission_bundles.py -v
```

SQL Tests (Task 3 - Balance Query)

Location

  • tasks/03_sql/tests/

Test Files

  1. test_balance_query.py - DuckDB-based SQL tests

    • SQL syntax validation
    • Expected results verification
    • Missing months handling (NULL values)
    • Last transaction of month selection
  2. test_data.sql - Sample data from Appendix A

  3. expected_output.csv - Expected query results

Running SQL Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/03_sql

# Install dependencies
pip install -r requirements.txt

# Run SQL linting
sqlfluff lint balance_history_2024_q1.sql --dialect postgres

# Run SQL unit tests
pytest tests/ -v
```

CI/CD Integration

The GitHub Actions workflow (.github/workflows/ci.yml) automatically runs all tests:

Jobs

  1. python-validation

    • Ruff linting on source and test files
    • All pytest unit and integration tests
  2. sql-validation

    • SQLFluff linting on SQL queries
    • DuckDB-based SQL unit tests

Triggering CI

```bash
git add .
git commit -m "Add comprehensive test suite"
git push origin master
```

## Test Coverage

### Python ETL Coverage

- ✅ Validation logic (nulls, currency, timestamp)
- ✅ Partition generation (year/month extraction)
- ✅ Quarantine mechanism
- ✅ Success marker creation
- ✅ Edge cases (empty, all invalid)
- ✅ Deterministic behavior (backfill safety)
- ✅ End-to-end workflow
- ✅ Metadata enrichment (row_hash, attempt_count)
- ✅ Loop prevention (duplicate detection, attempt limits)
- ✅ Circuit breaker logic
- ✅ Condemned layer handling
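The determinism and duplicate-detection items above hinge on a stable per-row hash. A sketch of the idea (the actual `row_hash` definition in the ETL may differ):

```python
import hashlib

def row_hash(record: dict) -> str:
    """Stable content hash: sorting the keys first means the same
    logical row always hashes identically, which is what makes
    duplicate detection and backfills reproducible."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"account_id": 1, "amount": "100.00", "currency": "EUR"}
b = {"currency": "EUR", "account_id": 1, "amount": "100.00"}  # same row, shuffled keys
assert row_hash(a) == row_hash(b)
assert row_hash(a) != row_hash({**a, "amount": "100.01"})  # any value change is detected
```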

### Script Coverage

- ✅ Bash script syntax validation
- ✅ Script existence and permissions
- ✅ Directory creation logic
- ✅ File copying operations
- ✅ Error handling
- ✅ Output validation

### SQL Query Coverage

- ✅ Syntax validation (PostgreSQL/ANSI dialect)
- ✅ Expected output verification
- ✅ NULL handling for missing months
- ✅ Last transaction selection logic
- ✅ Cross join (account × month spine)

## Testing Strategy Decisions

### Why DuckDB instead of real PostgreSQL/Aurora

- **Speed**: DuckDB runs in-memory, tests complete in milliseconds
- **Portability**: No external dependencies, works in any CI environment
- **Sufficient**: Validates SQL syntax and logic (the case study requirement)
- **Pragmatic**: Avoids over-engineering for a case study submission

### Why Temp Files instead of Testcontainers for Python

For this case study, we use temporary directories plus mocking instead of Testcontainers (which would spin up Docker containers running an S3-compatible store).

#### Rationale

- **Fast**: Tests run in <2 seconds vs 10+ seconds with container startup
- **Simple**: Fewer dependencies, easier to debug
- **Validates core logic**: Partition generation, validation, quarantine handling
- **Interview scope**: Demonstrates testing best practices without over-engineering

### When to use Testcontainers (production context)

- Testing S3-specific features (multipart uploads, versioning, lifecycle policies)
- Integration tests requiring exact AWS SDK behavior
- End-to-end workflows with Glue, Lambda, Step Functions
- Testing against multiple database engines simultaneously

**Trade-off acknowledged**: Our tests validate transformation logic but not S3 API edge cases. For production deployment at Ohpen, we would add Testcontainers-based integration tests as a second layer.

## Benefits for Interview

This test suite demonstrates:

1. **Production Readiness**: Comprehensive validation before deployment
2. **Data Quality**: Automated checks prevent data corruption
3. **Backfill Safety**: Determinism tests ensure reproducible outputs
4. **Best Practices**: Linting, unit tests, integration tests, CI/CD
5. **Documentation**: Tests serve as executable examples
6. **Pragmatic Engineering**: Chose the right tool for the scope (DuckDB, temp files) while knowing when to upgrade (Testcontainers for production)

## Quick Validation

### Option 1: Docker (Recommended)

Run all tests in an isolated Docker environment:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Run all tests (Python + SQL)

make test

# Or run specific tests

make test-python # Python ETL tests only
make test-sql # SQL query tests only
make test-scripts # Script tests only (runs locally, not in Docker)

```

### Option 2: Local Environment

If you prefer to run tests locally:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Python tests

cd tasks/01_data_ingestion_transformation
pip install -r requirements.txt -r requirements-dev.txt
ruff check src/ tests/
pytest tests/ -v

# SQL tests

cd ../03_sql
pip install -r requirements.txt
sqlfluff lint balance_history_2024_q1.sql --dialect postgres
pytest tests/ -v

# Script tests

cd ../..
pip install pytest
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v
```

Expected output: All tests passing ✅

© 2026 Stephen Adei · CC BY 4.0