Skip to main content

SQL Script Isolated Testing

Overview

The SQL query (balance_history_2024_q1.sql) is tested in complete isolation using DuckDB, an in-memory SQL database. This approach:

No Data Lake Connection Required - Tests run entirely locally
Fast Execution - In-memory database, no network overhead
Reproducible - Same test data every time
Portable - Works on any machine with Python
CI/CD Friendly - No AWS credentials or infrastructure needed

Test Architecture

┌─────────────────────────────────────────┐
│ Test Runner (pytest) │
│ │
│ ┌──────────────────────────────────┐ │
│ │ DuckDB (In-Memory Database) │ │
│ │ - schema.sql (table definition) │ │
│ │ - test_data.sql (sample data) │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ balance_history_2024_q1.sql │ │
│ │ (query under test) │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ expected_output.csv │ │
│ │ (validation) │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘

Test Data

The test uses synthetic test data from tests/test_data.sql:

  • 3 accounts: ACC001, ACC002, ACC003
  • 20 transactions across 2024
  • Covers edge cases:
    • Multiple transactions per month
    • Missing months (NULL balances)
    • Different account activity patterns

Running Tests

# Run tests in Docker (no setup required!)
make test

This will:

  1. Build Docker image with all dependencies
  2. Run tests in isolated container
  3. Clean up automatically

Manual Execution (Local)

# Option 1: Use Makefile with virtual environment
make test-isolated

# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest tests/test_balance_query.py -v

Test Coverage

✅ Syntax Validation

  • Validates SQL syntax is correct
  • Adapts query for DuckDB (TO_CHAR → STRFTIME)

✅ Logic Validation

  • Verifies month-end balance calculation
  • Tests last transaction selection within month
  • Validates account×month spine generation

✅ Edge Cases

  • Missing months: Accounts with no transactions return NULL
  • Multiple transactions: Picks last transaction of month
  • Tie-breaking: Uses id as tie-breaker when dates are equal

✅ Expected Output

  • Compares results against expected_output.csv
  • Validates all accounts and months
  • Checks NULL handling

Test Files

FilePurpose
tests/test_balance_query.pyTest suite (pytest)
tests/test_data.sqlSynthetic test data
tests/expected_output.csvExpected query results
schema.sqlTable schema definition
balance_history_2024_q1.sqlQuery under test

Why Isolation?

Problem: Data Lake Dependency

  • Silver layer data requires:
    • AWS credentials
    • S3 access
    • Parquet file reading
    • Partition discovery
    • Complex setup

Solution: In-Memory Testing

  • DuckDB: Fast, in-memory SQL engine
  • Test Data: Synthetic data that mirrors real structure
  • No Infrastructure: Runs anywhere Python runs
  • Fast: Milliseconds instead of seconds/minutes

Test Data Structure

The test data mirrors the Silver layer structure:

CREATE TABLE transactions (
id INT PRIMARY KEY,
account_id VARCHAR(100),
amount NUMERIC(16,2),
new_balance NUMERIC(16,2),
tx_date TIMESTAMP
);

Query Adaptations

The SQL query uses standard ANSI SQL, but DuckDB requires minor adaptations:

-- Original (Presto/Athena)
TO_CHAR(am.month_end, 'YYYY-MM')

-- DuckDB adaptation
STRFTIME(am.month_end, '%Y-%m')

These adaptations are handled automatically in the test suite.

Integration with CI/CD

The isolated tests integrate seamlessly with CI/CD:

# .github/workflows/ci.yml
sql-validation:
runs-on: ubuntu-latest
steps:

- uses: actions/checkout@v3
- name: Run SQL tests in Docker

run: |
cd tasks/03_sql
make test

Option 2: Direct Python

sql-validation:
runs-on: ubuntu-latest
steps:

- uses: actions/checkout@v3
- name: Set up Python

uses: actions/setup-python@v4
with:
python-version: '3.10'

- name: Install dependencies

run: pip install -r tasks/03_sql/requirements.txt

- name: Run SQL tests

run: |
cd tasks/03_sql
pytest tests/ -v

No AWS credentials needed!
Dockerized for consistency!

Benefits

  1. Fast Feedback - Tests run in milliseconds
  2. No Dependencies - No AWS, S3, or data lake setup
  3. Reproducible - Same results every time
  4. Portable - Works on any machine
  5. CI/CD Ready - No infrastructure setup required
  6. Cost Effective - No AWS charges for testing

Limitations

⚠️ Note: These tests validate SQL logic, not data lake integration.

For integration testing with actual Silver layer data:

  • Requires AWS credentials
  • Needs S3 access to Silver layer
  • Requires Parquet file reading
  • More complex setup

Recommendation: Use isolated tests for development and CI/CD. Use integration tests only for final validation before deployment.

Example Test Output

tests/test_balance_query.py::test_balance_query_syntax_valid PASSED
tests/test_balance_query.py::test_balance_query_returns_expected_results PASSED
tests/test_balance_query.py::test_query_handles_missing_months PASSED
tests/test_balance_query.py::test_query_picks_last_transaction_of_month PASSED

======================== 4 passed in 0.15s ========================

Summary

SQL tests run in complete isolation
No data lake connection required
Fast, reproducible, portable
Perfect for CI/CD integration

The SQL query logic is thoroughly tested without any external dependencies!


Task 3 Documentation

Technical Documentation