SQL Query Breakdown: From Inner to Outer

This document breaks down the balance_history_2024_q1.sql query from the most inner table working outwards, explaining each CTE and the final SELECT statement.

Query Structure Overview

The query uses a nested CTE (Common Table Expression) approach, building from simple to complex:

Textual Flow:

transactions (base table)
    ↓
months (CTE 1 - innermost)
    ↓
unique_accounts (CTE 2)
    ↓
account_months (CTE 3)
    ↓
monthly_last_tx (CTE 4)
    ↓
Final SELECT (outermost)

High-Level Algorithm

The query follows a systematic approach to generate month-end balances:

FUNCTION generate_month_end_balances():
    // Step 1: Generate Month Spine
    months = ['2024-01-31', '2024-02-29', '2024-03-31']

    // Step 2: Get Unique Accounts
    unique_accounts = SELECT DISTINCT account_id
                      FROM transactions
                      WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

    // Step 3: Generate Account×Month Spine (Cartesian Product)
    account_months = CROSS_JOIN(unique_accounts, months)
    // Result: All combinations of (account_id, month_end)

    // Step 4: Find Last Transaction per Account per Month
    monthly_last_tx = SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC
        ) AS row_number
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

    // Filter to only the last transaction (row_number = 1)
    last_tx_per_month = FILTER monthly_last_tx WHERE row_number = 1

    // Step 5: Join Spine with Data
    result = SELECT
        account_months.account_id AS acct,
        last_tx_per_month.new_balance AS balance,
        TO_CHAR(account_months.month_end, 'YYYY-MM') AS month
    FROM account_months
    LEFT JOIN last_tx_per_month
        ON account_months.account_id = last_tx_per_month.account_id
        AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start

    // Step 6: Order Results
    ORDER BY account_months.account_id, account_months.month_end

    RETURN result
END FUNCTION

Query Objective

Generate month-end account balances for Q1 2024 (January, February, March), showing NULL for months with no transactions.

Key Design Decisions

Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit
Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans
LEFT JOIN: Preserves all account×month combinations, showing NULL for months with no activity
Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables
Deterministic Ordering: id DESC as tie-breaker ensures consistent results

Performance Optimizations

For 100M row tables:

Partitioning Strategy:
- Table partitioned by tx_date (month/day)
- WHERE clause enables partition pruning
- Only scans Jan-Mar 2024 partitions
Parquet Optimization (Athena):
- Parquet column statistics (min/max) enable efficient filtering
- Partition metadata enables partition pruning
- Speeds up DISTINCT and window function operations
- Reduces data scanned
Window Function Efficiency:
- Single pass over data (no multiple scans)
- Partitioned by account_id and month
- Ordered by tx_date DESC for last transaction
Join Strategy:
- LEFT JOIN preserves spine (no data loss)
- Join on account_id and month (optimized via partition pruning and Parquet stats)

Breakdown: Inner to Outer

1. Base Table: `transactions` (Innermost - Data Source)

Code Reference: See Base Table Filter in Appendix.

What it does:

Reads from the transactions table (as defined in Appendix A of the business case)
Filters to Q1 2024 (January, February, March)
Partition pruning: If table is partitioned by tx_date, only scans relevant partitions

Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on Task 1 output, a schema transformation/mapping step is required:

TransactionID → id (or generate sequential id)
CustomerID → account_id
TransactionAmount → amount
TransactionTimestamp → tx_date
new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Columns used:

account_id - Account identifier
new_balance - Balance after transaction (per Appendix A spec)
tx_date - Transaction date
id - Transaction ID (for tie-breaking)

2. CTE 1: `months` (First CTE - Month Spine)

Code Reference: See CTE 1: months in Appendix.

What it does:

Creates a hardcoded list of month-end dates for Q1 2024
Purpose: Defines the reporting period explicitly
Result: 3 rows (one per month)

Why month-end dates?

January 31, February 29 (2024 is leap year), March 31
These represent the "as-of" dates for month-end reporting

3. CTE 2: `unique_accounts` (Second CTE - Account Universe)

Code Reference: See CTE 2: unique_accounts in Appendix.

What it does:

Extracts all unique account IDs from the transactions table
Purpose: Defines the "universe" of accounts to report on
Result: One row per unique account

Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.

4. CTE 3: `account_months` (Third CTE - Complete Spine)

Code Reference: See CTE 3: account_months in Appendix.

What it does:

CROSS JOIN: Creates Cartesian product of all accounts × all months
Purpose: Generates a complete "spine" of (account, month) combinations
Result: If there are N accounts and 3 months, produces N×3 rows

Why CROSS JOIN?

Ensures every account appears for every month
Months with no transactions will show NULL balance (explicit gaps)
Without this, accounts with no activity in a month would be missing from results

Example output:

account_id | month_end
-----------|----------
ACC001     | 2024-01-31
ACC001     | 2024-02-29
ACC001     | 2024-03-31
ACC002     | 2024-01-31
ACC002     | 2024-02-29
ACC002     | 2024-03-31
...

5. CTE 4: `monthly_last_tx` (Fourth CTE - Last Transaction per Month)

Code Reference: See CTE 4: monthly_last_tx in Appendix.

What it does:

Window Function: ROW_NUMBER() partitions by account and month
Ordering: tx_date DESC, id DESC - most recent transaction first
Purpose: Identifies the last transaction for each account in each month
Filter: Only rows where rn = 1 are the last transactions

How it works:

Partitions transactions by (account_id, month)
Within each partition, orders by tx_date DESC (newest first)
Assigns row number 1 to the most recent transaction
If multiple transactions on same day, uses id DESC as tie-breaker

Example:

Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction

Result: One row per (account, month) combination that has transactions

6. Final SELECT (Outermost - Final Report)

Code Reference: See Final SELECT in Appendix.

What it does:

LEFT JOIN: Joins the complete spine (account_months) with the last transactions (monthly_last_tx)
Join conditions:
1. account_id matches
2. Month matches (using DATE_TRUNC to align month-end with month-start)
3. Only last transaction (rn = 1)
Result: One row per (account, month) combination
- If transaction exists: shows new_balance
- If no transaction: shows NULL (from LEFT JOIN)

Why LEFT JOIN?

Preserves all rows from account_months (the spine)
Accounts with no transactions in a month get NULL balance
Without LEFT JOIN, those months would be missing from results

Final output format:

acct   | balance | month
-------|---------|--------
ACC001 | 326.00  | 2024-01
ACC001 | NULL    | 2024-02
ACC001 | NULL    | 2024-03
ACC002 | NULL    | 2024-01
ACC002 | -25.50  | 2024-02
ACC002 | -125.50 | 2024-03
...

Execution Order (How SQL Engine Processes It)

Step-by-Step:

Base table scan: transactions table (with partition pruning)
CTE 1: months - Hardcoded values (instant)
CTE 2: unique_accounts - DISTINCT on transactions (requires scan)
CTE 3: account_months - CROSS JOIN (fast, small result set)
CTE 4: monthly_last_tx - Window function on transactions (requires full scan + sort)
Final SELECT: LEFT JOIN between spine and last transactions

Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.

Key Design Patterns

1. Spine Generation Pattern

Problem: Need to show all account×month combinations, even when no data exists
Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
Result: Gaps are explicit (NULL) rather than missing rows

2. Window Function Pattern

Problem: Need last transaction per account per month efficiently
Solution: ROW_NUMBER() with partitioning and ordering
Result: Single pass over data (O(n log n) vs O(n²) for nested loops)

3. Partition Pruning Pattern

Problem: 100M row table, but only need Q1 2024
Solution: WHERE clause on partitioned column (tx_date)
Result: Only scans relevant partitions (Jan, Feb, Mar 2024)

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

Code Reference: See Alternative 1: Simple GROUP BY in Appendix.

Problem: Doesn't show months with no transactions (missing rows)

Alternative 2: SUM(amount) (Wrong per spec)

Code Reference: See Alternative 2: SUM(amount) in Appendix.

Problem: Spec says to use new_balance from last transaction, not sum amounts

Our Approach: Spine + Window Function (Correct)

✅ Shows all account×month combinations (spine)
✅ Uses new_balance from last transaction (per spec)
✅ Efficient for 100M rows (window function, partition pruning)
✅ Handles NULL months correctly

Gold Layer Aggregation

This query demonstrates the Silver → Gold aggregation pattern. The Gold layer structure, governance, and ownership model are best described in Task 2 (Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.

Storing Results in Gold Layer

The query output represents month-end balance snapshots that can be persisted.

Code Reference: See Gold Layer Storage Example in Appendix.

Benefits of Gold Layer Storage

Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
Cost: Reduces Athena query costs (scan less data)
Freshness: Can be refreshed monthly after month-end close
Governance: Gold layer owned by Business/Finance (Gold layer structure, governance, and ownership model are best described in Task 2 architecture)
Stability: Stable schema and business contracts for reporting

Monthly Refresh Pattern

For monthly reporting, this query would run:

After month-end close (e.g., first day of next month)
Compute month-end balances from Silver layer (this query)
Store results in Gold layer (s3://aggregated/monthly_balances/year=YYYY/month=MM/)
Update Glue Data Catalog for Athena queries

This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.

SQL Query Breakdown: From Inner to Outer

Query Structure Overview

High-Level Algorithm

Query Objective

Key Design Decisions

Performance Optimizations

Breakdown: Inner to Outer

1. Base Table: `transactions` (Innermost - Data Source)

2. CTE 1: `months` (First CTE - Month Spine)

3. CTE 2: `unique_accounts` (Second CTE - Account Universe)

4. CTE 3: `account_months` (Third CTE - Complete Spine)

5. CTE 4: `monthly_last_tx` (Fourth CTE - Last Transaction per Month)

6. Final SELECT (Outermost - Final Report)

Execution Order (How SQL Engine Processes It)

Key Design Patterns

1. Spine Generation Pattern

2. Window Function Pattern

3. Partition Pruning Pattern

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

Alternative 2: SUM(amount) (Wrong per spec)

Our Approach: Spine + Window Function (Correct)

Gold Layer Aggregation

Storing Results in Gold Layer

Benefits of Gold Layer Storage

Monthly Refresh Pattern

See Also

Task 3 Documentation

Technical Documentation

Query Structure Overview​

High-Level Algorithm​

Query Objective​

Key Design Decisions​

Performance Optimizations​

Breakdown: Inner to Outer​

1. Base Table: transactions (Innermost - Data Source)​

2. CTE 1: months (First CTE - Month Spine)​

3. CTE 2: unique_accounts (Second CTE - Account Universe)​

4. CTE 3: account_months (Third CTE - Complete Spine)​

5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)​

6. Final SELECT (Outermost - Final Report)​

Execution Order (How SQL Engine Processes It)​

Key Design Patterns​

1. Spine Generation Pattern​

2. Window Function Pattern​

3. Partition Pruning Pattern​

Why This Approach?​

Alternative 1: Simple GROUP BY (Wrong)​

Alternative 2: SUM(amount) (Wrong per spec)​

Our Approach: Spine + Window Function (Correct)​

Gold Layer Aggregation​

Storing Results in Gold Layer​

Benefits of Gold Layer Storage​

Monthly Refresh Pattern​

See Also​

Task 3 Documentation​

Related Tasks​

Technical Documentation​

Query Structure Overview

High-Level Algorithm

Query Objective

Key Design Decisions

Performance Optimizations

Breakdown: Inner to Outer

1. Base Table: `transactions` (Innermost - Data Source)

2. CTE 1: `months` (First CTE - Month Spine)

3. CTE 2: `unique_accounts` (Second CTE - Account Universe)

4. CTE 3: `account_months` (Third CTE - Complete Spine)

5. CTE 4: `monthly_last_tx` (Fourth CTE - Last Transaction per Month)

6. Final SELECT (Outermost - Final Report)

Execution Order (How SQL Engine Processes It)

Key Design Patterns

1. Spine Generation Pattern

2. Window Function Pattern

3. Partition Pruning Pattern

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

Alternative 2: SUM(amount) (Wrong per spec)

Our Approach: Spine + Window Function (Correct)

Gold Layer Aggregation

Storing Results in Gold Layer

Benefits of Gold Layer Storage

Monthly Refresh Pattern

See Also

Task 3 Documentation

Related Tasks

Technical Documentation