Skip to main content

SQL Query Breakdown: From Inner to Outer

This document breaks down the balance_history_2024_q1.sql query from the most inner table working outwards, explaining each CTE and the final SELECT statement.


Query Structure Overview

The query uses a nested CTE (Common Table Expression) approach, building from simple to complex:

Textual Flow:

transactions (base table)

months (CTE 1 - innermost)

unique_accounts (CTE 2)

account_months (CTE 3)

monthly_last_tx (CTE 4)

Final SELECT (outermost)

High-Level Algorithm

The query follows a systematic approach to generate month-end balances:

FUNCTION generate_month_end_balances():
// Step 1: Generate Month Spine
months = ['2024-01-31', '2024-02-29', '2024-03-31']

// Step 2: Get Unique Accounts
unique_accounts = SELECT DISTINCT account_id
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

// Step 3: Generate Account×Month Spine (Cartesian Product)
account_months = CROSS_JOIN(unique_accounts, months)
// Result: All combinations of (account_id, month_end)

// Step 4: Find Last Transaction per Account per Month
monthly_last_tx = SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS row_number
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

// Filter to only the last transaction (row_number = 1)
last_tx_per_month = FILTER monthly_last_tx WHERE row_number = 1

// Step 5: Join Spine with Data
result = SELECT
account_months.account_id AS acct,
last_tx_per_month.new_balance AS balance,
TO_CHAR(account_months.month_end, 'YYYY-MM') AS month
FROM account_months
LEFT JOIN last_tx_per_month
ON account_months.account_id = last_tx_per_month.account_id
AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start

// Step 6: Order Results
ORDER BY account_months.account_id, account_months.month_end

RETURN result
END FUNCTION

Query Objective

Generate month-end account balances for Q1 2024 (January, February, March), showing NULL for months with no transactions.

Key Design Decisions

  1. Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit
  2. Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans
  3. LEFT JOIN: Preserves all account×month combinations, showing NULL for months with no activity
  4. Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables
  5. Deterministic Ordering: id DESC as tie-breaker ensures consistent results

Performance Optimizations

For 100M row tables:

  1. Partitioning Strategy:

    • Table partitioned by tx_date (month/day)
    • WHERE clause enables partition pruning
    • Only scans Jan-Mar 2024 partitions
  2. Parquet Optimization (Athena):

    • Parquet column statistics (min/max) enable efficient filtering
    • Partition metadata enables partition pruning
    • Speeds up DISTINCT and window function operations
    • Reduces data scanned
  3. Window Function Efficiency:

    • Single pass over data (no multiple scans)
    • Partitioned by account_id and month
    • Ordered by tx_date DESC for last transaction
  4. Join Strategy:

    • LEFT JOIN preserves spine (no data loss)
    • Join on account_id and month (optimized via partition pruning and Parquet stats)

Breakdown: Inner to Outer

1. Base Table: transactions (Innermost - Data Source)

Code Reference: See Base Table Filter in Appendix.

What it does:

  • Reads from the transactions table (as defined in Appendix A of the business case)
  • Filters to Q1 2024 (January, February, March)
  • Partition pruning: If table is partitioned by tx_date, only scans relevant partitions

Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on Task 1 output, a schema transformation/mapping step is required:

  • TransactionID → id (or generate sequential id)
  • CustomerID → account_id
  • TransactionAmount → amount
  • TransactionTimestamp → tx_date
  • new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Columns used:

  • account_id - Account identifier
  • new_balance - Balance after transaction (per Appendix A spec)
  • tx_date - Transaction date
  • id - Transaction ID (for tie-breaking)

2. CTE 1: months (First CTE - Month Spine)

Code Reference: See CTE 1: months in Appendix.

What it does:

  • Creates a hardcoded list of month-end dates for Q1 2024
  • Purpose: Defines the reporting period explicitly
  • Result: 3 rows (one per month)

Why month-end dates?

  • January 31, February 29 (2024 is leap year), March 31
  • These represent the "as-of" dates for month-end reporting

3. CTE 2: unique_accounts (Second CTE - Account Universe)

Code Reference: See CTE 2: unique_accounts in Appendix.

What it does:

  • Extracts all unique account IDs from the transactions table
  • Purpose: Defines the "universe" of accounts to report on
  • Result: One row per unique account

Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.


4. CTE 3: account_months (Third CTE - Complete Spine)

Code Reference: See CTE 3: account_months in Appendix.

What it does:

  • CROSS JOIN: Creates Cartesian product of all accounts × all months
  • Purpose: Generates a complete "spine" of (account, month) combinations
  • Result: If there are N accounts and 3 months, produces N×3 rows

Why CROSS JOIN?

  • Ensures every account appears for every month
  • Months with no transactions will show NULL balance (explicit gaps)
  • Without this, accounts with no activity in a month would be missing from results

Example output:

account_id | month_end
-----------|----------
ACC001 | 2024-01-31
ACC001 | 2024-02-29
ACC001 | 2024-03-31
ACC002 | 2024-01-31
ACC002 | 2024-02-29
ACC002 | 2024-03-31
...

5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)

Code Reference: See CTE 4: monthly_last_tx in Appendix.

What it does:

  • Window Function: ROW_NUMBER() partitions by account and month
  • Ordering: tx_date DESC, id DESC - most recent transaction first
  • Purpose: Identifies the last transaction for each account in each month
  • Filter: Only rows where rn = 1 are the last transactions

How it works:

  1. Partitions transactions by (account_id, month)
  2. Within each partition, orders by tx_date DESC (newest first)
  3. Assigns row number 1 to the most recent transaction
  4. If multiple transactions on same day, uses id DESC as tie-breaker

Example:

Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction

Result: One row per (account, month) combination that has transactions


6. Final SELECT (Outermost - Final Report)

Code Reference: See Final SELECT in Appendix.

What it does:

  • LEFT JOIN: Joins the complete spine (account_months) with the last transactions (monthly_last_tx)
  • Join conditions:
    1. account_id matches
    2. Month matches (using DATE_TRUNC to align month-end with month-start)
    3. Only last transaction (rn = 1)
  • Result: One row per (account, month) combination
    • If transaction exists: shows new_balance
    • If no transaction: shows NULL (from LEFT JOIN)

Why LEFT JOIN?

  • Preserves all rows from account_months (the spine)
  • Accounts with no transactions in a month get NULL balance
  • Without LEFT JOIN, those months would be missing from results

Final output format:

acct   | balance | month
-------|---------|--------
ACC001 | 326.00 | 2024-01
ACC001 | NULL | 2024-02
ACC001 | NULL | 2024-03
ACC002 | NULL | 2024-01
ACC002 | -25.50 | 2024-02
ACC002 | -125.50 | 2024-03
...

Execution Order (How SQL Engine Processes It)

Step-by-Step:

  1. Base table scan: transactions table (with partition pruning)
  2. CTE 1: months - Hardcoded values (instant)
  3. CTE 2: unique_accounts - DISTINCT on transactions (requires scan)
  4. CTE 3: account_months - CROSS JOIN (fast, small result set)
  5. CTE 4: monthly_last_tx - Window function on transactions (requires full scan + sort)
  6. Final SELECT: LEFT JOIN between spine and last transactions

Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.


Key Design Patterns

1. Spine Generation Pattern

  • Problem: Need to show all account×month combinations, even when no data exists
  • Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
  • Result: Gaps are explicit (NULL) rather than missing rows

2. Window Function Pattern

  • Problem: Need last transaction per account per month efficiently
  • Solution: ROW_NUMBER() with partitioning and ordering
  • Result: Single pass over data (O(n log n) vs O(n²) for nested loops)

3. Partition Pruning Pattern

  • Problem: 100M row table, but only need Q1 2024
  • Solution: WHERE clause on partitioned column (tx_date)
  • Result: Only scans relevant partitions (Jan, Feb, Mar 2024)

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

Code Reference: See Alternative 1: Simple GROUP BY in Appendix.

Problem: Doesn't show months with no transactions (missing rows)

Alternative 2: SUM(amount) (Wrong per spec)

Code Reference: See Alternative 2: SUM(amount) in Appendix.

Problem: Spec says to use new_balance from last transaction, not sum amounts

Our Approach: Spine + Window Function (Correct)

  • ✅ Shows all account×month combinations (spine)
  • ✅ Uses new_balance from last transaction (per spec)
  • ✅ Efficient for 100M rows (window function, partition pruning)
  • ✅ Handles NULL months correctly

Gold Layer Aggregation

This query demonstrates the Silver → Gold aggregation pattern. The Gold layer structure, governance, and ownership model are best described in Task 2 (Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.

Storing Results in Gold Layer

The query output represents month-end balance snapshots that can be persisted.

Code Reference: See Gold Layer Storage Example in Appendix.

Benefits of Gold Layer Storage

  • Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
  • Cost: Reduces Athena query costs (scan less data)
  • Freshness: Can be refreshed monthly after month-end close
  • Governance: Gold layer owned by Business/Finance (Gold layer structure, governance, and ownership model are best described in Task 2 architecture)
  • Stability: Stable schema and business contracts for reporting

Monthly Refresh Pattern

For monthly reporting, this query would run:

  1. After month-end close (e.g., first day of next month)
  2. Compute month-end balances from Silver layer (this query)
  3. Store results in Gold layer (s3://aggregated/monthly_balances/year=YYYY/month=MM/)
  4. Update Glue Data Catalog for Athena queries

This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.


See Also

Task 3 Documentation

Technical Documentation