SQL Query Breakdown: From Inner to Outer
This document breaks down the balance_history_2024_q1.sql query from the most inner table working outwards, explaining each CTE and the final SELECT statement.
Query Structure Overview
The query uses a nested CTE (Common Table Expression) approach, building from simple to complex:
Textual Flow:
transactions (base table)
↓
months (CTE 1 - innermost)
↓
unique_accounts (CTE 2)
↓
account_months (CTE 3)
↓
monthly_last_tx (CTE 4)
↓
Final SELECT (outermost)
High-Level Algorithm
The query follows a systematic approach to generate month-end balances:
FUNCTION generate_month_end_balances():
// Step 1: Generate Month Spine
months = ['2024-01-31', '2024-02-29', '2024-03-31']
// Step 2: Get Unique Accounts
unique_accounts = SELECT DISTINCT account_id
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
// Step 3: Generate Account×Month Spine (Cartesian Product)
account_months = CROSS_JOIN(unique_accounts, months)
// Result: All combinations of (account_id, month_end)
// Step 4: Find Last Transaction per Account per Month
monthly_last_tx = SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS row_number
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
// Filter to only the last transaction (row_number = 1)
last_tx_per_month = FILTER monthly_last_tx WHERE row_number = 1
// Step 5: Join Spine with Data
result = SELECT
account_months.account_id AS acct,
last_tx_per_month.new_balance AS balance,
TO_CHAR(account_months.month_end, 'YYYY-MM') AS month
FROM account_months
LEFT JOIN last_tx_per_month
ON account_months.account_id = last_tx_per_month.account_id
AND DATE_TRUNC('month', account_months.month_end) = last_tx_per_month.tx_month_start
// Step 6: Order Results
ORDER BY account_months.account_id, account_months.month_end
RETURN result
END FUNCTION
Query Objective
Generate month-end account balances for Q1 2024 (January, February, March), showing NULL for months with no transactions.
Key Design Decisions
- Spine Generation: CROSS JOIN ensures all account×month combinations are present, making gaps explicit
- Window Function: ROW_NUMBER() efficiently identifies last transaction without multiple scans
- LEFT JOIN: Preserves all account×month combinations, showing NULL for months with no activity
- Partition Pruning: WHERE clause on tx_date enables partition pruning for 100M row tables
- Deterministic Ordering: id DESC as tie-breaker ensures consistent results
Performance Optimizations
For 100M row tables:
-
Partitioning Strategy:
- Table partitioned by tx_date (month/day)
- WHERE clause enables partition pruning
- Only scans Jan-Mar 2024 partitions
-
Parquet Optimization (Athena):
- Parquet column statistics (min/max) enable efficient filtering
- Partition metadata enables partition pruning
- Speeds up DISTINCT and window function operations
- Reduces data scanned
-
Window Function Efficiency:
- Single pass over data (no multiple scans)
- Partitioned by account_id and month
- Ordered by tx_date DESC for last transaction
-
Join Strategy:
- LEFT JOIN preserves spine (no data loss)
- Join on account_id and month (optimized via partition pruning and Parquet stats)
Breakdown: Inner to Outer
1. Base Table: transactions (Innermost - Data Source)
Code Reference: See Base Table Filter in Appendix.
What it does:
- Reads from the
transactionstable (as defined in Appendix A of the business case) - Filters to Q1 2024 (January, February, March)
- Partition pruning: If table is partitioned by
tx_date, only scans relevant partitions
Schema Note: This query uses the schema from Appendix A (id, account_id, amount, new_balance, tx_date). Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp). If this query needs to run on Task 1 output, a schema transformation/mapping step is required:
- TransactionID → id (or generate sequential id)
- CustomerID → account_id
- TransactionAmount → amount
- TransactionTimestamp → tx_date
- new_balance would need to be calculated from TransactionAmount (cumulative sum per account)
Columns used:
account_id- Account identifiernew_balance- Balance after transaction (per Appendix A spec)tx_date- Transaction dateid- Transaction ID (for tie-breaking)
2. CTE 1: months (First CTE - Month Spine)
Code Reference: See CTE 1: months in Appendix.
What it does:
- Creates a hardcoded list of month-end dates for Q1 2024
- Purpose: Defines the reporting period explicitly
- Result: 3 rows (one per month)
Why month-end dates?
- January 31, February 29 (2024 is leap year), March 31
- These represent the "as-of" dates for month-end reporting
3. CTE 2: unique_accounts (Second CTE - Account Universe)
Code Reference: See CTE 2: unique_accounts in Appendix.
What it does:
- Extracts all unique account IDs from the transactions table
- Purpose: Defines the "universe" of accounts to report on
- Result: One row per unique account
Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.
4. CTE 3: account_months (Third CTE - Complete Spine)
Code Reference: See CTE 3: account_months in Appendix.
What it does:
- CROSS JOIN: Creates Cartesian product of all accounts × all months
- Purpose: Generates a complete "spine" of (account, month) combinations
- Result: If there are N accounts and 3 months, produces N×3 rows
Why CROSS JOIN?
- Ensures every account appears for every month
- Months with no transactions will show NULL balance (explicit gaps)
- Without this, accounts with no activity in a month would be missing from results
Example output:
account_id | month_end
-----------|----------
ACC001 | 2024-01-31
ACC001 | 2024-02-29
ACC001 | 2024-03-31
ACC002 | 2024-01-31
ACC002 | 2024-02-29
ACC002 | 2024-03-31
...
5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)
Code Reference: See CTE 4: monthly_last_tx in Appendix.
What it does:
- Window Function:
ROW_NUMBER()partitions by account and month - Ordering:
tx_date DESC, id DESC- most recent transaction first - Purpose: Identifies the last transaction for each account in each month
- Filter: Only rows where
rn = 1are the last transactions
How it works:
- Partitions transactions by
(account_id, month) - Within each partition, orders by
tx_date DESC(newest first) - Assigns row number 1 to the most recent transaction
- If multiple transactions on same day, uses
id DESCas tie-breaker
Example:
Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction
Result: One row per (account, month) combination that has transactions
6. Final SELECT (Outermost - Final Report)
Code Reference: See Final SELECT in Appendix.
What it does:
- LEFT JOIN: Joins the complete spine (
account_months) with the last transactions (monthly_last_tx) - Join conditions:
account_idmatches- Month matches (using
DATE_TRUNCto align month-end with month-start) - Only last transaction (
rn = 1)
- Result: One row per (account, month) combination
- If transaction exists: shows
new_balance - If no transaction: shows
NULL(from LEFT JOIN)
- If transaction exists: shows
Why LEFT JOIN?
- Preserves all rows from
account_months(the spine) - Accounts with no transactions in a month get NULL balance
- Without LEFT JOIN, those months would be missing from results
Final output format:
acct | balance | month
-------|---------|--------
ACC001 | 326.00 | 2024-01
ACC001 | NULL | 2024-02
ACC001 | NULL | 2024-03
ACC002 | NULL | 2024-01
ACC002 | -25.50 | 2024-02
ACC002 | -125.50 | 2024-03
...
Execution Order (How SQL Engine Processes It)
Step-by-Step:
- Base table scan:
transactionstable (with partition pruning) - CTE 1:
months- Hardcoded values (instant) - CTE 2:
unique_accounts- DISTINCT on transactions (requires scan) - CTE 3:
account_months- CROSS JOIN (fast, small result set) - CTE 4:
monthly_last_tx- Window function on transactions (requires full scan + sort) - Final SELECT: LEFT JOIN between spine and last transactions
Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.
Key Design Patterns
1. Spine Generation Pattern
- Problem: Need to show all account×month combinations, even when no data exists
- Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
- Result: Gaps are explicit (NULL) rather than missing rows
2. Window Function Pattern
- Problem: Need last transaction per account per month efficiently
- Solution:
ROW_NUMBER()with partitioning and ordering - Result: Single pass over data (O(n log n) vs O(n²) for nested loops)
3. Partition Pruning Pattern
- Problem: 100M row table, but only need Q1 2024
- Solution: WHERE clause on partitioned column (
tx_date) - Result: Only scans relevant partitions (Jan, Feb, Mar 2024)
Why This Approach?
Alternative 1: Simple GROUP BY (Wrong)
Code Reference: See Alternative 1: Simple GROUP BY in Appendix.
Problem: Doesn't show months with no transactions (missing rows)
Alternative 2: SUM(amount) (Wrong per spec)
Code Reference: See Alternative 2: SUM(amount) in Appendix.
Problem: Spec says to use new_balance from last transaction, not sum amounts
Our Approach: Spine + Window Function (Correct)
- ✅ Shows all account×month combinations (spine)
- ✅ Uses
new_balancefrom last transaction (per spec) - ✅ Efficient for 100M rows (window function, partition pruning)
- ✅ Handles NULL months correctly
Gold Layer Aggregation
This query demonstrates the Silver → Gold aggregation pattern. The Gold layer structure, governance, and ownership model are best described in Task 2 (Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.
Storing Results in Gold Layer
The query output represents month-end balance snapshots that can be persisted.
Code Reference: See Gold Layer Storage Example in Appendix.
Benefits of Gold Layer Storage
- Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
- Cost: Reduces Athena query costs (scan less data)
- Freshness: Can be refreshed monthly after month-end close
- Governance: Gold layer owned by Business/Finance (Gold layer structure, governance, and ownership model are best described in Task 2 architecture)
- Stability: Stable schema and business contracts for reporting
Monthly Refresh Pattern
For monthly reporting, this query would run:
- After month-end close (e.g., first day of next month)
- Compute month-end balances from Silver layer (this query)
- Store results in Gold layer (
s3://aggregated/monthly_balances/year=YYYY/month=MM/) - Update Glue Data Catalog for Athena queries
This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.
See Also
Task 3 Documentation
- Full SQL Code - Complete SQL query implementation
- SQL Diagram - Visual representation of query structure
- Full SQL Code Appendix - Complete SQL with comments
- SQL Breakdown Examples - All SQL code snippets referenced in this document
- Isolated Testing Guide - How to test the SQL query
- Testing Documentation - SQL testing overview
Related Tasks
- Data Lake Architecture - Architecture this query runs on
- ETL Pipeline - What creates the data this query reads
Technical Documentation
- Testing Guide - Comprehensive testing documentation
- Unified Testing Convention - Testing standards
- Test Results Overview - Current test execution results