Appendix: SQL Complete Reference
This appendix contains all SQL-related reference materials: code, examples, and diagrams. For a high-level overview, see SQL Breakdown & Flow.
Table of Contentsβ
Part 1: SQL Implementation Codeβ
id: APPENDIX_E_SQL_CODE title: Appendix E - Full SQL Implementation Code sidebar_position: 5
Appendix E: Full SQL Implementation Code
This appendix contains the complete SQL implementation for generating month-end balance history.
File: balance_history_2024_q1.sql
/*
Task 3: Month-End Balance History
Goal: Show account balance history at the end of each month for Jan-Mar 2024.
Constraint: 100M records (requires efficient indexing/partitioning logic).
History Logic: Month-end balance is the last `new_balance` observed within a month (ordered by `tx_date`, tie-broken by `id`).
This implementation keeps months with no transactions as `NULL` (gaps are explicit via the accountΓmonth spine).
If business requirements expect a carry-forward "as-of" balance, extend the final select with a forward-fill of the last known balance.
Time semantics note: this query uses `tx_date` as the ordering/reporting timestamp; if the domain distinguishes transaction vs posting date,
month-end reporting should use the posting/cutoff timestamp.
Schema Note: This query uses the schema from Appendix A of the business case (id, account_id, amount, new_balance, tx_date).
Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp).
If this query needs to run on Task 1 output, a schema transformation/mapping step is required at the beginning or end of the pipeline,
depending on the actual raw data structure and business goals. The transformation would map:
- TransactionID β id (or generate sequential id)
- CustomerID β account_id
- TransactionAmount β amount
- TransactionTimestamp β tx_date
- new_balance would need to be calculated from TransactionAmount (cumulative sum per account)
Engine Note: syntax is standard ANSI/Presto SQL.
*/
-- 1. Generate the spine of all (Account, Month) combinations to ensure gaps are visible
WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
),
unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
),
account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
),
-- 2. Find the last transaction for each account in each month
-- Efficient strategy: Window function partitioning by account/month
monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC -- id tie-breaker ensures deterministic result
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)
-- 3. Join spine with data to produce final report
SELECT
am.account_id AS acct,
tx.new_balance AS balance, -- Formatting string may vary by dialect (Presto: format_datetime)
TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON
am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;
/*
Performance Considerations for 100M rows:
1. Partitioning: The 'transactions' table should be partitioned by tx_date (month/day).
This allows the query engine to prune partitions outside Jan-Mar 2024 immediately.
2. Parquet Optimization (Athena): Parquet column statistics (min/max) for account_id enable efficient filtering. Athena uses partition metadata and Parquet statistics to optimize operations without requiring explicit clustering or indexing.
This makes the 'DISTINCT account_id' and window function operations much faster (avoids heavy shuffle).
*/
/*
Query Explanation:
This query produces month-end balance history for Q1 2024 (January, February, March).
1. Month-end balance definition:
- The balance at the end of a month is defined as the last transaction's `new_balance`
observed within that month (ordered by `tx_date` descending, tie-broken by `id`).
2. AccountΓmonth spine:
- The query generates a complete spine of all (account, month) combinations using
CROSS JOIN to ensure gaps are visible.
- Accounts with no activity in a month return NULL for balance (explicit gaps).
3. Window function approach:
- Uses ROW_NUMBER() partitioned by account and month to identify the last transaction.
- Efficient for large datasets (100M rows) as it avoids multiple scans.
4. Partition pruning:
- The WHERE clause filters to Jan-Mar 2024, enabling partition pruning on `tx_date`.
- This minimizes data scanned and improves query performance.
5. Engine portability:
- Uses standard ANSI/Presto SQL syntax.
- Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines.
- Avoids engine-specific features for maximum portability.
Ownership & Governance (Task 3):
| Aspect | Owner | Responsibility |
| --- | --- | --- |
| **Gold Tables** | Business (Finance) | Analysts query Gold tables only for reporting |
| **Silver Tables** | Domain Teams | Technical users may query Silver for ad-hoc analysis |
| **Bronze Layer** Data Platform Team Not queried for analytics (raw audit trail only)
Rules:
- Analysts query Gold tables only (business contracts, stable schemas).
- Silver tables used by technical users for ad-hoc analysis and debugging.
- Bronze is not queried for analytics (immutable raw data, platform team access only).
*/
See Alsoβ
Task 3 Documentationβ
- SQL Breakdown - Detailed query structure explanation with high-level algorithm
- SQL Breakdown & Flow - Visual representation and breakdown of query flow
- SQL Code Examples - SQL code snippets referenced in breakdown
- SQL Query - The actual SQL file
- Isolated Testing Guide - How to test the SQL query
Related Tasksβ
- Data Lake Architecture - Architecture this query runs on
- ETL Pipeline - What creates the data this query reads
Technical Documentationβ
- Testing Guide - Comprehensive testing documentation
- Unified Testing Convention - Testing standards
Part 2: SQL Code Examplesβ
id: APPENDIX_F_SQL_EXAMPLES title: Appendix F - SQL Breakdown Code Examples sidebar_position: 6
This appendix contains all SQL code snippets referenced in the SQL Breakdown document. These examples illustrate the query structure from inner to outer CTEs.
Base Table Filterβ
The base table filter is the innermost data source:
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
Reference: See Base Table: transactions in SQL Breakdown.
CTE 1: monthsβ
Creates a hardcoded list of month-end dates for Q1 2024:
WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
)
Reference: See CTE 1: months in SQL Breakdown.
CTE 2: unique_accountsβ
Extracts all unique account IDs from the transactions table:
unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
)
Reference: See CTE 2: unique_accounts in SQL Breakdown.
CTE 3: account_monthsβ
Creates a complete spine of all accountΓmonth combinations using CROSS JOIN:
account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
)
Reference: See CTE 3: account_months in SQL Breakdown.
CTE 4: monthly_last_txβ
Identifies the last transaction for each account in each month using a window function:
monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)
Reference: See CTE 4: monthly_last_tx in SQL Breakdown.
Final SELECTβ
The outermost SELECT statement that joins the spine with the last transactions:
SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON
am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;
Reference: See Final SELECT in SQL Breakdown.
Alternative Approaches (Not Used)β
Alternative 1: Simple GROUP BYβ
This approach is incorrect because it doesn't show months with no transactions:
SELECT account_id, MAX(tx_date), new_balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)
Reference: See Alternative 1: Simple GROUP BY (Wrong) in SQL Breakdown.
Alternative 2: SUM(amount)β
This approach is incorrect per specification because it uses SUM instead of the last transaction's new_balance:
SELECT account_id, SUM(amount) as balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)
Reference: See Alternative 2: SUM(amount) (Wrong per spec) in SQL Breakdown.
Gold Layer Storage Exampleβ
Example of how to store query results in the Gold layer:
-- Example: Store query results in Gold layer
CREATE TABLE gold.monthly_balances AS
SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month,
CURRENT_TIMESTAMP AS as_of_timestamp
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;
Reference: See Storing Results in Gold Layer in SQL Breakdown.
See Alsoβ
- SQL Breakdown - Main documentation with detailed explanations
- SQL Implementation Code - Complete SQL query implementation
- SQL Breakdown & Flow - Visual representation and breakdown of query structure
Part 3: SQL Query Diagramsβ
id: APPENDIX_G_SQL_DIAGRAMS title: Appendix G - SQL Query Diagrams sidebar_position: 7
This appendix contains all detailed diagrams for the SQL query breakdown. For a high-level overview, see SQL Breakdown & Flow.
High-Level Query Flowβ
Window Function Detailβ
Data Transformation Pipelineβ
Join Logic Detailβ
Example Data Flowβ
Performance Optimization Strategyβ
Output Structureβ
Error Handling & Edge Casesβ
CTE 4: Window Function Detailed Flowβ
Execution Order (Detailed)β
Related Documentationβ
- SQL Breakdown & Flow - Main documentation with high-level overview
- SQL Implementation Code - Complete SQL implementation
- SQL Code Examples - All SQL code snippets
See Alsoβ
- SQL Breakdown - Main documentation with detailed explanations
- SQL Query - The actual SQL file
- Isolated Testing Guide - How to test the SQL query