Skip to main content

Appendix: SQL Complete Reference

This appendix contains all SQL-related reference materials: code, examples, and diagrams. For a high-level overview, see SQL Breakdown & Flow.


Table of Contents​


Part 1: SQL Implementation Code​

id: APPENDIX_E_SQL_CODE title: Appendix E - Full SQL Implementation Code sidebar_position: 5

Appendix E: Full SQL Implementation Code

This appendix contains the complete SQL implementation for generating month-end balance history.

File: balance_history_2024_q1.sql

/*
Task 3: Month-End Balance History

Goal: Show account balance history at the end of each month for Jan-Mar 2024.
Constraint: 100M records (requires efficient indexing/partitioning logic).
History Logic: Month-end balance is the last `new_balance` observed within a month (ordered by `tx_date`, tie-broken by `id`).
This implementation keeps months with no transactions as `NULL` (gaps are explicit via the accountΓ—month spine).
If business requirements expect a carry-forward "as-of" balance, extend the final select with a forward-fill of the last known balance.
Time semantics note: this query uses `tx_date` as the ordering/reporting timestamp; if the domain distinguishes transaction vs posting date,
month-end reporting should use the posting/cutoff timestamp.

Schema Note: This query uses the schema from Appendix A of the business case (id, account_id, amount, new_balance, tx_date).
Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp).
If this query needs to run on Task 1 output, a schema transformation/mapping step is required at the beginning or end of the pipeline,
depending on the actual raw data structure and business goals. The transformation would map:
- TransactionID β†’ id (or generate sequential id)
- CustomerID β†’ account_id
- TransactionAmount β†’ amount
- TransactionTimestamp β†’ tx_date
- new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

Engine Note: syntax is standard ANSI/Presto SQL.
*/

-- 1. Generate the spine of all (Account, Month) combinations to ensure gaps are visible
WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
),

unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
),

account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
),

-- 2. Find the last transaction for each account in each month
-- Efficient strategy: Window function partitioning by account/month
monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC -- id tie-breaker ensures deterministic result
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)

-- 3. Join spine with data to produce final report
SELECT
am.account_id AS acct,
tx.new_balance AS balance, -- Formatting string may vary by dialect (Presto: format_datetime)
TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON
am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

/*
Performance Considerations for 100M rows:

1. Partitioning: The 'transactions' table should be partitioned by tx_date (month/day).

This allows the query engine to prune partitions outside Jan-Mar 2024 immediately.

2. Parquet Optimization (Athena): Parquet column statistics (min/max) for account_id enable efficient filtering. Athena uses partition metadata and Parquet statistics to optimize operations without requiring explicit clustering or indexing.

This makes the 'DISTINCT account_id' and window function operations much faster (avoids heavy shuffle).
*/

/*
Query Explanation:

This query produces month-end balance history for Q1 2024 (January, February, March).

1. Month-end balance definition:
- The balance at the end of a month is defined as the last transaction's `new_balance`

observed within that month (ordered by `tx_date` descending, tie-broken by `id`).

2. AccountΓ—month spine:
- The query generates a complete spine of all (account, month) combinations using

CROSS JOIN to ensure gaps are visible.

- Accounts with no activity in a month return NULL for balance (explicit gaps).

3. Window function approach:
- Uses ROW_NUMBER() partitioned by account and month to identify the last transaction.
- Efficient for large datasets (100M rows) as it avoids multiple scans.

4. Partition pruning:
- The WHERE clause filters to Jan-Mar 2024, enabling partition pruning on `tx_date`.
- This minimizes data scanned and improves query performance.

5. Engine portability:
- Uses standard ANSI/Presto SQL syntax.
- Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines.
- Avoids engine-specific features for maximum portability.

Ownership & Governance (Task 3):

| Aspect | Owner | Responsibility |
| --- | --- | --- |
| **Gold Tables** | Business (Finance) | Analysts query Gold tables only for reporting |
| **Silver Tables** | Domain Teams | Technical users may query Silver for ad-hoc analysis |
| **Bronze Layer** Data Platform Team Not queried for analytics (raw audit trail only)

Rules:

- Analysts query Gold tables only (business contracts, stable schemas).
- Silver tables used by technical users for ad-hoc analysis and debugging.
- Bronze is not queried for analytics (immutable raw data, platform team access only).

*/

See Also​

Task 3 Documentation​

Technical Documentation​


Part 2: SQL Code Examples​

id: APPENDIX_F_SQL_EXAMPLES title: Appendix F - SQL Breakdown Code Examples sidebar_position: 6

This appendix contains all SQL code snippets referenced in the SQL Breakdown document. These examples illustrate the query structure from inner to outer CTEs.

Base Table Filter​

The base table filter is the innermost data source:

FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

Reference: See Base Table: transactions in SQL Breakdown.

CTE 1: months​

Creates a hardcoded list of month-end dates for Q1 2024:

WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
)

Reference: See CTE 1: months in SQL Breakdown.

CTE 2: unique_accounts​

Extracts all unique account IDs from the transactions table:

unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
)

Reference: See CTE 2: unique_accounts in SQL Breakdown.

CTE 3: account_months​

Creates a complete spine of all accountΓ—month combinations using CROSS JOIN:

account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
)

Reference: See CTE 3: account_months in SQL Breakdown.

CTE 4: monthly_last_tx​

Identifies the last transaction for each account in each month using a window function:

monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)

Reference: See CTE 4: monthly_last_tx in SQL Breakdown.

Final SELECT​

The outermost SELECT statement that joins the spine with the last transactions:

SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON
am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

Reference: See Final SELECT in SQL Breakdown.

Alternative Approaches (Not Used)​

Alternative 1: Simple GROUP BY​

This approach is incorrect because it doesn't show months with no transactions:

SELECT account_id, MAX(tx_date), new_balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Reference: See Alternative 1: Simple GROUP BY (Wrong) in SQL Breakdown.

Alternative 2: SUM(amount)​

This approach is incorrect per specification because it uses SUM instead of the last transaction's new_balance:

SELECT account_id, SUM(amount) as balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Reference: See Alternative 2: SUM(amount) (Wrong per spec) in SQL Breakdown.

Gold Layer Storage Example​

Example of how to store query results in the Gold layer:

-- Example: Store query results in Gold layer
CREATE TABLE gold.monthly_balances AS
SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month,
CURRENT_TIMESTAMP AS as_of_timestamp
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

Reference: See Storing Results in Gold Layer in SQL Breakdown.

See Also​


Part 3: SQL Query Diagrams​

id: APPENDIX_G_SQL_DIAGRAMS title: Appendix G - SQL Query Diagrams sidebar_position: 7

This appendix contains all detailed diagrams for the SQL query breakdown. For a high-level overview, see SQL Breakdown & Flow.

High-Level Query Flow​

Window Function Detail​

Data Transformation Pipeline​

Join Logic Detail​

Example Data Flow​

Performance Optimization Strategy​

Output Structure​

Error Handling & Edge Cases​

CTE 4: Window Function Detailed Flow​

Execution Order (Detailed)​


See Also​

Β© 2026 Stephen Adeiβ€’CC BY 4.0