Appendix: SQL Complete Reference

This appendix contains all SQL-related reference materials: code, examples, and diagrams. For a high-level overview, see SQL Breakdown & Flow.

Part 1: SQL Implementation Code
Part 2: SQL Code Examples
Part 3: SQL Query Diagrams

Part 1: SQL Implementation Code

id: APPENDIX_E_SQL_CODE title: Appendix E - Full SQL Implementation Code sidebar_position: 5

Appendix E: Full SQL Implementation Code

This appendix contains the complete SQL implementation for generating month-end balance history.

File: balance_history_2024_q1.sql

/*
 Task 3: Month-End Balance History

 Goal: Show account balance history at the end of each month for Jan-Mar 2024.
 Constraint: 100M records (requires efficient indexing/partitioning logic).
 History Logic: Month-end balance is the last `new_balance` observed within a month (ordered by `tx_date`, tie-broken by `id`).
 This implementation keeps months with no transactions as `NULL` (gaps are explicit via the account×month spine).
 If business requirements expect a carry-forward "as-of" balance, extend the final select with a forward-fill of the last known balance.
 Time semantics note: this query uses `tx_date` as the ordering/reporting timestamp; if the domain distinguishes transaction vs posting date,
 month-end reporting should use the posting/cutoff timestamp.

 Schema Note: This query uses the schema from Appendix A of the business case (id, account_id, amount, new_balance, tx_date).
 Task 1 ETL produces a different schema (TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp).
 If this query needs to run on Task 1 output, a schema transformation/mapping step is required at the beginning or end of the pipeline,
 depending on the actual raw data structure and business goals. The transformation would map:
 - TransactionID → id (or generate sequential id)
 - CustomerID → account_id
 - TransactionAmount → amount
 - TransactionTimestamp → tx_date
 - new_balance would need to be calculated from TransactionAmount (cumulative sum per account)

 Engine Note: syntax is standard ANSI/Presto SQL.
*/

-- 1. Generate the spine of all (Account, Month) combinations to ensure gaps are visible
WITH months AS (
    SELECT CAST(date_column AS DATE) AS month_end
    FROM (
        VALUES
        ('2024-01-31'),
        ('2024-02-29'),
        ('2024-03-31')
    ) AS t (date_column)
),

unique_accounts AS (
    SELECT DISTINCT account_id FROM transactions
    -- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
),

account_months AS (
    SELECT
        a.account_id,
        m.month_end
    FROM unique_accounts AS a
    CROSS JOIN months AS m
),

-- 2. Find the last transaction for each account in each month
-- Efficient strategy: Window function partitioning by account/month
monthly_last_tx AS (
    SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC -- id tie-breaker ensures deterministic result
        ) AS rn
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)

-- 3. Join spine with data to produce final report
SELECT
    am.account_id AS acct,
    tx.new_balance AS balance, -- Formatting string may vary by dialect (Presto: format_datetime)
    TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON
        am.account_id = tx.account_id
        AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
        AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

/*
Performance Considerations for 100M rows:

1. Partitioning: The 'transactions' table should be partitioned by tx_date (month/day).

   This allows the query engine to prune partitions outside Jan-Mar 2024 immediately.

2. Parquet Optimization (Athena): Parquet column statistics (min/max) for account_id enable efficient filtering. Athena uses partition metadata and Parquet statistics to optimize operations without requiring explicit clustering or indexing.

   This makes the 'DISTINCT account_id' and window function operations much faster (avoids heavy shuffle).
*/

/*
Query Explanation:

This query produces month-end balance history for Q1 2024 (January, February, March).

1. Month-end balance definition:
   - The balance at the end of a month is defined as the last transaction's `new_balance`

     observed within that month (ordered by `tx_date` descending, tie-broken by `id`).

2. Account×month spine:
   - The query generates a complete spine of all (account, month) combinations using

     CROSS JOIN to ensure gaps are visible.

   - Accounts with no activity in a month return NULL for balance (explicit gaps).

3. Window function approach:
   - Uses ROW_NUMBER() partitioned by account and month to identify the last transaction.
   - Efficient for large datasets (100M rows) as it avoids multiple scans.

4. Partition pruning:
   - The WHERE clause filters to Jan-Mar 2024, enabling partition pruning on `tx_date`.
   - This minimizes data scanned and improves query performance.

5. Engine portability:
   - Uses standard ANSI/Presto SQL syntax.
   - Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines.
   - Avoids engine-specific features for maximum portability.

Ownership & Governance (Task 3):

| Aspect | Owner | Responsibility |
| --- | --- | --- |
| **Gold Tables** | Business (Finance) | Analysts query Gold tables only for reporting |
| **Silver Tables** | Domain Teams | Technical users may query Silver for ad-hoc analysis |
| **Bronze Layer**  Data Platform Team  Not queried for analytics (raw audit trail only)

Rules:

- Analysts query Gold tables only (business contracts, stable schemas).
- Silver tables used by technical users for ad-hoc analysis and debugging.
- Bronze is not queried for analytics (immutable raw data, platform team access only).

*/

Part 2: SQL Code Examples

id: APPENDIX_F_SQL_EXAMPLES title: Appendix F - SQL Breakdown Code Examples sidebar_position: 6

This appendix contains all SQL code snippets referenced in the SQL Breakdown document. These examples illustrate the query structure from inner to outer CTEs.

Base Table Filter

The base table filter is the innermost data source:

FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'

Reference: See Base Table: transactions in SQL Breakdown.

CTE 1: months

Creates a hardcoded list of month-end dates for Q1 2024:

WITH months AS (
    SELECT CAST(date_column AS DATE) AS month_end
    FROM (
        VALUES
        ('2024-01-31'),
        ('2024-02-29'),
        ('2024-03-31')
    ) AS t (date_column)
)

Reference: See CTE 1: months in SQL Breakdown.

CTE 2: unique_accounts

Extracts all unique account IDs from the transactions table:

unique_accounts AS (
    SELECT DISTINCT account_id FROM transactions
    -- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
)

Reference: See CTE 2: unique_accounts in SQL Breakdown.

CTE 3: account_months

Creates a complete spine of all account×month combinations using CROSS JOIN:

account_months AS (
    SELECT
        a.account_id,
        m.month_end
    FROM unique_accounts AS a
    CROSS JOIN months AS m
)

Reference: See CTE 3: account_months in SQL Breakdown.

CTE 4: monthly_last_tx

Identifies the last transaction for each account in each month using a window function:

monthly_last_tx AS (
    SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC
        ) AS rn
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)

Reference: See CTE 4: monthly_last_tx in SQL Breakdown.

Final SELECT

The outermost SELECT statement that joins the spine with the last transactions:

SELECT
    am.account_id AS acct,
    tx.new_balance AS balance,
    TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON
        am.account_id = tx.account_id
        AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
        AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

Reference: See Final SELECT in SQL Breakdown.

Alternative Approaches (Not Used)

Alternative 1: Simple GROUP BY

This approach is incorrect because it doesn't show months with no transactions:

SELECT account_id, MAX(tx_date), new_balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Reference: See Alternative 1: Simple GROUP BY (Wrong) in SQL Breakdown.

Alternative 2: SUM(amount)

This approach is incorrect per specification because it uses SUM instead of the last transaction's new_balance:

SELECT account_id, SUM(amount) as balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Reference: See Alternative 2: SUM(amount) (Wrong per spec) in SQL Breakdown.

Gold Layer Storage Example

Example of how to store query results in the Gold layer:

-- Example: Store query results in Gold layer
CREATE TABLE gold.monthly_balances AS
SELECT
    am.account_id AS acct,
    tx.new_balance AS balance,
    TO_CHAR(am.month_end, 'YYYY-MM') AS month,
    CURRENT_TIMESTAMP AS as_of_timestamp
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON am.account_id = tx.account_id
    AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
    AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

Reference: See Storing Results in Gold Layer in SQL Breakdown.

Part 3: SQL Query Diagrams

id: APPENDIX_G_SQL_DIAGRAMS title: Appendix G - SQL Query Diagrams sidebar_position: 7

This appendix contains all detailed diagrams for the SQL query breakdown. For a high-level overview, see SQL Breakdown & Flow.

Appendix: SQL Complete Reference

Table of Contents

Part 1: SQL Implementation Code

Appendix E: Full SQL Implementation Code

See Also

Task 3 Documentation

Technical Documentation

Part 2: SQL Code Examples

Base Table Filter

CTE 1: months

CTE 2: unique_accounts

CTE 3: account_months

CTE 4: monthly_last_tx

Final SELECT

Alternative Approaches (Not Used)

Alternative 1: Simple GROUP BY

Alternative 2: SUM(amount)

Gold Layer Storage Example

See Also

Part 3: SQL Query Diagrams

High-Level Query Flow

Window Function Detail

Data Transformation Pipeline

Join Logic Detail

Example Data Flow

Performance Optimization Strategy

Output Structure

Error Handling & Edge Cases

CTE 4: Window Function Detailed Flow

Execution Order (Detailed)

See Also

Table of Contents​

Part 1: SQL Implementation Code​

Appendix E: Full SQL Implementation Code

See Also​

Task 3 Documentation​

Related Tasks​

Technical Documentation​

Part 2: SQL Code Examples​

Base Table Filter​

CTE 1: months​

CTE 2: unique_accounts​

CTE 3: account_months​

CTE 4: monthly_last_tx​

Final SELECT​

Alternative Approaches (Not Used)​

Alternative 1: Simple GROUP BY​

Alternative 2: SUM(amount)​

Gold Layer Storage Example​

See Also​

Part 3: SQL Query Diagrams​

High-Level Query Flow​

Window Function Detail​

Data Transformation Pipeline​

Join Logic Detail​

Example Data Flow​

Performance Optimization Strategy​

Output Structure​

Error Handling & Edge Cases​

CTE 4: Window Function Detailed Flow​

Execution Order (Detailed)​

Related Documentation​

See Also​

Table of Contents

Part 1: SQL Implementation Code

See Also

Task 3 Documentation

Related Tasks

Technical Documentation

Part 2: SQL Code Examples

Base Table Filter

CTE 1: months

CTE 2: unique_accounts

CTE 3: account_months

CTE 4: monthly_last_tx

Final SELECT

Alternative Approaches (Not Used)

Alternative 1: Simple GROUP BY

Alternative 2: SUM(amount)

Gold Layer Storage Example

See Also

Part 3: SQL Query Diagrams

High-Level Query Flow

Window Function Detail

Data Transformation Pipeline

Join Logic Detail

Example Data Flow

Performance Optimization Strategy

Output Structure

Error Handling & Edge Cases

CTE 4: Window Function Detailed Flow

Execution Order (Detailed)

Related Documentation

See Also