CI/CD Workflow & Infrastructure

Important Note on Scope

⚠️ Scope Disclaimer: This CI/CD workflow design is based on the case study requirements and my interpretation of the problem. I have made assumptions that may underestimate the current scope of operations at Ohpen. The workflow presented here would need refinement based on:

  • Actual deployment frequency and release processes
  • Existing CI/CD infrastructure and tooling
  • Team size, skills, and operational maturity
  • Production change management and approval processes
  • Real-world monitoring and alerting requirements

This design demonstrates DevOps thinking and best practices, but would require collaboration with the Ohpen team for production implementation.


1. Pipeline Design (GitHub Actions)

We implement a "history-safe" CI/CD process. Because we support backfills and reprocessing, the deployment pipeline must provide versioning and safe rollouts.

Workflow Stages

  1. Validation (CI): Runs on every Pull Request.
    • ruff linting (code style).
    • pytest unit tests (partition logic, null handling, quarantine checks).
  2. Artifact Build:
    • Packages the Python ETL code.
    • Tags artifact with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip).
  3. Deployment (CD):
    • Uploads artifact to S3 (Code Bucket).
    • Terraform plan & apply to update AWS infrastructure (Glue Jobs, IAM, Buckets).
    • Updates the Glue Job to point to the new artifact.
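
A minimal sketch of the artifact build and deploy step, assuming hypothetical bucket, job, and role names; the real pipeline performs these steps inside GitHub Actions and Terraform:

```python
# Sketch of the CD step: version the ETL script by Git SHA, upload it to the
# code bucket, and repoint the Glue job. All names/ARNs are placeholders.
import subprocess

import boto3

CODE_BUCKET = "example-code-artifacts"   # hypothetical code bucket
GLUE_JOB = "etl-ingest-transactions"     # hypothetical Glue job name

# Tag the artifact with the short Git SHA (e.g. etl-v1.0.0-a1b2c3d)
sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
key = f"artifacts/etl-v1.0.0-{sha}/ingest_transactions.py"

boto3.client("s3").upload_file("src/etl/ingest_transactions.py", CODE_BUCKET, key)

# Repoint the Glue job at the new artifact. Caution: UpdateJob resets any
# field omitted from JobUpdate, so real deployments should send the full
# job definition (Terraform handles this for us).
boto3.client("glue").update_job(
    JobName=GLUE_JOB,
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/etl-job-role",  # placeholder
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": f"s3://{CODE_BUCKET}/{key}",
        },
    },
)
```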

Backfill Safety Checks

Our tests specifically cover "history safety":

  • Determinism: Rerunning the same input produces the exact same counts.
  • Partitioning: Timestamps strictly map to the correct year=YYYY/month=MM folder.
  • Quarantine: Invalid rows are never silently dropped; they must appear in quarantine.
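
The sketch below illustrates how these checks could be expressed in pytest; the imported names (partition_path, run_etl) are hypothetical stand-ins for the real ETL API:

```python
# Illustrative pytest sketch of the history-safety checks.
from datetime import datetime, timezone

from etl.ingest_transactions import partition_path, run_etl  # hypothetical imports


def test_partitioning_maps_timestamp_to_year_month_folder():
    ts = datetime(2026, 1, 21, 12, 0, tzinfo=timezone.utc)
    assert partition_path(ts) == "year=2026/month=01"


def test_determinism_same_input_same_counts():
    first = run_etl("tests/fixtures/transactions.csv")
    second = run_etl("tests/fixtures/transactions.csv")
    assert first.valid_rows == second.valid_rows
    assert first.quarantined_rows == second.quarantined_rows


def test_invalid_rows_are_quarantined_never_dropped():
    result = run_etl("tests/fixtures/transactions_with_bad_rows.csv")
    # Every input row must land somewhere: valid output or quarantine.
    assert result.valid_rows + result.quarantined_rows == result.input_rows
```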

Failure Handling

Critical Rule: Failed runs do not update _LATEST.json or current/ prefix.

Failure Scenarios & Behavior:

  1. ETL Job Failure (No Output):

    • Job exits with non-zero status.
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json remains unchanged (points to previous successful run).
    • current/ prefix remains unchanged (stable for SQL queries).
    • Action: Alert triggers, platform team investigates, job can be rerun safely.
  2. Partial Write (Job Crashes Mid-Execution):

    • Some data files may be written to run_id={...}/ path.
    • _SUCCESS marker is missing (incomplete run).
    • Consumers ignore the run (only read runs with _SUCCESS).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team can safely rerun (new run_id), or clean up partial files.
  3. Validation Failure (Data Quality Issues):

    • Job completes but validation checks fail (e.g., quarantine rate > threshold).
    • _SUCCESS marker may or may not be written, depending on whether the failing check runs before or after the marker is written.
    • If validation fails before _SUCCESS, run is treated as failed.
    • _LATEST.json and current/ remain unchanged.
    • Action: Data quality team reviews quarantine, fixes source data, reruns.
  4. Circuit Breaker Triggered:

    • Pipeline halts automatically when more than 100 occurrences of the same error type are recorded within one hour (a sketch of this logic follows the list below).
    • Job exits with RuntimeError (non-zero status).
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team investigates root cause, fixes systemic issue, manually restarts pipeline.
  5. Schema Validation Failure:

    • Schema drift detected (unexpected columns, type mismatches).
    • Job fails fast before writing any output.
    • _LATEST.json and current/ remain unchanged.
    • Action: Schema registry updated, job rerun with new schema version.
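
As referenced in scenario 4, a minimal sketch of the circuit-breaker rule, assuming an in-memory, in-process error log; the real pipeline may track this differently:

```python
# Circuit breaker sketch: halt on >100 same-type errors within one hour.
import time
from collections import defaultdict

THRESHOLD = 100          # same-error occurrences
WINDOW_SECONDS = 3600    # one hour

_error_times = defaultdict(list)  # error_type -> timestamps of occurrences


def record_error(error_type: str) -> None:
    """Record one error; halt if the same error type exceeds the threshold
    within the sliding one-hour window."""
    now = time.time()
    _error_times[error_type].append(now)
    # Drop occurrences that fell out of the window.
    _error_times[error_type] = [t for t in _error_times[error_type]
                                if now - t <= WINDOW_SECONDS]
    if len(_error_times[error_type]) > THRESHOLD:
        # Non-zero exit: no _SUCCESS marker, _LATEST.json/current/ untouched.
        raise RuntimeError(f"Circuit breaker: >{THRESHOLD} '{error_type}' errors in 1h")
```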

Safe Rerun Behavior:

  • Each rerun uses a new run_id (timestamp-based).
  • Previous failed runs remain in storage (audit trail).
  • Only successful runs (with _SUCCESS) are considered for promotion.
  • Promotion to _LATEST.json and current/ is a separate, explicit step after validation.
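
A sketch of consumer-side run resolution under these rules, with placeholder bucket and prefix names:

```python
# Consumers only consider runs that carry a _SUCCESS marker.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-processed"                  # hypothetical bucket
PREFIX = "silver/transactions/schema_v=v1/"   # hypothetical prefix


def latest_successful_run() -> str | None:
    """Return the newest run_id prefix that contains a _SUCCESS marker."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/")
    run_prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    # run_id is timestamp-based, so lexicographic sort == chronological sort.
    for run in sorted(run_prefixes, reverse=True):
        marker = s3.list_objects_v2(Bucket=BUCKET, Prefix=run + "_SUCCESS", MaxKeys=1)
        if marker.get("KeyCount", 0) > 0:
            return run  # newest complete run; partial runs are skipped
    return None
```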

Promotion Workflow

⚠️ IMPORTANT: For financial data compliance, human approval is required before promoting Silver layer data to production. See HUMAN_VALIDATION_POLICY.md for details.


  1. ETL writes to: silver/.../schema_v=v1/run_id=20260121T120000Z/... (isolated run_id path)
  2. ETL writes _SUCCESS marker with metrics
  3. CloudWatch alarm triggers: "New Run Available for Review" (P2 - 4 hours)
  4. Human Review Required:
    • Domain Analyst reviews quality metrics (quarantine rate, volume, schema)
    • Business validation (sample data review)
    • Technical validation (Platform Team)
  5. If approved: Update Glue Catalog → promote to production consumption (see the promotion sketch below)
  6. If rejected: Leave run in isolated path, notify Platform Team to investigate
  7. Audit: All approvals logged with timestamp, approver, metrics, reason

Note: _LATEST.json and the current/ prefix apply to the Gold layer (its structure, governance, and ownership model are described in the Task 2 architecture design), not to the Silver layer (Task 1).
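
For illustration, the approved-promotion step (step 5 above) could repoint the Glue Catalog table at the approved run, as in this hedged sketch; database, table, and path names are placeholders:

```python
# Explicit promotion after human approval: repoint the catalog table.
import boto3

glue = boto3.client("glue")
DATABASE = "silver"        # hypothetical catalog database
TABLE = "transactions"     # hypothetical table name


def promote(run_path: str, approver: str) -> None:
    """Repoint the catalog table at the approved run; record the approver."""
    current = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    # TableInput accepts only a subset of the fields get_table returns.
    table_input = {
        "Name": current["Name"],
        "StorageDescriptor": current["StorageDescriptor"],
        "PartitionKeys": current.get("PartitionKeys", []),
        "TableType": current.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": {**current.get("Parameters", {}), "approved_by": approver},
    }
    table_input["StorageDescriptor"]["Location"] = run_path
    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```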


2. Infrastructure as Code (Terraform)

We use Terraform to manage the Data Lake lifecycle.

Key Resources

  • S3 Buckets: raw, processed, quarantine, code-artifacts.
    • Policy: Block public access, enable versioning (critical for recovery).
  • IAM Roles: Least-privilege role for the ETL job (Read Raw, Write Processed/Quarantine).
    • Prefix-Scoped Permissions: IAM policies are scoped to S3 prefixes for fine-grained access control (see the policy sketch after this list):
      • ETL Job Role: s3://bucket/bronze/* (read), s3://bucket/silver/* (write), s3://bucket/quarantine/* (write)
      • Platform Team: s3://bucket/bronze/*, s3://bucket/silver/*, s3://bucket/quarantine/* (read/write)
      • Domain Teams: s3://bucket/silver/{domain}/* (write), s3://bucket/gold/{domain}/* (read)
      • Business/Analysts: s3://bucket/gold/* (read-only via Athena)
      • Compliance: s3://bucket/bronze/*, s3://bucket/quarantine/* (read-only for audit)
  • AWS Glue Job: Defines the Python Shell or Spark job, injected with the S3 path to the script.
  • AWS Step Functions: ✅ Implemented - Orchestrates scheduled ETL runs with automatic retry and error handling
  • EventBridge: ✅ Implemented - Schedules daily ETL runs (configurable cron expression)
  • AWS Lambda: ❌ Not Implemented - Step Functions provides better orchestration; Lambda not needed
  • Monitoring: CloudWatch Alarms.
    • Alarm: QuarantineRows > 0 (investigate data quality issues).
    • Alarm: JobFailure.
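
A hedged sketch of the prefix-scoped policy for the ETL job role; the bucket ARN is a placeholder, and the real policies are defined in Terraform:

```python
# Illustrative prefix-scoped IAM policy: read Bronze, write Silver/quarantine.
import json

BUCKET = "arn:aws:s3:::example-data-lake"  # hypothetical bucket ARN

etl_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read-only on the Bronze prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"{BUCKET}/bronze/*"],
        },
        {   # Write-only on the Silver and quarantine prefixes
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"{BUCKET}/silver/*", f"{BUCKET}/quarantine/*"],
        },
        {   # Listing is bucket-level, so scope it with a prefix condition
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [BUCKET],
            "Condition": {"StringLike": {"s3:prefix": ["bronze/*", "silver/*", "quarantine/*"]}},
        },
    ],
}

print(json.dumps(etl_job_policy, indent=2))
```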

Step Functions Orchestration

What Events Are Scheduled:

  • Daily ETL Runs: Process new transaction CSV files from Bronze layer
    • Default schedule: Daily at 2 AM UTC (cron(0 2 * * ? *))
    • Alternative: Monthly on 1st day (cron(0 2 1 * ? *))
    • Configurable via EventBridge rule

What Step Functions Does:

  1. RunETL State:

    • Invokes AWS Glue Spark job synchronously
    • Automatically retries on transient failures (Glue throttling, service exceptions)
    • At most 3 retry attempts with exponential backoff (Step Functions retry logic)
  2. ValidateOutput State:

    • Checks for _SUCCESS marker in Silver layer output
    • Verifies ETL job completed successfully
    • Retries if marker not found (handles eventual consistency)
  3. Error Handling:

    • Catches all failures and transitions to HandleFailure state
    • Publishes failure metrics to CloudWatch
    • Logs execution details for debugging
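
One possible shape of this state machine, expressed in Amazon States Language as a Python dict; job name, bucket, and key are placeholders, and the real definition lives in Terraform:

```python
# Sketch of the ASL definition: RunETL -> ValidateOutput, with HandleFailure.
import json

definition = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            # .sync = start the Glue job and wait for it to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "etl-ingest-transactions"},
            "Retry": [{
                "ErrorEquals": ["Glue.AWSGlueException", "Glue.ConcurrentRunsExceededException"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,      # at most 3 retries
                "BackoffRate": 2.0,    # exponential backoff
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "ValidateOutput",
        },
        "ValidateOutput": {
            "Type": "Task",
            # AWS SDK integration: HeadObject fails while _SUCCESS is absent
            "Resource": "arn:aws:states:::aws-sdk:s3:headObject",
            "Parameters": {"Bucket": "example-processed", "Key": "silver/.../_SUCCESS"},
            # Retry while the marker is not yet visible (eventual consistency)
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 60, "MaxAttempts": 5}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Error": "ETLFailed", "Cause": "See CloudWatch logs"},
    },
}

print(json.dumps(definition, indent=2))
```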

Benefits:

  • Automatic Retry: Handles transient Glue failures automatically
  • Visual Monitoring: Step Functions console shows execution flow
  • Error Recovery: Built-in error handling and state management
  • Cost: ~$0.01/month for daily runs (negligible)

EventBridge Integration:

  • EventBridge rule triggers Step Functions state machine on schedule
  • No Lambda needed - direct EventBridge → Step Functions integration
  • Configurable cron expression for different schedules
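
A sketch of the schedule wiring via boto3, purely for illustration, since Terraform manages these resources in practice; ARNs are placeholders:

```python
# EventBridge rule -> Step Functions target, no Lambda in between.
import boto3

events = boto3.client("events")

# Daily at 2 AM UTC, matching the default schedule above.
events.put_rule(
    Name="daily-etl-schedule",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-etl-schedule",
    Targets=[{
        "Id": "etl-state-machine",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:etl",      # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",   # placeholder
    }],
)
```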

3. Deployment Artifacts

List of files required to deploy this solution:

| Artifact | Description |
| --- | --- |
| tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py | The main ETL logic. |
| tasks/01_data_ingestion_transformation/requirements.txt | Python dependencies (pandas, pyarrow, boto3). |
| tasks/04_devops_cicd/infra/terraform/main.tf | Infrastructure definition. |
| tasks/04_devops_cicd/.github/workflows/ci.yml | Automation pipeline definition. |
| tasks/01_data_ingestion_transformation/config.yaml | Runtime config template (bucket names, prefixes, allow-lists). The ETL currently uses CLI args/env vars; this file is a template for config-driven runs. |

4. Operational Monitoring

To ensure reliability, we emit structured logs (JSON) that CloudWatch Insights can query:

Volume Metrics:

  • run_id: Trace ID for the execution.
  • input_rows: Total rows read from Bronze layer.
  • valid_rows_count: Rows successfully validated and written to Silver layer.
  • quarantined_rows_count: Rows quarantined due to validation failures.
  • condemned_rows_count: Rows auto-condemned (duplicates, or max attempts exceeded: up to 3 attempts allowed, condemned after the 3rd failure). Human review and approval are required before reprocessing.

Quality Metrics:

  • quarantine_rate: Percentage of rows quarantined.
  • validation_failure_rate: Percentage of input rows failing validation checks.
  • error_type_distribution: Breakdown by error type (SCHEMA_ERROR, NULL_VALUE_ERROR, etc.).

Loop Prevention Metrics:

  • avg_attempt_count: Average attempt_count across all processed rows.
  • duplicate_detection_rate: Percentage of rows flagged as exact duplicates.
  • auto_condemnation_rate: Percentage of rows auto-condemned.
  • circuit_breaker_triggers: Count of circuit breaker activations.

Performance Metrics:

  • rows_processed_per_run: Throughput metric.
  • duration_seconds: ETL execution time.
  • missing_partitions: Count of expected partitions not present (completeness metric).
  • runtime_anomalies: Flags for unusually long execution times (performance metric).
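
A minimal sketch of how the ETL could emit these metrics as one JSON log line per run, so CloudWatch Insights can query the fields directly; the values shown are illustrative:

```python
# Structured JSON logging: one object per line, queryable in CloudWatch Insights.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("etl")


def emit_run_metrics(run_id: str, metrics: dict) -> None:
    # Each key becomes a queryable field in CloudWatch Insights.
    logger.info(json.dumps({"event": "run_metrics", "run_id": run_id, **metrics}))


emit_run_metrics(
    "20260121T120000Z",  # illustrative run_id
    {
        "input_rows": 10_000,
        "valid_rows_count": 9_890,
        "quarantined_rows_count": 100,
        "condemned_rows_count": 10,
        "quarantine_rate": 0.01,
        "duration_seconds": 312,
    },
)
```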

Monitoring Alerts

  • Infrastructure Alerts (Data Platform Team):
    • Job failure (non-zero exit, missing _SUCCESS).
    • Missing partitions (expected partitions not present).
    • Runtime anomalies (unusually long execution time).
    • Circuit breaker triggered (>100 same errors/hour → automatic pipeline halt, P1 - Immediate).
  • Data Quality Alerts (Data Quality Team):
    • Quarantine rate spike (quarantined_rows / input_rows > 1%, P2 - 4 hours).
    • Validation failure (schema drift, type mismatches, P2 - 2 hours).
    • High attempt_count (avg > 1.5, P2 - 4 hours).
    • Auto-condemnation spike (rate > 0.5%, P2 - 4 hours).
  • Business Metric Alerts (Domain Teams / Business):
    • Volume anomaly (too few or too many rows vs baseline, P3 - 8 hours).
    • SLA breach (data freshness, availability, P1 - 1 hour).
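
For illustration, one of these alarms (the quarantine-rate spike, P2) might be defined as below; the metric namespace and SNS topic are placeholders, and in practice the alarms are provisioned by Terraform:

```python
# One CloudWatch alarm: quarantine_rate > 1% over an hourly window.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="etl-quarantine-rate-spike",
    Namespace="ETL/Quality",       # hypothetical custom metric namespace
    MetricName="quarantine_rate",
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.01,                # quarantined_rows / input_rows > 1%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-quality-alerts"],  # placeholder
)
```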

5. Ownership & Governance (Task 4)

5.1 Ownership Matrix

| Aspect | Owner | Steward | Execution | Responsibility |
| --- | --- | --- | --- | --- |
| Pipeline Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | CI/CD pipeline reliability, deployment automation, AWS Step Functions/EventBridge job scheduling, run isolation mechanisms |
| CI/CD Automation | Data Platform Team | Platform Lead | Data Platform Team | GitHub Actions workflows, Terraform IaC, artifact versioning, AWS Glue job deployment automation |
| AWS Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | S3 buckets, Glue jobs, IAM policies, CloudWatch infrastructure, resource provisioning |
| Infrastructure Monitoring | Data Platform Team | Platform Lead | Data Platform Team | Job failure detection, system health monitoring, infrastructure alerting, run completeness checks |
| Validation Rules | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation logic definition, business rule specification, quality threshold configuration |
| Data Quality Monitoring | Data Quality Team | Data Quality Lead | Data Quality Team | Quarantine review, quality metric interpretation, quality threshold monitoring, source data issue triage |
| Quarantine Resolution | Data Quality Team / Domain Teams | Data Quality Lead | Data Quality Team / Domain Teams | Invalid row investigation, source data correction, quarantine workflow management |
| Silver Layer Schema | Domain Teams | Domain Analyst | Data Platform Team | Schema change approval, schema version management, validation logic evolution |
| Gold Layer Schema | Business (Finance) | Finance Controller | Data Platform Team | Business contract definition, reporting schema approval, stakeholder communication |
| Schema Implementation | Data Platform Team | Platform Lead | Data Platform Team | Technical implementation of approved schema changes, schema versioning, backward compatibility |
| Dataset Ownership Metadata | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Business context, dataset purpose, consumer requirements, ownership assignment |
| Run Lineage & Audit Logs | Data Platform Team | Platform Lead | Data Platform Team | Technical lineage tracking, infrastructure audit logs, run metadata capture |
| Backfill Execution | Data Platform Team | Platform Lead | Data Platform Team | Technical reprocessing, partition-level backfills, run isolation for backfills |
| Backfill Approval | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | N/A | Business approval for reprocessing, data correction requests, historical data updates |
| Promotion to Current | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation approval, promotion decision, _LATEST.json and current/ updates |

5.2 Contact Information

| Role | Contact Method | Primary Contact | Escalation Contact |
| --- | --- | --- | --- |
| Platform Lead | Email: [email protected]; Slack: #data-platform | Platform Lead | Infrastructure Manager |
| Domain Analyst | Email: [email protected]; Slack: #data-domain | Domain Analyst | Domain Manager |
| Finance Controller | Email: [email protected]; Slack: #finance-data | Finance Controller | Finance Director |
| Data Quality Lead | Email: [email protected]; Slack: #data-quality | Data Quality Lead | Data Quality Manager |
| On-Call Engineer | PagerDuty: data-platform-oncall; Slack: @oncall-data-platform | On-Call Rotation | Platform Lead |

Note: Replace placeholder email addresses and Slack channels with actual organizational contacts. Contact information should be maintained in a centralized directory (e.g., company wiki, directory service) and kept up-to-date.

5.3 Alert Ownership & Escalation

| Alert Type | Primary Owner | Escalation Path | Response Time |
| --- | --- | --- | --- |
| Job Failure (non-zero exit, missing _SUCCESS) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Infrastructure Errors (S3 access failures, Glue job crashes) | Data Platform Team | Platform Lead → Infrastructure Team | Immediate (P1) |
| Circuit Breaker Triggered (>100 same errors/hour, pipeline halted) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Quarantine Rate Spike (quarantined_rows / input_rows > 1%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Validation Failure (schema drift, type mismatches) | Data Quality Team | Data Quality Lead → Domain Teams → Platform Team | 2 hours (P2) |
| High Attempt Count (avg attempt_count > 1.5) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Auto-Condemnation Spike (auto-condemnation rate > 0.5%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Volume Anomaly (too few/many rows vs baseline) | Domain Teams | Domain Analyst → Data Quality Team | 8 hours (P3) |
| Missing Partitions (expected partitions not present) | Data Platform Team | Platform Lead → Domain Teams | 4 hours (P2) |
| Runtime Anomalies (unusually long execution time) | Data Platform Team | Platform Lead → Infrastructure Team | 4 hours (P2) |
| SLA Breach (data freshness, availability) | Domain Teams / Business | Domain Analyst / Finance Controller → Platform Team | 1 hour (P1) |

5.4 Operational Responsibilities Matrix

| Responsibility Type | Bronze Layer | Silver Layer | Gold Layer |
| --- | --- | --- | --- |
| Ownership (who decides) | Data Platform Team | Domain Teams | Business (Finance) |
| Stewardship (who maintains) | Ingestion Lead | Domain Analyst | Finance Controller |
| Execution (who implements) | Data Platform Team | Data Platform Team | Data Platform Team |
| Consumption (who uses) | Platform engineers, audit | Analysts, data scientists | Finance, BI, stakeholders |
| Change Approval | Platform Lead | Domain Analyst + Platform review | Finance Controller + Platform review |
| Quality Monitoring | Platform Team (ingestion reliability) | Data Quality Team (validation rules) | Business (reporting accuracy) |

5.5 Governance Workflows

Schema Change Workflow

Domain Teams (Silver) or Business/Finance (Gold) approve the proposed change; the Platform Team implements it under a new schema_v version, with quality validation and backfill as needed.

Data Quality Issue Resolution Workflow

The Data Quality Team triages quarantine alerts and quality metrics, coordinates source data correction with the Domain Teams, and approves reprocessing of the affected rows.

Backfill Approval Workflow

Domain Teams (Silver) or Business (Gold) approve the reprocessing request; the Platform Team executes the backfill to a new run_id path, and promotion follows explicit validation and approval.

5.6 Governance Rules

Infrastructure & Platform Rules

  • Platform team owns all pipeline reliability, deployment automation, and infrastructure provisioning.
  • All infrastructure changes must go through Terraform IaC and CI/CD pipeline.
  • Failed runs never update _LATEST.json or current/ (explicit promotion only).
  • Run isolation via run_id is mandatory for all ETL executions.
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).
  • Human approval required before deleting condemned data (see HUMAN_VALIDATION_POLICY.md).

Data Quality Rules

  • Data Quality Team owns quarantine review and quality metric interpretation.
  • Quarantine rate thresholds are configurable per dataset (default: 1%).
  • All invalid rows must be preserved in quarantine (never silently dropped).
  • Quality alerts require Data Quality Team triage before escalation.

Schema Governance Rules

  • Silver layer schema changes require Domain Team approval + Platform Team implementation.
  • Gold layer schema changes require Business/Finance approval + Platform Team implementation.
  • All schema changes must be versioned via schema_v for backward compatibility.
  • Schema changes require quality validation and backfill if needed.

Backfill & Reprocessing Rules

  • Technical execution of backfills is owned by Platform Team.
  • Business approval for backfills is required from Domain Teams (Silver) or Business (Gold).
  • All backfills write to new run_id paths (no overwrites).
  • Promotion to current/ requires explicit validation and approval (Gold layer only; see the Task 2 architecture design for Gold layer structure, governance, and ownership).
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).

Monitoring & Alerting Rules

  • Infrastructure alerts (P1) route to Platform Team for immediate response.
  • Data quality alerts (P2) route to Data Quality Team for investigation.
  • Business metric alerts (P3) route to Domain Teams for review.
  • Alert ownership is clearly defined with escalation paths documented.

Alignment with Task 2 Architecture

  • Bronze layer ownership aligns with Task 2: Platform Team (immutability, ingestion).
  • Silver layer ownership aligns with Task 2: Domain Teams (validation, schema).
  • Gold layer ownership aligns with Task 2: Business/Finance (contracts, reporting); the Gold layer structure, governance, and ownership model are described in Task 2.
  • Governance workflows are consistent across Task 2 and Task 4 documentation.
