
Task 4: CI/CD Workflow & Infrastructure

⚠️ Scope Note: This design demonstrates DevOps best practices but would require refinement based on actual Ohpen operations, tooling, and team maturity.


Task Deliverables

This document addresses the two required deliverables for Task 4:

✅ Deliverable 1: CI/CD Workflow Description

Location: Section 1: Pipeline Design (GitHub Actions) below

Content:

  • Complete workflow description with stages (Validation, Artifact Build, Deployment)
  • Workflow diagram (Mermaid flowchart)
  • Backfill safety checks
  • Failure handling scenarios
  • Promotion workflow

✅ Deliverable 2: List of Necessary Artifacts

Location: Section 3: Deployment Artifacts below

Content: Complete list of all files required to deploy the solution (see table in Section 3).


1. Pipeline Design (GitHub Actions)

"History-Safe" CI/CD process supporting backfills and reprocessing with versioning and safe rollouts.

Workflow Stages

  1. Validation (CI): PR-triggered linting (ruff) and unit tests (pytest)
  2. Artifact Build: Package ETL code, tag with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip); build step sketched after this list
  3. Deployment (CD): Upload to S3, Terraform plan/apply, update Glue Job
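
A minimal sketch of the artifact-build step, assuming the source path, version string, and zip packaging shown here (the actual build script may differ):

```python
import subprocess
import zipfile
from pathlib import Path

def build_artifact(src_dir: str = "tasks/01_data_ingestion_transformation/src",
                   version: str = "v1.0.0") -> Path:
    # Short Git SHA of the commit under build (the CI checkout provides it).
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    # Tag the archive with the SHA, matching the etl-v1.0.0-a1b2c3d.zip scheme.
    artifact = Path(f"etl-{version}-{sha}.zip")
    with zipfile.ZipFile(artifact, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path(src_dir).rglob("*.py"):
            # Store entries relative to the source root so imports resolve.
            zf.write(path, path.relative_to(src_dir))
    return artifact
```

The CD stage would then upload this archive to the code-artifacts bucket and point the Glue Job at the new key.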

Key Safety Features

  • Determinism: Same input → same output
  • Partitioning: Correct year=YYYY/month=MM mapping (routing sketched after this list)
  • Quarantine: Invalid rows preserved (never dropped)
  • Failure Handling: Failed runs never update _LATEST.json or current/ prefix
  • Human Approval: Required before promoting Silver layer data to production
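
A minimal sketch of the determinism, partitioning, and quarantine rules above; the row shape and the field name transaction_date are assumptions for illustration:

```python
from datetime import date

def partition_prefix(txn_date: date) -> str:
    # Deterministic: the same transaction date always maps to the same
    # year=YYYY/month=MM prefix, so reruns reproduce identical layouts.
    return f"year={txn_date.year:04d}/month={txn_date.month:02d}"

def route_row(row: dict) -> str:
    # Invalid rows are preserved under quarantine/, never dropped.
    try:
        txn_date = date.fromisoformat(row["transaction_date"])
    except (KeyError, ValueError):
        return "quarantine/"
    return f"silver/{partition_prefix(txn_date)}/"
```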

See Appendix A: Failure Scenarios for detailed failure handling.


2. Infrastructure as Code (Terraform)

Key Resources

  • S3 Buckets: raw, processed, quarantine, code-artifacts (versioning enabled, public access blocked)
  • IAM Roles: Least-privilege, prefix-scoped permissions (Bronze/Silver/Gold/Quarantine)
  • AWS Glue Job: Python Shell/Spark job with S3 script path
  • Step Functions: Orchestrates ETL runs with automatic retry (≤3 attempts, exponential backoff)
  • EventBridge: Schedules daily ETL runs (default: 2 AM UTC, configurable cron; illustrated after this list)
  • CloudWatch: Alarms for job failures and quarantine spikes
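
For illustration only, the EventBridge schedule semantics look like the call below; per the governance rules, the real rule is defined in Terraform, and the rule name is a placeholder:

```python
import boto3

events = boto3.client("events")

# Illustrative only: infrastructure changes go through Terraform IaC;
# this call just shows the semantics of the daily trigger.
events.put_rule(
    Name="daily-etl-schedule",               # placeholder name
    ScheduleExpression="cron(0 2 * * ? *)",  # 02:00 UTC daily (the default)
    State="ENABLED",
)
```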

See Appendix B: Infrastructure Details for detailed orchestration and permissions.


3. Deployment Artifacts

| Artifact | Description |
| --- | --- |
| tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py | Main ETL logic |
| tasks/01_data_ingestion_transformation/requirements.txt | Python dependencies |
| tasks/04_devops_cicd/infra/terraform/main.tf | Infrastructure definition |
| tasks/04_devops_cicd/.github/workflows/ci.yml | CI/CD pipeline definition |
| tasks/01_data_ingestion_transformation/config.yaml | Runtime config template |

4. Operational Monitoring

Key Metrics

  • Volume: input_rows, valid_rows_count, quarantined_rows_count, condemned_rows_count
  • Quality: quarantine_rate, validation_failure_rate, error_type_distribution
  • Performance: duration_seconds, rows_processed_per_run, missing_partitions (metric emission is sketched after this list)
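
A sketch of how a run might publish a subset of these metrics to CloudWatch; the ETL/Pipeline namespace is an assumption:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_run_metrics(input_rows: int, quarantined: int, duration_s: float) -> None:
    # Derived quality metric: share of input rows routed to quarantine.
    quarantine_pct = 100.0 * quarantined / input_rows if input_rows else 0.0
    cloudwatch.put_metric_data(
        Namespace="ETL/Pipeline",  # namespace is an assumption for the sketch
        MetricData=[
            {"MetricName": "input_rows", "Value": input_rows, "Unit": "Count"},
            {"MetricName": "quarantined_rows_count", "Value": quarantined, "Unit": "Count"},
            {"MetricName": "quarantine_rate", "Value": quarantine_pct, "Unit": "Percent"},
            {"MetricName": "duration_seconds", "Value": duration_s, "Unit": "Seconds"},
        ],
    )
```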

Alert Categories

  • Infrastructure (P1): Job failures, circuit breaker triggers, runtime anomalies
  • Data Quality (P2): Quarantine rate spikes (>1%), validation failures, high attempt counts (alarm sketched after this list)
  • Business (P3): Volume anomalies, SLA breaches
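
The P2 quarantine-spike alert could map onto a CloudWatch alarm like the following; the alarm name, namespace, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="etl-quarantine-rate-high",  # placeholder
    Namespace="ETL/Pipeline",
    MetricName="quarantine_rate",
    Statistic="Average",
    Period=3600,                # evaluate over one hour
    EvaluationPeriods=1,
    Threshold=1.0,              # percent; matches the 1% default threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-quality-alerts"],  # placeholder ARN
)
```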

See Appendix C: Monitoring Details for complete metrics and alert ownership.


5. Ownership & Governance

Core Ownership

  • Pipeline Infrastructure: Data Platform Team (CI/CD, Step Functions, EventBridge)
  • AWS Infrastructure: Data Platform Team (S3, Glue, IAM, CloudWatch)
  • Validation Rules: Domain Teams (Silver) / Business (Gold)
  • Data Quality: Data Quality Team (quarantine review, quality metrics)
  • Schema Changes: Domain Teams (Silver) / Business (Gold) approve; Platform Team implements

See Appendix D: Governance Details for complete ownership matrices, workflows, and rules.




Appendix A: Failure Scenarios

Critical Rule: Failed runs never update _LATEST.json or current/ prefix.

Failure Types:

  1. ETL Job Failure: Non-zero exit, no _SUCCESS, no data written → Alert triggers, safe rerun
  2. Partial Write: Job crashes mid-execution → Partial files ignored, new run_id on rerun
  3. Validation Failure: Quarantine rate > threshold → Data Quality Team reviews, fixes source, reruns
  4. Circuit Breaker: >100 same errors/hour → Pipeline halts, Platform Team investigates (sketched after this list)
  5. Schema Validation: Schema drift detected → Fail fast, update schema registry, rerun
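
The circuit breaker in failure type 4 could be implemented as a rolling per-error counter, as in this sketch; the window and threshold mirror the >100 same errors/hour rule:

```python
import time
from collections import Counter

WINDOW_SECONDS = 3600   # rolling one-hour window
MAX_SAME_ERRORS = 100   # threshold from failure type 4

class CircuitBreaker:
    def __init__(self) -> None:
        self._events: list[tuple[float, str]] = []

    def record(self, error_type: str) -> None:
        now = time.time()
        self._events.append((now, error_type))
        # Drop errors that have aged out of the window.
        self._events = [(t, e) for t, e in self._events if now - t < WINDOW_SECONDS]
        if Counter(e for _, e in self._events)[error_type] > MAX_SAME_ERRORS:
            # Fail fast; the orchestrator catches this and pages the Platform Team.
            raise RuntimeError(f"Circuit breaker tripped for error type: {error_type}")
```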

Safe Rerun: Each rerun uses a new run_id; failed runs are preserved for audit; only successful runs are promoted.

Promotion Workflow: ETL writes to isolated run_id path → _SUCCESS marker → CloudWatch alarm → Human review (Domain Analyst + Platform Team) → Approval → Promote to production.
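
A sketch of the rerun-isolation and promotion guard above; the processed bucket name and prefix layout are assumptions, and human approval is represented by the caller only invoking promote after sign-off:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "processed"  # bucket name is an assumption for the sketch

def new_run_prefix() -> str:
    # Every (re)run writes under its own run_id; failed runs stay in place
    # for audit and never touch _LATEST.json or current/.
    return f"silver/runs/run_id={uuid.uuid4().hex}/"

def promote(run_prefix: str) -> None:
    # Called only after human review/approval. Raises if the _SUCCESS
    # marker is absent, so failed or partial runs cannot be promoted.
    s3.head_object(Bucket=BUCKET, Key=f"{run_prefix}_SUCCESS")
    s3.put_object(
        Bucket=BUCKET,
        Key="silver/_LATEST.json",
        Body=json.dumps({"run_prefix": run_prefix}).encode(),
    )
```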


Appendix B: Infrastructure Details

Step Functions Orchestration:

  • RunETL State: Invokes Glue job synchronously, auto-retries (≤3 attempts, exponential backoff); see the ASL fragment after this list
  • ValidateOutput State: Checks _SUCCESS marker, retries on eventual consistency
  • Error Handling: Catches failures, publishes CloudWatch metrics, logs execution details
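
In Amazon States Language, the RunETL state might look like the fragment below (rendered as a Python dict); the job and state names are placeholders and the real definition lives in Terraform:

```python
run_etl_state = {
    "RunETL": {
        "Type": "Task",
        # .sync makes the state wait for the Glue run to finish (synchronous).
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "etl-ingest-transactions"},  # placeholder
        "Retry": [{
            "ErrorEquals": ["States.ALL"],
            "MaxAttempts": 3,       # ≤3 retry attempts
            "IntervalSeconds": 60,
            "BackoffRate": 2.0,     # exponential backoff
        }],
        # On exhaustion, hand off to a failure state that publishes metrics.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": "ValidateOutput",
    },
}
```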

IAM Prefix-Scoped Permissions:

  • ETL Job: bronze/* (read), silver/* (write), quarantine/* (write); see the policy sketch after this list
  • Platform Team: bronze/*, silver/*, quarantine/* (read/write)
  • Domain Teams: silver/{domain}/* (write), gold/{domain}/* (read)
  • Business/Analysts: gold/* (read-only via Athena)
  • Compliance: bronze/*, quarantine/* (read-only for audit)
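
A sketch of the prefix-scoped policy for the ETL job role, expressed as a Python dict; the data-lake bucket name is a placeholder:

```python
etl_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read-only on the Bronze landing prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::data-lake/bronze/*"],
        },
        {   # listing restricted to the same prefix
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::data-lake"],
            "Condition": {"StringLike": {"s3:prefix": ["bronze/*"]}},
        },
        {   # write-only on Silver and Quarantine prefixes
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::data-lake/silver/*",
                "arn:aws:s3:::data-lake/quarantine/*",
            ],
        },
    ],
}
```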

Appendix C: Monitoring Details

Volume Metrics: run_id, input_rows, valid_rows_count, quarantined_rows_count, condemned_rows_count

Quality Metrics: quarantine_rate, validation_failure_rate, error_type_distribution

Loop Prevention: avg_attempt_count, duplicate_detection_rate, auto_condemnation_rate, circuit_breaker_triggers

Performance: rows_processed_per_run, duration_seconds, missing_partitions, runtime_anomalies

Alert Ownership:

  • P1 (Immediate): Job failures, infrastructure errors, circuit breaker, SLA breaches → Data Platform Team
  • P2 (2-4 hours): Quarantine spikes, validation failures, high attempt counts → Data Quality Team
  • P3 (8 hours): Volume anomalies → Domain Teams

Appendix D: Governance Details

Ownership Matrix (abbreviated):

  • Pipeline/CI/CD/Infrastructure: Data Platform Team
  • Validation Rules: Domain Teams (Silver) / Business (Gold)
  • Data Quality: Data Quality Team
  • Schema: Domain Teams (Silver) / Business (Gold) approve; Platform implements
  • Backfill: Platform executes; Domain/Business approves

Governance Workflows:

  • Schema Change: Request → Layer-based review (Domain/Business) → Platform feasibility → Approval → Implementation → Versioning → Validation → Promotion
  • Quality Issue: Alert → Data Quality triage → Source/Validation/Platform issue → Fix → Backfill approval → Reprocess → Validate → Promote
  • Backfill: Request → Layer-based approval → Platform assessment → Schedule → Execute → Validate → Promote

Key Rules:

  • Infrastructure changes via Terraform IaC and CI/CD only
  • Failed runs never update _LATEST.json or current/
  • Run isolation via run_id mandatory
  • Human approval required for Silver promotion and condemned data deletion
  • Quarantine rate thresholds configurable per dataset (default: 1%)
  • Schema changes versioned via schema_v for backward compatibility (fail-fast check sketched below)
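
A minimal sketch of the fail-fast schema check implied by the schema_v rule; the registry format and column names are assumptions:

```python
# Hypothetical registry: schema version -> expected column set.
REGISTERED_SCHEMAS = {
    "schema_v1": {"transaction_id", "transaction_date", "amount", "currency"},
}

def check_schema(columns: set[str], schema_v: str = "schema_v1") -> None:
    # Fail fast on drift so the run never writes mismatched data.
    expected = REGISTERED_SCHEMAS[schema_v]
    if columns != expected:
        missing, extra = expected - columns, columns - expected
        raise ValueError(
            f"Schema drift vs {schema_v}: missing={sorted(missing)}, extra={sorted(extra)}"
        )
```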