
Task 4: CI/CD Workflow & Infrastructure

⚠️ Scope Note: This design demonstrates DevOps best practices but would require refinement based on actual Ohpen operations, tooling, and team maturity.


Task Deliverables

This document addresses the two required deliverables for Task 4:

✅ Deliverable 1: CI/CD Workflow Description

Location: Section 1: Pipeline Design (GitHub Actions) below

Content:

  • Complete workflow description with stages (Validation, Artifact Build, Deployment)
  • Workflow diagram (Mermaid flowchart)
  • Backfill safety checks
  • Failure handling scenarios
  • Promotion workflow

✅ Deliverable 2: List of Necessary Artifacts

Location: Section 3: Deployment Artifacts below

Content: Complete list of all files required to deploy the solution (see table in Section 3).


1. Pipeline Design (GitHub Actions)

"History-Safe" CI/CD process supporting backfills and reprocessing with versioning and safe rollouts.

Workflow Stages

  1. Validation (CI): PR-triggered linting (ruff) and unit tests (pytest)
  2. Artifact Build: Package ETL code, tag with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip); build step sketched after this list
  3. Deployment (CD): Upload to S3, Terraform plan/apply, update Glue Job
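
A minimal sketch of the artifact-build step, assuming the source path, version string, and zip packaging shown here (the actual build script may differ):

```python
import subprocess
import zipfile
from pathlib import Path

def build_artifact(src_dir: str = "tasks/01_data_ingestion_transformation/src",
                   version: str = "v1.0.0") -> Path:
    # Short Git SHA of the commit under build (the CI checkout provides it).
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    # Tag the archive with the SHA, matching the etl-v1.0.0-a1b2c3d.zip scheme.
    artifact = Path(f"etl-{version}-{sha}.zip")
    with zipfile.ZipFile(artifact, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path(src_dir).rglob("*.py"):
            # Store entries relative to the source root so imports resolve.
            zf.write(path, path.relative_to(src_dir))
    return artifact
```

The CD stage would then upload this archive to the code-artifacts bucket and point the Glue Job at the new key.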

Key Safety Features

  • Determinism: Same input → same output
  • Partitioning: Correct year=YYYY/month=MM mapping (routing sketched after this list)
  • Quarantine: Invalid rows preserved (never dropped)
  • Failure Handling: Failed runs never update _LATEST.json or current/ prefix
  • Human Approval: Required before promoting Silver layer data to production
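
A minimal sketch of the determinism, partitioning, and quarantine rules above; the row shape and the field name transaction_date are assumptions for illustration:

```python
from datetime import date

def partition_prefix(txn_date: date) -> str:
    # Deterministic: the same transaction date always maps to the same
    # year=YYYY/month=MM prefix, so reruns reproduce identical layouts.
    return f"year={txn_date.year:04d}/month={txn_date.month:02d}"

def route_row(row: dict) -> str:
    # Invalid rows are preserved under quarantine/, never dropped.
    try:
        txn_date = date.fromisoformat(row["transaction_date"])
    except (KeyError, ValueError):
        return "quarantine/"
    return f"silver/{partition_prefix(txn_date)}/"
```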

See Appendix A: Failure Scenarios for detailed failure handling.


2. Infrastructure as Code (Terraform)

Key Resources

  • S3 Buckets: raw, processed, quarantine, code-artifacts (versioning enabled, public access blocked)
  • IAM Roles: Least-privilege, prefix-scoped permissions (Bronze/Silver/Gold/Quarantine)
  • AWS Glue Job: Python Shell/Spark job with S3 script path
  • Step Functions: Orchestrates ETL runs with automatic retry (≤3 attempts, exponential backoff)
  • EventBridge: Schedules daily ETL runs (default: 2 AM UTC, configurable cron; illustrated after this list)
  • CloudWatch: Alarms for job failures and quarantine spikes
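
For illustration only, the EventBridge schedule semantics look like the call below; per the governance rules, the real rule is defined in Terraform, and the rule name is a placeholder:

```python
import boto3

events = boto3.client("events")

# Illustrative only: infrastructure changes go through Terraform IaC;
# this call just shows the semantics of the daily trigger.
events.put_rule(
    Name="daily-etl-schedule",               # placeholder name
    ScheduleExpression="cron(0 2 * * ? *)",  # 02:00 UTC daily (the default)
    State="ENABLED",
)
```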

See Appendix B: Infrastructure Details for detailed orchestration and permissions.


3. Deployment Artifacts

| Artifact | Description |
| --- | --- |
| tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py | Main ETL logic |
| tasks/01_data_ingestion_transformation/requirements.txt | Python dependencies |
| tasks/04_devops_cicd/infra/terraform/main.tf | Infrastructure definition |
| tasks/04_devops_cicd/.github/workflows/ci.yml | CI/CD pipeline definition |
| tasks/01_data_ingestion_transformation/config.yaml | Runtime config template |

4. Operational Monitoring

Key Metrics

  • Volume: input_rows, valid_rows_count, quarantined_rows_count, condemned_rows_count
  • Quality: quarantine_rate, validation_failure_rate, error_type_distribution
  • Performance: duration_seconds, rows_processed_per_run, missing_partitions (metric emission is sketched after this list)
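
A sketch of how a run might publish a subset of these metrics to CloudWatch; the ETL/Pipeline namespace is an assumption:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_run_metrics(input_rows: int, quarantined: int, duration_s: float) -> None:
    # Derived quality metric: share of input rows routed to quarantine.
    quarantine_pct = 100.0 * quarantined / input_rows if input_rows else 0.0
    cloudwatch.put_metric_data(
        Namespace="ETL/Pipeline",  # namespace is an assumption for the sketch
        MetricData=[
            {"MetricName": "input_rows", "Value": input_rows, "Unit": "Count"},
            {"MetricName": "quarantined_rows_count", "Value": quarantined, "Unit": "Count"},
            {"MetricName": "quarantine_rate", "Value": quarantine_pct, "Unit": "Percent"},
            {"MetricName": "duration_seconds", "Value": duration_s, "Unit": "Seconds"},
        ],
    )
```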

Alert Categories

  • Infrastructure (P1): Job failures, circuit breaker triggers, runtime anomalies
  • Data Quality (P2): Quarantine rate spikes (>1%), validation failures, high attempt counts (alarm sketched after this list)
  • Business (P3): Volume anomalies, SLA breaches
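
The P2 quarantine-spike alert could map onto a CloudWatch alarm like the following; the alarm name, namespace, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="etl-quarantine-rate-high",  # placeholder
    Namespace="ETL/Pipeline",
    MetricName="quarantine_rate",
    Statistic="Average",
    Period=3600,                # evaluate over one hour
    EvaluationPeriods=1,
    Threshold=1.0,              # percent; matches the 1% default threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-quality-alerts"],  # placeholder ARN
)
```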

See Appendix C: Monitoring Details for complete metrics and alert ownership.


5. Ownership & Governance

Core Ownership

  • Pipeline Infrastructure: Data Platform Team (CI/CD, Step Functions, EventBridge)
  • AWS Infrastructure: Data Platform Team (S3, Glue, IAM, CloudWatch)
  • Validation Rules: Domain Teams (Silver) / Business (Gold)
  • Data Quality: Data Quality Team (quarantine review, quality metrics)
  • Schema Changes: Domain Teams (Silver) / Business (Gold) approve; Platform Team implements

See Appendix D: Governance Details for complete ownership matrices, workflows, and rules.




Appendix A: Failure Scenarios

Critical Rule: Failed runs never update _LATEST.json or current/ prefix.

Failure Types:

  1. ETL Job Failure: Non-zero exit, no _SUCCESS, no data written → Alert triggers, safe rerun
  2. Partial Write: Job crashes mid-execution → Partial files ignored, new run_id on rerun
  3. Validation Failure: Quarantine rate > threshold → Data Quality Team reviews, fixes source, reruns
  4. Circuit Breaker: >100 same errors/hour → Pipeline halts, Platform Team investigates (sketched after this list)
  5. Schema Validation: Schema drift detected → Fail fast, update schema registry, rerun
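
The circuit breaker in failure type 4 could be implemented as a rolling per-error counter, as in this sketch; the window and threshold mirror the >100 same errors/hour rule:

```python
import time
from collections import Counter

WINDOW_SECONDS = 3600   # rolling one-hour window
MAX_SAME_ERRORS = 100   # threshold from failure type 4

class CircuitBreaker:
    def __init__(self) -> None:
        self._events: list[tuple[float, str]] = []

    def record(self, error_type: str) -> None:
        now = time.time()
        self._events.append((now, error_type))
        # Drop errors that have aged out of the window.
        self._events = [(t, e) for t, e in self._events if now - t < WINDOW_SECONDS]
        if Counter(e for _, e in self._events)[error_type] > MAX_SAME_ERRORS:
            # Fail fast; the orchestrator catches this and pages the Platform Team.
            raise RuntimeError(f"Circuit breaker tripped for error type: {error_type}")
```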

Safe Rerun: Each rerun uses a new run_id; failed runs are preserved for audit; only successful runs are promoted.

Promotion Workflow: ETL writes to isolated run_id path → _SUCCESS marker → CloudWatch alarm → Human review (Domain Analyst + Platform Team) → Approval → Promote to production.
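
A sketch of the rerun-isolation and promotion guard above; the processed bucket name and prefix layout are assumptions, and human approval is represented by the caller only invoking promote after sign-off:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "processed"  # bucket name is an assumption for the sketch

def new_run_prefix() -> str:
    # Every (re)run writes under its own run_id; failed runs stay in place
    # for audit and never touch _LATEST.json or current/.
    return f"silver/runs/run_id={uuid.uuid4().hex}/"

def promote(run_prefix: str) -> None:
    # Called only after human review/approval. Raises if the _SUCCESS
    # marker is absent, so failed or partial runs cannot be promoted.
    s3.head_object(Bucket=BUCKET, Key=f"{run_prefix}_SUCCESS")
    s3.put_object(
        Bucket=BUCKET,
        Key="silver/_LATEST.json",
        Body=json.dumps({"run_prefix": run_prefix}).encode(),
    )
```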


Appendix B: Infrastructure Details

Step Functions Orchestration:

  • RunETL State: Invokes Glue job synchronously, auto-retries (≤3 attempts, exponential backoff); see the ASL fragment after this list
  • ValidateOutput State: Checks _SUCCESS marker, retries on eventual consistency
  • Error Handling: Catches failures, publishes CloudWatch metrics, logs execution details
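
In Amazon States Language, the RunETL state might look like the fragment below (rendered as a Python dict); the job and state names are placeholders and the real definition lives in Terraform:

```python
run_etl_state = {
    "RunETL": {
        "Type": "Task",
        # .sync makes the state wait for the Glue run to finish (synchronous).
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "etl-ingest-transactions"},  # placeholder
        "Retry": [{
            "ErrorEquals": ["States.ALL"],
            "MaxAttempts": 3,       # ≤3 retry attempts
            "IntervalSeconds": 60,
            "BackoffRate": 2.0,     # exponential backoff
        }],
        # On exhaustion, hand off to a failure state that publishes metrics.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": "ValidateOutput",
    },
}
```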

IAM Prefix-Scoped Permissions:

  • ETL Job: bronze/* (read), silver/* (write), quarantine/* (write); see the policy sketch after this list
  • Platform Team: bronze/*, silver/*, quarantine/* (read/write)
  • Domain Teams: silver/{domain}/* (write), gold/{domain}/* (read)
  • Business/Analysts: gold/* (read-only via Athena)
  • Compliance: bronze/*, quarantine/* (read-only for audit)
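
A sketch of the prefix-scoped policy for the ETL job role, expressed as a Python dict; the data-lake bucket name is a placeholder:

```python
etl_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read-only on the Bronze landing prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::data-lake/bronze/*"],
        },
        {   # listing restricted to the same prefix
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::data-lake"],
            "Condition": {"StringLike": {"s3:prefix": ["bronze/*"]}},
        },
        {   # write-only on Silver and Quarantine prefixes
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::data-lake/silver/*",
                "arn:aws:s3:::data-lake/quarantine/*",
            ],
        },
    ],
}
```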

Appendix C: Monitoring Details

Volume Metrics: run_id, input_rows, valid_rows_count, quarantined_rows_count, condemned_rows_count

Quality Metrics: quarantine_rate, validation_failure_rate, error_type_distribution

Loop Prevention: avg_attempt_count, duplicate_detection_rate, auto_condemnation_rate, circuit_breaker_triggers

Performance: rows_processed_per_run, duration_seconds, missing_partitions, runtime_anomalies

Alert Ownership:

  • P1 (Immediate): Job failures, infrastructure errors, circuit breaker, SLA breaches → Data Platform Team
  • P2 (2-4 hours): Quarantine spikes, validation failures, high attempt counts → Data Quality Team
  • P3 (8 hours): Volume anomalies → Domain Teams

Appendix D: Governance Details

Ownership Matrix (abbreviated):

  • Pipeline/CI/CD/Infrastructure: Data Platform Team
  • Validation Rules: Domain Teams (Silver) / Business (Gold)
  • Data Quality: Data Quality Team
  • Schema: Domain Teams (Silver) / Business (Gold) approve; Platform implements
  • Backfill: Platform executes; Domain/Business approves

Governance Workflows:

  • Schema Change: Request → Layer-based review (Domain/Business) → Platform feasibility → Approval → Implementation → Versioning → Validation → Promotion
  • Quality Issue: Alert → Data Quality triage → Source/Validation/Platform issue → Fix → Backfill approval → Reprocess → Validate → Promote
  • Backfill: Request → Layer-based approval → Platform assessment → Schedule → Execute → Validate → Promote

Key Rules:

  • Infrastructure changes via Terraform IaC and CI/CD only
  • Failed runs never update _LATEST.json or current/
  • Run isolation via run_id mandatory
  • Human approval required for Silver promotion and condemned data deletion
  • Quarantine rate thresholds configurable per dataset (default: 1%)
  • Schema changes versioned via schema_v for backward compatibility (fail-fast check sketched below)
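
A minimal sketch of the fail-fast schema check implied by the schema_v rule; the registry format and column names are assumptions:

```python
# Hypothetical registry: schema version -> expected column set.
REGISTERED_SCHEMAS = {
    "schema_v1": {"transaction_id", "transaction_date", "amount", "currency"},
}

def check_schema(columns: set[str], schema_v: str = "schema_v1") -> None:
    # Fail fast on drift so the run never writes mismatched data.
    expected = REGISTERED_SCHEMAS[schema_v]
    if columns != expected:
        missing, extra = expected - columns, columns - expected
        raise ValueError(
            f"Schema drift vs {schema_v}: missing={sorted(missing)}, extra={sorted(extra)}"
        )
```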