Task 4: CI/CD Workflow & Infrastructure
⚠️ Scope Note: This design demonstrates DevOps best practices but would require refinement based on actual Ohpen operations, tooling, and team maturity.
Task Deliverables
This document addresses the two required deliverables for Task 4:
✅ Deliverable 1: CI/CD Workflow Description
Location: Section 1: Pipeline Design (GitHub Actions) below
Content:
- Complete workflow description with stages (Validation, Artifact Build, Deployment)
- Workflow diagram (Mermaid flowchart)
- Backfill safety checks
- Failure handling scenarios
- Promotion workflow
✅ Deliverable 2: List of Necessary Artifacts
Location: Section 3: Deployment Artifacts below
Content: Complete list of all files required to deploy the solution (see table in Section 3).
1. Pipeline Design (GitHub Actions)
"History-Safe" CI/CD process supporting backfills and reprocessing with versioning and safe rollouts.
Workflow Stages
- Validation (CI): PR-triggered linting (`ruff`) and unit tests (`pytest`)
- Artifact Build: Package ETL code, tag with Git SHA (e.g., `etl-v1.0.0-a1b2c3d.zip`); see the sketch after this list
- Deployment (CD): Upload to S3, Terraform plan/apply, update Glue Job
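A minimal sketch of the Artifact Build stage, assuming the version comes from a constant and the artifact name embeds the short Git SHA as in the example above (the source path is taken from Section 3; everything else is illustrative):

```python
# Illustrative Artifact Build step: package the ETL sources into a zip
# named with the semantic version and the short Git SHA of HEAD.
import subprocess
import zipfile
from pathlib import Path

ETL_VERSION = "1.0.0"  # assumed; e.g. read from a VERSION file or Git tag
SRC_DIR = Path("tasks/01_data_ingestion_transformation/src/etl")

def git_short_sha() -> str:
    """Short SHA of HEAD, embedded in the artifact name for traceability."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def build_artifact() -> Path:
    artifact = Path(f"etl-v{ETL_VERSION}-{git_short_sha()}.zip")
    with zipfile.ZipFile(artifact, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in SRC_DIR.rglob("*.py"):
            zf.write(py_file, py_file.relative_to(SRC_DIR))
    return artifact

if __name__ == "__main__":
    print(f"Built {build_artifact()}")
```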
Key Safety Features
- Determinism: Same input → same output
- Partitioning: Correct `year=YYYY/month=MM` mapping (sketched below)
- Quarantine: Invalid rows preserved (never dropped)
- Failure Handling: Failed runs never update `_LATEST.json` or the `current/` prefix
- Human Approval: Required before promoting Silver layer data to production
See Appendix A: Failure Scenarios for detailed failure handling.
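To make the partitioning guarantee concrete, a deterministic `year=YYYY/month=MM` mapping could look like this (the `silver/transactions/` prefix is an assumption about the layout, not taken from the actual code):

```python
# Illustrative deterministic partition mapping: the same transaction
# date always yields the same Hive-style year=YYYY/month=MM prefix.
from datetime import date

def partition_prefix(txn_date: date, layer: str = "silver") -> str:
    return f"{layer}/transactions/year={txn_date.year:04d}/month={txn_date.month:02d}/"

assert partition_prefix(date(2024, 3, 7)) == "silver/transactions/year=2024/month=03/"
```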
2. Infrastructure as Code (Terraform)
Key Resources
- S3 Buckets: `raw`, `processed`, `quarantine`, `code-artifacts` (versioning enabled, public access blocked)
- IAM Roles: Least-privilege, prefix-scoped permissions (Bronze/Silver/Gold/Quarantine)
- AWS Glue Job: Python Shell/Spark job with S3 script path
- Step Functions: Orchestrates ETL runs with automatic retry (≤3 attempts, exponential backoff)
- EventBridge: Schedules daily ETL runs (default: 2 AM UTC, configurable cron)
- CloudWatch: Alarms for job failures and quarantine spikes
See Appendix B: Infrastructure Details for detailed orchestration and permissions.
3. Deployment Artifacts
| Artifact | Description |
|---|---|
| `tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py` | Main ETL logic |
| `tasks/01_data_ingestion_transformation/requirements.txt` | Python dependencies |
| `tasks/04_devops_cicd/infra/terraform/main.tf` | Infrastructure definition |
| `tasks/04_devops_cicd/.github/workflows/ci.yml` | CI/CD pipeline definition |
| `tasks/01_data_ingestion_transformation/config.yaml` | Runtime config template |
4. Operational Monitoring
Key Metrics
- Volume: `input_rows`, `valid_rows_count`, `quarantined_rows_count`, `condemned_rows_count`
- Quality: `quarantine_rate`, `validation_failure_rate`, `error_type_distribution`
- Performance: `duration_seconds`, `rows_processed_per_run`, `missing_partitions`
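A hedged sketch of how a run might publish these metrics to CloudWatch with `boto3` (the `ETL/Transactions` namespace and `Dataset` dimension are assumptions):

```python
# Sketch only: publish per-run volume/quality/performance metrics to
# CloudWatch. Namespace and dimension names are illustrative.
import boto3

def publish_run_metrics(input_rows: int, valid_rows: int,
                        quarantined_rows: int, duration_seconds: float) -> None:
    quarantine_rate = quarantined_rows / input_rows if input_rows else 0.0
    dims = [{"Name": "Dataset", "Value": "transactions"}]
    boto3.client("cloudwatch").put_metric_data(
        Namespace="ETL/Transactions",
        MetricData=[
            {"MetricName": "input_rows", "Value": input_rows, "Unit": "Count", "Dimensions": dims},
            {"MetricName": "valid_rows_count", "Value": valid_rows, "Unit": "Count", "Dimensions": dims},
            {"MetricName": "quarantined_rows_count", "Value": quarantined_rows, "Unit": "Count", "Dimensions": dims},
            {"MetricName": "quarantine_rate", "Value": quarantine_rate, "Unit": "None", "Dimensions": dims},
            {"MetricName": "duration_seconds", "Value": duration_seconds, "Unit": "Seconds", "Dimensions": dims},
        ],
    )
```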
Alert Categories
- Infrastructure (P1): Job failures, circuit breaker triggers, runtime anomalies
- Data Quality (P2): Quarantine rate spikes (>1%), validation failures, high attempt counts
- Business (P3): Volume anomalies, SLA breaches
See Appendix C: Monitoring Details for complete metrics and alert ownership.
5. Ownership & Governance
Core Ownership
- Pipeline Infrastructure: Data Platform Team (CI/CD, Step Functions, EventBridge)
- AWS Infrastructure: Data Platform Team (S3, Glue, IAM, CloudWatch)
- Validation Rules: Domain Teams (Silver) / Business (Gold)
- Data Quality: Data Quality Team (quarantine review, quality metrics)
- Schema Changes: Domain Teams (Silver) / Business (Gold) approve; Platform Team implements
See Appendix D: Governance Details for complete ownership matrices, workflows, and rules.
Related Documentation
- CI/CD Testing - Local testing guide (Appendix H)
- Test Suite Summary - Test implementation details
- ETL Pipeline - What this CI/CD deploys
- Data Lake Architecture - Infrastructure this CI/CD provisions
Appendix A: Failure Scenarios
Critical Rule: Failed runs never update `_LATEST.json` or the `current/` prefix.
Failure Types:
- ETL Job Failure: Non-zero exit, no `_SUCCESS`, no data written → Alert triggers, safe rerun
- Partial Write: Job crashes mid-execution → Partial files ignored, new `run_id` on rerun
- Validation Failure: Quarantine rate > threshold → Data Quality Team reviews, fixes source, reruns
- Circuit Breaker: >100 identical errors/hour → Pipeline halts, Platform Team investigates
- Schema Validation: Schema drift detected → Fail fast, update schema registry, rerun
Safe Rerun: Each rerun uses a new `run_id`; failed runs are preserved for audit, and only successful runs are promoted.
Promotion Workflow: ETL writes to isolated `run_id` path → `_SUCCESS` marker → CloudWatch alarm → Human review (Domain Analyst + Platform Team) → Approval → Promote to production (see the sketch below).
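A minimal sketch of the promotion guard implied by this workflow, assuming a `processed` bucket and a `runs/{run_id}/` layout (both illustrative): `_LATEST.json` is only touched after the `_SUCCESS` marker is confirmed, so a failed run can never move the pointer.

```python
# Sketch of the promotion guard: promote a run only if its _SUCCESS
# marker exists. Bucket name and key layout are illustrative.
import json
import boto3
from botocore.exceptions import ClientError

BUCKET = "processed"  # assumed bucket name
s3 = boto3.client("s3")

def promote_run(run_id: str) -> None:
    marker = f"silver/transactions/runs/{run_id}/_SUCCESS"
    try:
        s3.head_object(Bucket=BUCKET, Key=marker)
    except ClientError:
        # Failed/partial runs never reach promotion, so _LATEST.json and
        # the current/ prefix keep pointing at the last good run.
        raise RuntimeError(f"run {run_id} has no _SUCCESS marker; not promoting")
    s3.put_object(
        Bucket=BUCKET,
        Key="silver/transactions/_LATEST.json",
        Body=json.dumps({"run_id": run_id}).encode(),
    )
```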
Appendix B: Infrastructure Details
Step Functions Orchestration:
- RunETL State: Invokes Glue job synchronously, auto-retries (≤3 attempts, exponential backoff); see the fragment after this list
- ValidateOutput State: Checks `_SUCCESS` marker, retries on eventual consistency
- Error Handling: Catches failures, publishes CloudWatch metrics, logs execution details
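For concreteness, the RunETL state's retry policy might look like this Amazon States Language fragment, written here as a Python dict (the job and state names are assumptions):

```python
# Illustrative ASL fragment for the RunETL state: synchronous Glue
# invocation with exponential-backoff retries and a failure catch.
RUN_ETL_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the job
    "Parameters": {"JobName": "transactions-etl"},  # assumed job name
    "Retry": [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 60,
        "MaxAttempts": 3,    # ≤3 retry attempts, as stated above
        "BackoffRate": 2.0,  # exponential backoff
    }],
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PublishFailureMetric"}],
    "Next": "ValidateOutput",
}
```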
IAM Prefix-Scoped Permissions:
- ETL Job: `bronze/*` (read), `silver/*` (write), `quarantine/*` (write); see the sketch below
- Platform Team: `bronze/*`, `silver/*`, `quarantine/*` (read/write)
- Domain Teams: `silver/{domain}/*` (write), `gold/{domain}/*` (read)
- Business/Analysts: `gold/*` (read-only via Athena)
- Compliance: `bronze/*`, `quarantine/*` (read-only for audit)
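An illustrative prefix-scoped policy for the ETL job role, expressed as an IAM policy document in Python (the `data-lake` bucket ARN is a placeholder):

```python
# Sketch of the ETL job's prefix-scoped permissions: read bronze/,
# write silver/ and quarantine/. Bucket ARN is a placeholder.
ETL_JOB_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::data-lake/bronze/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::data-lake/silver/*",
                "arn:aws:s3:::data-lake/quarantine/*",
            ],
        },
    ],
}
```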
Appendix C: Monitoring Details
Volume Metrics: `run_id`, `input_rows`, `valid_rows_count`, `quarantined_rows_count`, `condemned_rows_count`
Quality Metrics: `quarantine_rate`, `validation_failure_rate`, `error_type_distribution`
Loop Prevention: `avg_attempt_count`, `duplicate_detection_rate`, `auto_condemnation_rate`, `circuit_breaker_triggers`
Performance: `rows_processed_per_run`, `duration_seconds`, `missing_partitions`, `runtime_anomalies`
Alert Ownership:
- P1 (Immediate): Job failures, infrastructure errors, circuit breaker, SLA breaches → Data Platform Team
- P2 (2-4 hours): Quarantine spikes, validation failures, high attempt counts → Data Quality Team
- P3 (8 hours): Volume anomalies → Domain Teams
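A hedged sketch of this routing, using the default 1% quarantine threshold from this document (the routing table itself is illustrative):

```python
# Sketch of alert routing: map a breached metric to its priority and
# owning team. Routing table is illustrative; threshold is the default.
ALERT_ROUTING = {
    "job_failure":        ("P1", "Data Platform Team"),
    "circuit_breaker":    ("P1", "Data Platform Team"),
    "quarantine_rate":    ("P2", "Data Quality Team"),
    "validation_failure": ("P2", "Data Quality Team"),
    "volume_anomaly":     ("P3", "Domain Teams"),
}

def quarantine_breached(quarantined: int, input_rows: int,
                        threshold: float = 0.01) -> bool:
    """True when the quarantine rate exceeds the default 1% threshold."""
    return input_rows > 0 and quarantined / input_rows > threshold

if quarantine_breached(quarantined=150, input_rows=10_000):
    priority, team = ALERT_ROUTING["quarantine_rate"]
    print(f"{priority} alert → {team}")  # "P2 alert → Data Quality Team"
```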
Appendix D: Governance Details
Ownership Matrix (abbreviated):
- Pipeline/CI/CD/Infrastructure: Data Platform Team
- Validation Rules: Domain Teams (Silver) / Business (Gold)
- Data Quality: Data Quality Team
- Schema: Domain Teams (Silver) / Business (Gold) approve; Platform implements
- Backfill: Platform executes; Domain/Business approves
Governance Workflows:
- Schema Change: Request → Layer-based review (Domain/Business) → Platform feasibility → Approval → Implementation → Versioning → Validation → Promotion
- Quality Issue: Alert → Data Quality triage → Source/Validation/Platform issue → Fix → Backfill approval → Reprocess → Validate → Promote
- Backfill: Request → Layer-based approval → Platform assessment → Schedule → Execute → Validate → Promote
Key Rules:
- Infrastructure changes via Terraform IaC and CI/CD only
- Failed runs never update `_LATEST.json` or `current/`
- Run isolation via `run_id` is mandatory
- Human approval required for Silver promotion and condemned data deletion
- Quarantine rate thresholds configurable per dataset (default: 1%)
- Schema changes versioned via `schema_v` for backward compatibility (illustrated below)
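As a final illustration of the `schema_v` rule, a reader might dispatch on the record's schema version like this (the field names and the v1→v2 change are purely hypothetical):

```python
# Purely hypothetical example of schema_v-based backward compatibility:
# readers dispatch on the record's schema version and upgrade old
# shapes in place. Field names and the v1→v2 change are illustrative.
def normalize(record: dict) -> dict:
    if record.get("schema_v", 1) == 1:
        # Hypothetical v1→v2 migration: rename a field, bump the version.
        record["amount_minor_units"] = record.pop("amount")
        record["schema_v"] = 2
    return record

assert normalize({"schema_v": 1, "amount": 1250})["amount_minor_units"] == 1250
```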