CI/CD Workflow & Infrastructure
Important Note on Scope
⚠️ Scope Disclaimer: This CI/CD workflow design is based on the case study requirements and my interpretation of the problem. I have made assumptions that may underestimate the current scope of operations at Ohpen. The workflow presented here would need refinement based on:
- Actual deployment frequency and release processes
- Existing CI/CD infrastructure and tooling
- Team size, skills, and operational maturity
- Production change management and approval processes
- Real-world monitoring and alerting requirements
This design demonstrates DevOps thinking and best practices, but would require collaboration with the Ohpen team for production implementation.
1. Pipeline Design (GitHub Actions)
We implement a "History-Safe" CI/CD process. Since we support backfills and reprocessing, our deployment pipeline must support versioning and safe rollouts.
Workflow Stages
- Validation (CI): Runs on every Pull Request.
  - `ruff` linting (code style).
  - `pytest` unit tests (partition logic, null handling, quarantine checks).
- Artifact Build:
  - Packages the Python ETL code.
  - Tags the artifact with the Git SHA (e.g., `etl-v1.0.0-a1b2c3d.zip`).
- Deployment (CD):
  - Uploads the artifact to S3 (Code Bucket).
  - Terraform `plan` & `apply` to update AWS infrastructure (Glue Jobs, IAM, Buckets).
  - Updates the Glue Job to point to the new artifact.
Backfill Safety Checks
Our tests specifically cover "history safety":
- Determinism: Rerunning the same input produces the exact same counts.
- Partitioning: Timestamps strictly map to the correct `year=YYYY/month=MM` folder.
- Quarantine: Invalid rows are never silently dropped; they must appear in quarantine.
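The history-safety checks above can be expressed as small pytest-style tests. The `transform` and `partition_for` functions here are toy stand-ins, not the real entry points in `ingest_transactions.py`; they only illustrate the shape of the assertions.

```python
from datetime import datetime

def partition_for(ts: datetime) -> str:
    # Map a timestamp to its year=YYYY/month=MM partition folder.
    return f"year={ts.year:04d}/month={ts.month:02d}"

def transform(rows):
    # Toy stand-in: rows with a non-None amount are valid; others quarantined.
    valid = [r for r in rows if r.get("amount") is not None]
    quarantined = [r for r in rows if r.get("amount") is None]
    return valid, quarantined

def test_determinism():
    rows = [{"amount": 10}, {"amount": None}, {"amount": 5}]
    assert transform(rows) == transform(rows)  # same input, same output

def test_partitioning():
    assert partition_for(datetime(2026, 1, 21)) == "year=2026/month=01"

def test_quarantine_never_dropped():
    rows = [{"amount": 10}, {"amount": None}]
    valid, quarantined = transform(rows)
    assert len(valid) + len(quarantined) == len(rows)  # nothing silently lost
```

The third test encodes the invariant that matters most for backfills: row counts must be conserved across valid and quarantine outputs.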
Failure Handling
Critical Rule: Failed runs do not update `_LATEST.json` or the `current/` prefix.
Failure Scenarios & Behavior:
- ETL Job Failure (No Output):
  - Job exits with non-zero status.
  - No `_SUCCESS` marker is written.
  - No data files are written (or partial files are ignored).
  - `_LATEST.json` remains unchanged (points to the previous successful run).
  - `current/` prefix remains unchanged (stable for SQL queries).
  - Action: Alert triggers, platform team investigates, job can be rerun safely.
- Partial Write (Job Crashes Mid-Execution):
  - Some data files may be written to the `run_id={...}/` path.
  - `_SUCCESS` marker is missing (incomplete run).
  - Consumers ignore the run (only read runs with `_SUCCESS`).
  - `_LATEST.json` and `current/` remain unchanged.
  - Action: Platform team can safely rerun (new `run_id`), or clean up partial files.
- Validation Failure (Data Quality Issues):
  - Job completes but validation checks fail (e.g., quarantine rate > threshold).
  - `_SUCCESS` marker may or may not be written (depends on the validation stage).
  - If validation fails before `_SUCCESS`, the run is treated as failed.
  - `_LATEST.json` and `current/` remain unchanged.
  - Action: Data quality team reviews quarantine, fixes source data, reruns.
- Circuit Breaker Triggered:
  - Pipeline halts automatically when more than 100 identical errors occur within 1 hour.
  - Job exits with `RuntimeError` (non-zero status).
  - No `_SUCCESS` marker is written.
  - No data files are written (or partial files are ignored).
  - `_LATEST.json` and `current/` remain unchanged.
  - Action: Platform team investigates the root cause, fixes the systemic issue, manually restarts the pipeline.
- Schema Validation Failure:
  - Schema drift detected (unexpected columns, type mismatches).
  - Job fails fast before writing any output.
  - `_LATEST.json` and `current/` remain unchanged.
  - Action: Schema registry updated, job rerun with the new schema version.
Safe Rerun Behavior:
- Each rerun uses a new `run_id` (timestamp-based).
- Previous failed runs remain in storage (audit trail).
- Only successful runs (with `_SUCCESS`) are considered for promotion.
- Promotion to `_LATEST.json` and `current/` is a separate, explicit step after validation.
Promotion Workflow
⚠️ IMPORTANT: For financial data compliance, human approval is required before promoting Silver layer data to production. See HUMAN_VALIDATION_POLICY.md for details.
1. ETL writes to: `silver/.../schema_v=v1/run_id=20260121T120000Z/...` (isolated `run_id` path)
2. ETL writes the `_SUCCESS` marker with metrics
3. CloudWatch alarm triggers: "New Run Available for Review" (P2 - 4 hours)
4. Human Review Required:
   - Domain Analyst reviews quality metrics (quarantine rate, volume, schema)
   - Business validation (sample data review)
   - Technical validation (Platform Team)
5. If approved: Update Glue Catalog → promote to production consumption
6. If rejected: Leave the run in its isolated path, notify the Platform Team to investigate
7. Audit: All approvals are logged with timestamp, approver, metrics, and reason
Note: `_LATEST.json` and the `current/` prefix apply to the Gold layer, not the Silver layer (Task 1). The Gold layer structure, governance, and ownership model are described in the Task 2 architecture design.
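The explicit promotion step can be sketched as follows. The `store` dict stands in for the Gold-layer bucket (in production this would be an S3 `PutObject` of `_LATEST.json` plus an audit-log append); the function and field names are illustrative assumptions.

```python
import json
from datetime import datetime, timezone

def promote(run_id: str, approver: str, metrics: dict, store: dict) -> None:
    """Promote an approved run: update the _LATEST.json pointer and log it.

    Only called after human approval per the promotion workflow above;
    the ETL job itself never calls this.
    """
    store["_LATEST.json"] = json.dumps({"run_id": run_id})
    store.setdefault("audit", []).append({
        "run_id": run_id,
        "approver": approver,
        "metrics": metrics,
        "approved_at": datetime.now(timezone.utc).isoformat(),
    })
```

Keeping promotion as a separate call, rather than part of the ETL write path, is what makes "failed runs never update `_LATEST.json`" enforceable.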
2. Infrastructure as Code (Terraform)
We use Terraform to manage the Data Lake lifecycle.
Key Resources
- S3 Buckets: `raw`, `processed`, `quarantine`, `code-artifacts`.
  - Policy: Block public access, enable versioning (critical for recovery).
- IAM Roles: Least-privilege role for the ETL job (Read Raw, Write Processed/Quarantine).
- Prefix-Scoped Permissions: IAM policies are scoped to S3 prefixes for fine-grained access control:
  - ETL Job Role: `s3://bucket/bronze/*` (read), `s3://bucket/silver/*` (write), `s3://bucket/quarantine/*` (write)
  - Platform Team: `s3://bucket/bronze/*`, `s3://bucket/silver/*`, `s3://bucket/quarantine/*` (read/write)
  - Domain Teams: `s3://bucket/silver/{domain}/*` (write), `s3://bucket/gold/{domain}/*` (read)
  - Business/Analysts: `s3://bucket/gold/*` (read-only via Athena)
  - Compliance: `s3://bucket/bronze/*`, `s3://bucket/quarantine/*` (read-only for audit)
- AWS Glue Job: Defines the Python Shell or Spark job, injected with the S3 path to the script.
- AWS Step Functions: ✅ Implemented - Orchestrates scheduled ETL runs with automatic retry and error handling
- EventBridge: ✅ Implemented - Schedules daily ETL runs (configurable cron expression)
- AWS Lambda: ❌ Not Implemented - Step Functions provides better orchestration; Lambda not needed
- Monitoring: CloudWatch Alarms.
  - Alarm: `QuarantineRows > 0` (investigate data quality issues).
  - Alarm: `JobFailure`.
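The prefix-scoped permissions above can be illustrated by generating the policy document in Python. In practice these statements live in the Terraform `main.tf`; the bucket name `data-lake` is an assumption, and real read access would also need bucket-level `s3:ListBucket` statements with a prefix condition, omitted here for brevity.

```python
def prefix_policy(bucket: str, read: list[str], write: list[str]) -> dict:
    """Build an IAM policy document scoped to S3 key prefixes."""
    statements = []
    if read:
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{p}" for p in read],
        })
    if write:
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{p}" for p in write],
        })
    return {"Version": "2012-10-17", "Statement": statements}

# Example: the ETL job role from the list above (bucket name is hypothetical).
etl_policy = prefix_policy("data-lake", read=["bronze/*"],
                           write=["silver/*", "quarantine/*"])
```

Scoping `Resource` ARNs to prefixes, rather than granting bucket-wide access, is what enforces the layer boundaries (Bronze read-only for the ETL role, Silver/quarantine write-only).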
Step Functions Orchestration
What Events Are Scheduled:
- Daily ETL Runs: Process new transaction CSV files from the Bronze layer
  - Default schedule: Daily at 2 AM UTC (`cron(0 2 * * ? *)`)
  - Alternative: Monthly on the 1st day (`cron(0 2 1 * ? *)`)
  - Configurable via EventBridge rule
What Step Functions Does:
- RunETL State:
  - Invokes the AWS Glue Spark job synchronously
  - Automatically retries on transient failures (Glue throttling, service exceptions)
  - Max 3 retry attempts with exponential backoff (Step Functions retry logic)
- ValidateOutput State:
  - Checks for the `_SUCCESS` marker in the Silver layer output
  - Verifies the ETL job completed successfully
  - Retries if the marker is not found (handles eventual consistency)
- Error Handling:
  - Catches all failures and transitions to the `HandleFailure` state
  - Publishes failure metrics to CloudWatch
  - Logs execution details for debugging
Benefits:
- Automatic Retry: Handles transient Glue failures automatically
- Visual Monitoring: Step Functions console shows execution flow
- Error Recovery: Built-in error handling and state management
- Cost: ~$0.01/month for daily runs (negligible)
EventBridge Integration:
- EventBridge rule triggers Step Functions state machine on schedule
- No Lambda needed - direct EventBridge → Step Functions integration
- Configurable cron expression for different schedules
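The state machine described above can be sketched as an Amazon States Language definition, here expressed as a Python dict. The exact service-integration ARNs, error names, and retry intervals are assumptions illustrating the shape, not the deployed definition.

```python
# Sketch of the Step Functions definition: RunETL with retry/backoff,
# ValidateOutput checking for _SUCCESS, and a HandleFailure catch-all.
state_machine = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Retry": [{
                "ErrorEquals": ["Glue.AWSGlueException", "States.Timeout"],
                "MaxAttempts": 3,          # max 3 retries, as stated above
                "IntervalSeconds": 60,
                "BackoffRate": 2.0,        # exponential backoff
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "ValidateOutput",
        },
        "ValidateOutput": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:s3:headObject",
            "Retry": [{"ErrorEquals": ["S3.NoSuchKey"], "MaxAttempts": 5}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Error": "ETLFailed"},
    },
}
```

In Terraform this dict would be serialized with `jsonencode` into the state machine's `definition`; the EventBridge rule then targets the state machine directly, with no Lambda in between.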
3. Deployment Artifacts
List of files required to deploy this solution:
| Artifact | Description |
|---|---|
| `tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py` | The main ETL logic. |
| `tasks/01_data_ingestion_transformation/requirements.txt` | Python dependencies (pandas, pyarrow, boto3). |
| `tasks/04_devops_cicd/infra/terraform/main.tf` | Infrastructure definition. |
| `tasks/04_devops_cicd/.github/workflows/ci.yml` | Automation pipeline definition. |
| `tasks/01_data_ingestion_transformation/config.yaml` | Runtime config template (bucket names, prefixes, allow-lists). The ETL currently uses CLI args/env vars; this file is a template for config-driven runs. |
4. Operational Monitoring
To ensure reliability, we emit structured logs (JSON) that CloudWatch Insights can query:
Volume Metrics:
- `run_id`: Trace ID for the execution.
- `input_rows`: Total rows read from the Bronze layer.
- `valid_rows_count`: Rows successfully validated and written to the Silver layer.
- `quarantined_rows_count`: Rows quarantined due to validation failures.
- `condemned_rows_count`: Rows auto-condemned (duplicates or max attempts exceeded: up to 3 retries allowed, condemned after the 3rd failure). Human review and approval are required before reprocessing.
Quality Metrics:
- `quarantine_rate`: Percentage of rows quarantined.
- `validation_failure_rate`: Quality metric.
- `error_type_distribution`: Breakdown by error type (SCHEMA_ERROR, NULL_VALUE_ERROR, etc.).
Loop Prevention Metrics:
- `avg_attempt_count`: Average `attempt_count` across all processed rows.
- `duplicate_detection_rate`: Percentage of rows flagged as exact duplicates.
- `auto_condemnation_rate`: Percentage of rows auto-condemned.
- `circuit_breaker_triggers`: Count of circuit breaker activations.
Performance Metrics:
- `rows_processed_per_run`: Throughput metric.
- `duration_seconds`: ETL execution time.
- `missing_partitions`: Completeness metric.
- `runtime_anomalies`: Performance metric.
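Emitting one JSON log line per run with these field names is what makes the CloudWatch Insights queries possible. A minimal sketch, assuming the standard `logging` module; the helper and its signature are illustrative, not the actual ETL code.

```python
import json
import logging

logger = logging.getLogger("etl")

def emit_run_metrics(run_id: str, input_rows: int, valid: int,
                     quarantined: int, condemned: int,
                     duration_seconds: float) -> str:
    """Emit one structured JSON log line queryable by CloudWatch Insights.

    Field names match the volume/quality metrics listed above.
    """
    record = {
        "run_id": run_id,
        "input_rows": input_rows,
        "valid_rows_count": valid,
        "quarantined_rows_count": quarantined,
        "condemned_rows_count": condemned,
        "quarantine_rate": quarantined / input_rows if input_rows else 0.0,
        "duration_seconds": duration_seconds,
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

An Insights query such as `fields run_id, quarantine_rate | filter quarantine_rate > 0.01` then drives the quarantine-rate alarm directly off these lines.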
Monitoring Alerts
- Infrastructure Alerts (Data Platform Team):
  - Job failure (non-zero exit, missing `_SUCCESS`).
  - Missing partitions (expected partitions not present).
  - Runtime anomalies (unusually long execution time).
  - Circuit breaker triggered (>100 identical errors/hour → automatic pipeline halt, P1 - Immediate).
- Data Quality Alerts (Data Quality Team):
  - Quarantine rate spike (`quarantined_rows / input_rows > 1%`, P2 - 4 hours).
  - Validation failure (schema drift, type mismatches, P2 - 2 hours).
  - High `attempt_count` (avg > 1.5, P2 - 4 hours).
  - Auto-condemnation spike (rate > 0.5%, P2 - 4 hours).
- Business Metric Alerts (Domain Teams / Business):
- Volume anomaly (too few or too many rows vs baseline, P3 - 8 hours).
- SLA breach (data freshness, availability, P1 - 1 hour).
5. Ownership & Governance (Task 4)
5.1 Ownership Matrix
| Aspect | Owner | Steward | Execution | Responsibility |
|---|---|---|---|---|
| Pipeline Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | CI/CD pipeline reliability, deployment automation, AWS Step Functions/EventBridge job scheduling, run isolation mechanisms |
| CI/CD Automation | Data Platform Team | Platform Lead | Data Platform Team | GitHub Actions workflows, Terraform IaC, artifact versioning, AWS Glue job deployment automation |
| AWS Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | S3 buckets, Glue jobs, IAM policies, CloudWatch infrastructure, resource provisioning |
| Infrastructure Monitoring | Data Platform Team | Platform Lead | Data Platform Team | Job failure detection, system health monitoring, infrastructure alerting, run completeness checks |
| Validation Rules | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation logic definition, business rule specification, quality threshold configuration |
| Data Quality Monitoring | Data Quality Team | Data Quality Lead | Data Quality Team | Quarantine review, quality metric interpretation, quality threshold monitoring, source data issue triage |
| Quarantine Resolution | Data Quality Team / Domain Teams | Data Quality Lead | Data Quality Team / Domain Teams | Invalid row investigation, source data correction, quarantine workflow management |
| Silver Layer Schema | Domain Teams | Domain Analyst | Data Platform Team | Schema change approval, schema version management, validation logic evolution |
| Gold Layer Schema | Business (Finance) | Finance Controller | Data Platform Team | Business contract definition, reporting schema approval, stakeholder communication |
| Schema Implementation | Data Platform Team | Platform Lead | Data Platform Team | Technical implementation of approved schema changes, schema versioning, backward compatibility |
| Dataset Ownership Metadata | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Business context, dataset purpose, consumer requirements, ownership assignment |
| Run Lineage & Audit Logs | Data Platform Team | Platform Lead | Data Platform Team | Technical lineage tracking, infrastructure audit logs, run metadata capture |
| Backfill Execution | Data Platform Team | Platform Lead | Data Platform Team | Technical reprocessing, partition-level backfills, run isolation for backfills |
| Backfill Approval | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | N/A | Business approval for reprocessing, data correction requests, historical data updates |
| Promotion to Current | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation approval, promotion decision, _LATEST.json and current/ updates |
5.2 Contact Information
| Role | Contact Method | Primary Contact | Escalation Contact |
|---|---|---|---|
| Platform Lead | Email: [email protected]; Slack: #data-platform | Platform Lead | Infrastructure Manager |
| Domain Analyst | Email: [email protected]; Slack: #data-domain | Domain Analyst | Domain Manager |
| Finance Controller | Email: [email protected]; Slack: #finance-data | Finance Controller | Finance Director |
| Data Quality Lead | Email: [email protected]; Slack: #data-quality | Data Quality Lead | Data Quality Manager |
| On-Call Engineer | PagerDuty: data-platform-oncall; Slack: @oncall-data-platform | On-Call Rotation | Platform Lead |
Note: Replace placeholder email addresses and Slack channels with actual organizational contacts. Contact information should be maintained in a centralized directory (e.g., company wiki, directory service) and kept up-to-date.
5.3 Alert Ownership & Escalation
| Alert Type | Primary Owner | Escalation Path | Response Time |
|---|---|---|---|
| Job Failure (non-zero exit, missing `_SUCCESS`) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Infrastructure Errors (S3 access failures, Glue job crashes) | Data Platform Team | Platform Lead → Infrastructure Team | Immediate (P1) |
| Circuit Breaker Triggered (>100 same errors/hour, pipeline halted) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Quarantine Rate Spike (`quarantined_rows / input_rows > 1%`) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Validation Failure (schema drift, type mismatches) | Data Quality Team | Data Quality Lead → Domain Teams → Platform Team | 2 hours (P2) |
| High Attempt Count (avg `attempt_count` > 1.5) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Auto-Condemnation Spike (auto-condemnation rate > 0.5%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Volume Anomaly (too few/many rows vs baseline) | Domain Teams | Domain Analyst → Data Quality Team | 8 hours (P3) |
| Missing Partitions (expected partitions not present) | Data Platform Team | Platform Lead → Domain Teams | 4 hours (P2) |
| Runtime Anomalies (unusually long execution time) | Data Platform Team | Platform Lead → Infrastructure Team | 4 hours (P2) |
| SLA Breach (data freshness, availability) | Domain Teams / Business | Domain Analyst / Finance Controller → Platform Team | 1 hour (P1) |
5.4 Operational Responsibilities Matrix
| Responsibility Type | Bronze Layer | Silver Layer | Gold Layer |
|---|---|---|---|
| Ownership (who decides) | Data Platform Team | Domain Teams | Business (Finance) |
| Stewardship (who maintains) | Ingestion Lead | Domain Analyst | Finance Controller |
| Execution (who implements) | Data Platform Team | Data Platform Team | Data Platform Team |
| Consumption (who uses) | Platform engineers, audit | Analysts, data scientists | Finance, BI, stakeholders |
| Change Approval | Platform Lead | Domain Analyst + Platform review | Finance Controller + Platform review |
| Quality Monitoring | Platform Team (ingestion reliability) | Data Quality Team (validation rules) | Business (reporting accuracy) |
5.5 Governance Workflows
Schema Change Workflow
Data Quality Issue Resolution Workflow
Backfill Approval Workflow
5.6 Governance Rules
Infrastructure & Platform Rules
- Platform team owns all pipeline reliability, deployment automation, and infrastructure provisioning.
- All infrastructure changes must go through Terraform IaC and the CI/CD pipeline.
- Failed runs never update `_LATEST.json` or `current/` (explicit promotion only).
- Run isolation via `run_id` is mandatory for all ETL executions.
- Human approval is required before promoting Silver layer data to production (see `HUMAN_VALIDATION_POLICY.md`).
- Human approval is required before deleting condemned data (see `HUMAN_VALIDATION_POLICY.md`).
Data Quality Rules
- Data Quality Team owns quarantine review and quality metric interpretation.
- Quarantine rate thresholds are configurable per dataset (default: 1%).
- All invalid rows must be preserved in quarantine (never silently dropped).
- Quality alerts require Data Quality Team triage before escalation.
Schema Governance Rules
- Silver layer schema changes require Domain Team approval + Platform Team implementation.
- Gold layer schema changes require Business/Finance approval + Platform Team implementation.
- All schema changes must be versioned via `schema_v` for backward compatibility.
- Schema changes require quality validation and a backfill if needed.
Backfill & Reprocessing Rules
- Technical execution of backfills is owned by the Platform Team.
- Business approval for backfills is required from Domain Teams (Silver) or Business (Gold).
- All backfills write to new `run_id` paths (no overwrites).
- Promotion to `current/` requires explicit validation and approval (Gold layer only; the Gold layer structure, governance, and ownership model are described in the Task 2 architecture design).
- Human approval is required before promoting Silver layer data to production (see `HUMAN_VALIDATION_POLICY.md`).
Monitoring & Alerting Rules
- Infrastructure alerts (P1) route to Platform Team for immediate response.
- Data quality alerts (P2) route to Data Quality Team for investigation.
- Business metric alerts (P3) route to Domain Teams for review.
- Alert ownership is clearly defined with escalation paths documented.
Alignment with Task 2 Architecture
- Bronze layer ownership aligns with Task 2: Platform Team (immutability, ingestion).
- Silver layer ownership aligns with Task 2: Domain Teams (validation, schema).
- Gold layer ownership aligns with Task 2 architecture: Business/Finance (contracts, reporting). The Gold layer structure, governance, and ownership model are described in Task 2.
- Governance workflows are consistent across Task 2 and Task 4 documentation.
Related Documentation
Task 4 Documentation
- CI/CD Testing - How to test CI/CD workflows locally
- Test Suite Summary - Test suite implementation details
- Test Suite Documentation - Detailed test documentation
- Scripts Documentation - CI/CD testing scripts
Related Tasks
- ETL Pipeline - What this CI/CD deploys
- Data Lake Architecture - Infrastructure this CI/CD provisions
- SQL Query - Code validated by this CI/CD
- Terraform Configuration - Infrastructure as code
Technical Documentation
- Unified Testing Convention - Testing standards
- Testing Guide - Comprehensive testing documentation