CI/CD Workflow & Infrastructure

Important Note on Scope

⚠️ Scope Disclaimer: This CI/CD workflow design is based on the case study requirements and my interpretation of the problem. I have made assumptions that may underestimate the current scope of operations at Ohpen. The workflow presented here would need refinement based on:

  • Actual deployment frequency and release processes
  • Existing CI/CD infrastructure and tooling
  • Team size, skills, and operational maturity
  • Production change management and approval processes
  • Real-world monitoring and alerting requirements

This design demonstrates DevOps thinking and best practices, but would require collaboration with the Ohpen team for production implementation.


1. Pipeline Design (GitHub Actions)

We implement a "history-safe" CI/CD process. Because we support backfills and reprocessing, the deployment pipeline must provide versioning and safe rollouts.

Workflow Stages

  1. Validation (CI): Runs on every Pull Request.
    • ruff linting (code style).
    • pytest unit tests (partition logic, null handling, quarantine checks).
  2. Artifact Build:
    • Packages the Python ETL code.
    • Tags artifact with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip).
  3. Deployment (CD):
    • Uploads artifact to S3 (Code Bucket).
    • Terraform plan & apply to update AWS infrastructure (Glue Jobs, IAM, Buckets).
    • Updates the Glue Job to point to the new artifact.
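
A minimal sketch of the artifact build and deploy step, assuming hypothetical bucket, job, and role names; the real pipeline performs these steps inside GitHub Actions and Terraform:

```python
# Sketch of the CD step: version the ETL script by Git SHA, upload it to the
# code bucket, and repoint the Glue job. All names/ARNs are placeholders.
import subprocess

import boto3

CODE_BUCKET = "example-code-artifacts"   # hypothetical code bucket
GLUE_JOB = "etl-ingest-transactions"     # hypothetical Glue job name

# Tag the artifact with the short Git SHA (e.g. etl-v1.0.0-a1b2c3d)
sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
key = f"artifacts/etl-v1.0.0-{sha}/ingest_transactions.py"

boto3.client("s3").upload_file("src/etl/ingest_transactions.py", CODE_BUCKET, key)

# Repoint the Glue job at the new artifact. Caution: UpdateJob resets any
# field omitted from JobUpdate, so real deployments should send the full
# job definition (Terraform handles this for us).
boto3.client("glue").update_job(
    JobName=GLUE_JOB,
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/etl-job-role",  # placeholder
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": f"s3://{CODE_BUCKET}/{key}",
        },
    },
)
```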

Backfill Safety Checks

Our tests specifically cover "history safety":

  • Determinism: Rerunning the same input produces the exact same counts.
  • Partitioning: Timestamps strictly map to the correct year=YYYY/month=MM folder.
  • Quarantine: Invalid rows are never silently dropped; they must appear in quarantine.
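
The sketch below illustrates how these checks could be expressed in pytest; the imported names (partition_path, run_etl) are hypothetical stand-ins for the real ETL API:

```python
# Illustrative pytest sketch of the history-safety checks.
from datetime import datetime, timezone

from etl.ingest_transactions import partition_path, run_etl  # hypothetical imports


def test_partitioning_maps_timestamp_to_year_month_folder():
    ts = datetime(2026, 1, 21, 12, 0, tzinfo=timezone.utc)
    assert partition_path(ts) == "year=2026/month=01"


def test_determinism_same_input_same_counts():
    first = run_etl("tests/fixtures/transactions.csv")
    second = run_etl("tests/fixtures/transactions.csv")
    assert first.valid_rows == second.valid_rows
    assert first.quarantined_rows == second.quarantined_rows


def test_invalid_rows_are_quarantined_never_dropped():
    result = run_etl("tests/fixtures/transactions_with_bad_rows.csv")
    # Every input row must land somewhere: valid output or quarantine.
    assert result.valid_rows + result.quarantined_rows == result.input_rows
```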

Failure Handling

Critical Rule: Failed runs do not update _LATEST.json or current/ prefix.

Failure Scenarios & Behavior:

  1. ETL Job Failure (No Output):

    • Job exits with non-zero status.
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json remains unchanged (points to previous successful run).
    • current/ prefix remains unchanged (stable for SQL queries).
    • Action: Alert triggers, platform team investigates, job can be rerun safely.
  2. Partial Write (Job Crashes Mid-Execution):

    • Some data files may be written to run_id={...}/ path.
    • _SUCCESS marker is missing (incomplete run).
    • Consumers ignore the run (only read runs with _SUCCESS).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team can safely rerun (new run_id), or clean up partial files.
  3. Validation Failure (Data Quality Issues):

    • Job completes but validation checks fail (e.g., quarantine rate > threshold).
    • _SUCCESS marker may or may not be written, depending on whether the failing check runs before or after the marker is written.
    • If validation fails before _SUCCESS, run is treated as failed.
    • _LATEST.json and current/ remain unchanged.
    • Action: Data quality team reviews quarantine, fixes source data, reruns.
  4. Circuit Breaker Triggered:

    • Pipeline halts automatically when more than 100 occurrences of the same error type are recorded within one hour (a sketch of this logic follows the list below).
    • Job exits with RuntimeError (non-zero status).
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team investigates root cause, fixes systemic issue, manually restarts pipeline.
  5. Schema Validation Failure:

    • Schema drift detected (unexpected columns, type mismatches).
    • Job fails fast before writing any output.
    • _LATEST.json and current/ remain unchanged.
    • Action: Schema registry updated, job rerun with new schema version.
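
As referenced in scenario 4, a minimal sketch of the circuit-breaker rule, assuming an in-memory, in-process error log; the real pipeline may track this differently:

```python
# Circuit breaker sketch: halt on >100 same-type errors within one hour.
import time
from collections import defaultdict

THRESHOLD = 100          # same-error occurrences
WINDOW_SECONDS = 3600    # one hour

_error_times = defaultdict(list)  # error_type -> timestamps of occurrences


def record_error(error_type: str) -> None:
    """Record one error; halt if the same error type exceeds the threshold
    within the sliding one-hour window."""
    now = time.time()
    _error_times[error_type].append(now)
    # Drop occurrences that fell out of the window.
    _error_times[error_type] = [t for t in _error_times[error_type]
                                if now - t <= WINDOW_SECONDS]
    if len(_error_times[error_type]) > THRESHOLD:
        # Non-zero exit: no _SUCCESS marker, _LATEST.json/current/ untouched.
        raise RuntimeError(f"Circuit breaker: >{THRESHOLD} '{error_type}' errors in 1h")
```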

Safe Rerun Behavior:

  • Each rerun uses a new run_id (timestamp-based).
  • Previous failed runs remain in storage (audit trail).
  • Only successful runs (with _SUCCESS) are considered for promotion.
  • Promotion to _LATEST.json and current/ is a separate, explicit step after validation.
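
A sketch of consumer-side run resolution under these rules, with placeholder bucket and prefix names:

```python
# Consumers only consider runs that carry a _SUCCESS marker.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-processed"                  # hypothetical bucket
PREFIX = "silver/transactions/schema_v=v1/"   # hypothetical prefix


def latest_successful_run() -> str | None:
    """Return the newest run_id prefix that contains a _SUCCESS marker."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/")
    run_prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    # run_id is timestamp-based, so lexicographic sort == chronological sort.
    for run in sorted(run_prefixes, reverse=True):
        marker = s3.list_objects_v2(Bucket=BUCKET, Prefix=run + "_SUCCESS", MaxKeys=1)
        if marker.get("KeyCount", 0) > 0:
            return run  # newest complete run; partial runs are skipped
    return None
```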

Promotion Workflow

⚠️ IMPORTANT: For financial data compliance, human approval is required before promoting Silver layer data to production. See HUMAN_VALIDATION_POLICY.md for details.


  1. ETL writes to: silver/.../schema_v=v1/run_id=20260121T120000Z/... (isolated run_id path)
  2. ETL writes _SUCCESS marker with metrics
  3. CloudWatch alarm triggers: "New Run Available for Review" (P2 - 4 hours)
  4. Human Review Required:
    • Domain Analyst reviews quality metrics (quarantine rate, volume, schema)
    • Business validation (sample data review)
    • Technical validation (Platform Team)
  5. If approved: Update Glue Catalog → promote to production consumption (see the promotion sketch below)
  6. If rejected: Leave run in isolated path, notify Platform Team to investigate
  7. Audit: All approvals logged with timestamp, approver, metrics, reason

Note: _LATEST.json and the current/ prefix apply to the Gold layer (its structure, governance, and ownership model are described in the Task 2 architecture design), not to the Silver layer (Task 1).
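
For illustration, the approved-promotion step (step 5 above) could repoint the Glue Catalog table at the approved run, as in this hedged sketch; database, table, and path names are placeholders:

```python
# Explicit promotion after human approval: repoint the catalog table.
import boto3

glue = boto3.client("glue")
DATABASE = "silver"        # hypothetical catalog database
TABLE = "transactions"     # hypothetical table name


def promote(run_path: str, approver: str) -> None:
    """Repoint the catalog table at the approved run; record the approver."""
    current = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    # TableInput accepts only a subset of the fields get_table returns.
    table_input = {
        "Name": current["Name"],
        "StorageDescriptor": current["StorageDescriptor"],
        "PartitionKeys": current.get("PartitionKeys", []),
        "TableType": current.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": {**current.get("Parameters", {}), "approved_by": approver},
    }
    table_input["StorageDescriptor"]["Location"] = run_path
    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```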


2. Infrastructure as Code (Terraform)

We use Terraform to manage the Data Lake lifecycle.

Key Resources

  • S3 Buckets: raw, processed, quarantine, code-artifacts.
    • Policy: Block public access, enable versioning (critical for recovery).
  • IAM Roles: Least-privilege role for the ETL job (Read Raw, Write Processed/Quarantine).
    • Prefix-Scoped Permissions: IAM policies are scoped to S3 prefixes for fine-grained access control (see the policy sketch after this list):
      • ETL Job Role: s3://bucket/bronze/* (read), s3://bucket/silver/* (write), s3://bucket/quarantine/* (write)
      • Platform Team: s3://bucket/bronze/*, s3://bucket/silver/*, s3://bucket/quarantine/* (read/write)
      • Domain Teams: s3://bucket/silver/{domain}/* (write), s3://bucket/gold/{domain}/* (read)
      • Business/Analysts: s3://bucket/gold/* (read-only via Athena)
      • Compliance: s3://bucket/bronze/*, s3://bucket/quarantine/* (read-only for audit)
  • AWS Glue Job: Defines the Python Shell or Spark job, injected with the S3 path to the script.
  • AWS Step Functions: ✅ Implemented - Orchestrates scheduled ETL runs with automatic retry and error handling
  • EventBridge: ✅ Implemented - Schedules daily ETL runs (configurable cron expression)
  • AWS Lambda: ❌ Not Implemented - Step Functions provides better orchestration; Lambda not needed
  • Monitoring: CloudWatch Alarms.
    • Alarm: QuarantineRows > 0 (investigate data quality issues).
    • Alarm: JobFailure.
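
A hedged sketch of the prefix-scoped policy for the ETL job role; the bucket ARN is a placeholder, and the real policies are defined in Terraform:

```python
# Illustrative prefix-scoped IAM policy: read Bronze, write Silver/quarantine.
import json

BUCKET = "arn:aws:s3:::example-data-lake"  # hypothetical bucket ARN

etl_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read-only on the Bronze prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"{BUCKET}/bronze/*"],
        },
        {   # Write-only on the Silver and quarantine prefixes
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"{BUCKET}/silver/*", f"{BUCKET}/quarantine/*"],
        },
        {   # Listing is bucket-level, so scope it with a prefix condition
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [BUCKET],
            "Condition": {"StringLike": {"s3:prefix": ["bronze/*", "silver/*", "quarantine/*"]}},
        },
    ],
}

print(json.dumps(etl_job_policy, indent=2))
```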

Step Functions Orchestration

What Events Are Scheduled:

  • Daily ETL Runs: Process new transaction CSV files from Bronze layer
    • Default schedule: Daily at 2 AM UTC (cron(0 2 * * ? *))
    • Alternative: Monthly on 1st day (cron(0 2 1 * ? *))
    • Configurable via EventBridge rule

What Step Functions Does:

  1. RunETL State:

    • Invokes AWS Glue Spark job synchronously
    • Automatically retries on transient failures (Glue throttling, service exceptions)
    • At most 3 retry attempts with exponential backoff (Step Functions retry logic)
  2. ValidateOutput State:

    • Checks for _SUCCESS marker in Silver layer output
    • Verifies ETL job completed successfully
    • Retries if marker not found (handles eventual consistency)
  3. Error Handling:

    • Catches all failures and transitions to HandleFailure state
    • Publishes failure metrics to CloudWatch
    • Logs execution details for debugging
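
One possible shape of this state machine, expressed in Amazon States Language as a Python dict; job name, bucket, and key are placeholders, and the real definition lives in Terraform:

```python
# Sketch of the ASL definition: RunETL -> ValidateOutput, with HandleFailure.
import json

definition = {
    "StartAt": "RunETL",
    "States": {
        "RunETL": {
            "Type": "Task",
            # .sync = start the Glue job and wait for it to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "etl-ingest-transactions"},
            "Retry": [{
                "ErrorEquals": ["Glue.AWSGlueException", "Glue.ConcurrentRunsExceededException"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,      # at most 3 retries
                "BackoffRate": 2.0,    # exponential backoff
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "ValidateOutput",
        },
        "ValidateOutput": {
            "Type": "Task",
            # AWS SDK integration: HeadObject fails while _SUCCESS is absent
            "Resource": "arn:aws:states:::aws-sdk:s3:headObject",
            "Parameters": {"Bucket": "example-processed", "Key": "silver/.../_SUCCESS"},
            # Retry while the marker is not yet visible (eventual consistency)
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 60, "MaxAttempts": 5}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Error": "ETLFailed", "Cause": "See CloudWatch logs"},
    },
}

print(json.dumps(definition, indent=2))
```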

Benefits:

  • Automatic Retry: Handles transient Glue failures automatically
  • Visual Monitoring: Step Functions console shows execution flow
  • Error Recovery: Built-in error handling and state management
  • Cost: ~$0.01/month for daily runs (negligible)

EventBridge Integration:

  • EventBridge rule triggers Step Functions state machine on schedule
  • No Lambda needed - direct EventBridge → Step Functions integration
  • Configurable cron expression for different schedules
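
A sketch of the schedule wiring via boto3, purely for illustration, since Terraform manages these resources in practice; ARNs are placeholders:

```python
# EventBridge rule -> Step Functions target, no Lambda in between.
import boto3

events = boto3.client("events")

# Daily at 2 AM UTC, matching the default schedule above.
events.put_rule(
    Name="daily-etl-schedule",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-etl-schedule",
    Targets=[{
        "Id": "etl-state-machine",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:etl",      # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",   # placeholder
    }],
)
```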

3. Deployment Artifacts

List of files required to deploy this solution:

| Artifact | Description |
| --- | --- |
| tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py | The main ETL logic. |
| tasks/01_data_ingestion_transformation/requirements.txt | Python dependencies (pandas, pyarrow, boto3). |
| tasks/04_devops_cicd/infra/terraform/main.tf | Infrastructure definition. |
| tasks/04_devops_cicd/.github/workflows/ci.yml | Automation pipeline definition. |
| tasks/01_data_ingestion_transformation/config.yaml | Runtime config template (bucket names, prefixes, allow-lists). The ETL currently uses CLI args/env vars; this file is a template for config-driven runs. |

4. Operational Monitoring

To ensure reliability, we emit structured logs (JSON) that CloudWatch Insights can query:

Volume Metrics:

  • run_id: Trace ID for the execution.
  • input_rows: Total rows read from Bronze layer.
  • valid_rows_count: Rows successfully validated and written to Silver layer.
  • quarantined_rows_count: Rows quarantined due to validation failures.
  • condemned_rows_count: Rows auto-condemned (duplicates, or max attempts exceeded: up to 3 attempts allowed, condemned after the 3rd failure). Human review and approval are required before reprocessing.

Quality Metrics:

  • quarantine_rate: Percentage of rows quarantined.
  • validation_failure_rate: Percentage of input rows failing validation checks.
  • error_type_distribution: Breakdown by error type (SCHEMA_ERROR, NULL_VALUE_ERROR, etc.).

Loop Prevention Metrics:

  • avg_attempt_count: Average attempt_count across all processed rows.
  • duplicate_detection_rate: Percentage of rows flagged as exact duplicates.
  • auto_condemnation_rate: Percentage of rows auto-condemned.
  • circuit_breaker_triggers: Count of circuit breaker activations.

Performance Metrics:

  • rows_processed_per_run: Throughput metric.
  • duration_seconds: ETL execution time.
  • missing_partitions: Count of expected partitions not present (completeness metric).
  • runtime_anomalies: Flags for unusually long execution times (performance metric).
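
A minimal sketch of how the ETL could emit these metrics as one JSON log line per run, so CloudWatch Insights can query the fields directly; the values shown are illustrative:

```python
# Structured JSON logging: one object per line, queryable in CloudWatch Insights.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("etl")


def emit_run_metrics(run_id: str, metrics: dict) -> None:
    # Each key becomes a queryable field in CloudWatch Insights.
    logger.info(json.dumps({"event": "run_metrics", "run_id": run_id, **metrics}))


emit_run_metrics(
    "20260121T120000Z",  # illustrative run_id
    {
        "input_rows": 10_000,
        "valid_rows_count": 9_890,
        "quarantined_rows_count": 100,
        "condemned_rows_count": 10,
        "quarantine_rate": 0.01,
        "duration_seconds": 312,
    },
)
```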

Monitoring Alerts

  • Infrastructure Alerts (Data Platform Team):
    • Job failure (non-zero exit, missing _SUCCESS).
    • Missing partitions (expected partitions not present).
    • Runtime anomalies (unusually long execution time).
    • Circuit breaker triggered (>100 same errors/hour → automatic pipeline halt, P1 - Immediate).
  • Data Quality Alerts (Data Quality Team):
    • Quarantine rate spike (quarantined_rows / input_rows > 1%, P2 - 4 hours).
    • Validation failure (schema drift, type mismatches, P2 - 2 hours).
    • High attempt_count (avg > 1.5, P2 - 4 hours).
    • Auto-condemnation spike (rate > 0.5%, P2 - 4 hours).
  • Business Metric Alerts (Domain Teams / Business):
    • Volume anomaly (too few or too many rows vs baseline, P3 - 8 hours).
    • SLA breach (data freshness, availability, P1 - 1 hour).
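
For illustration, one of these alarms (the quarantine-rate spike, P2) might be defined as below; the metric namespace and SNS topic are placeholders, and in practice the alarms are provisioned by Terraform:

```python
# One CloudWatch alarm: quarantine_rate > 1% over an hourly window.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="etl-quarantine-rate-spike",
    Namespace="ETL/Quality",       # hypothetical custom metric namespace
    MetricName="quarantine_rate",
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.01,                # quarantined_rows / input_rows > 1%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-quality-alerts"],  # placeholder
)
```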

5. Ownership & Governance (Task 4)

5.1 Ownership Matrix

| Aspect | Owner | Steward | Execution | Responsibility |
| --- | --- | --- | --- | --- |
| Pipeline Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | CI/CD pipeline reliability, deployment automation, AWS Step Functions/EventBridge job scheduling, run isolation mechanisms |
| CI/CD Automation | Data Platform Team | Platform Lead | Data Platform Team | GitHub Actions workflows, Terraform IaC, artifact versioning, AWS Glue job deployment automation |
| AWS Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | S3 buckets, Glue jobs, IAM policies, CloudWatch infrastructure, resource provisioning |
| Infrastructure Monitoring | Data Platform Team | Platform Lead | Data Platform Team | Job failure detection, system health monitoring, infrastructure alerting, run completeness checks |
| Validation Rules | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation logic definition, business rule specification, quality threshold configuration |
| Data Quality Monitoring | Data Quality Team | Data Quality Lead | Data Quality Team | Quarantine review, quality metric interpretation, quality threshold monitoring, source data issue triage |
| Quarantine Resolution | Data Quality Team / Domain Teams | Data Quality Lead | Data Quality Team / Domain Teams | Invalid row investigation, source data correction, quarantine workflow management |
| Silver Layer Schema | Domain Teams | Domain Analyst | Data Platform Team | Schema change approval, schema version management, validation logic evolution |
| Gold Layer Schema | Business (Finance) | Finance Controller | Data Platform Team | Business contract definition, reporting schema approval, stakeholder communication |
| Schema Implementation | Data Platform Team | Platform Lead | Data Platform Team | Technical implementation of approved schema changes, schema versioning, backward compatibility |
| Dataset Ownership Metadata | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Business context, dataset purpose, consumer requirements, ownership assignment |
| Run Lineage & Audit Logs | Data Platform Team | Platform Lead | Data Platform Team | Technical lineage tracking, infrastructure audit logs, run metadata capture |
| Backfill Execution | Data Platform Team | Platform Lead | Data Platform Team | Technical reprocessing, partition-level backfills, run isolation for backfills |
| Backfill Approval | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | N/A | Business approval for reprocessing, data correction requests, historical data updates |
| Promotion to Current | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation approval, promotion decision, _LATEST.json and current/ updates |

5.2 Contact Information

| Role | Contact Method | Primary Contact | Escalation Contact |
| --- | --- | --- | --- |
| Platform Lead | Email: [email protected]; Slack: #data-platform | Platform Lead | Infrastructure Manager |
| Domain Analyst | Email: [email protected]; Slack: #data-domain | Domain Analyst | Domain Manager |
| Finance Controller | Email: [email protected]; Slack: #finance-data | Finance Controller | Finance Director |
| Data Quality Lead | Email: [email protected]; Slack: #data-quality | Data Quality Lead | Data Quality Manager |
| On-Call Engineer | PagerDuty: data-platform-oncall; Slack: @oncall-data-platform | On-Call Rotation | Platform Lead |

Note: Replace placeholder email addresses and Slack channels with actual organizational contacts. Contact information should be maintained in a centralized directory (e.g., company wiki, directory service) and kept up-to-date.

5.3 Alert Ownership & Escalation

| Alert Type | Primary Owner | Escalation Path | Response Time |
| --- | --- | --- | --- |
| Job Failure (non-zero exit, missing _SUCCESS) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Infrastructure Errors (S3 access failures, Glue job crashes) | Data Platform Team | Platform Lead → Infrastructure Team | Immediate (P1) |
| Circuit Breaker Triggered (>100 same errors/hour, pipeline halted) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Quarantine Rate Spike (quarantined_rows / input_rows > 1%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Validation Failure (schema drift, type mismatches) | Data Quality Team | Data Quality Lead → Domain Teams → Platform Team | 2 hours (P2) |
| High Attempt Count (avg attempt_count > 1.5) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Auto-Condemnation Spike (auto-condemnation rate > 0.5%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Volume Anomaly (too few/many rows vs baseline) | Domain Teams | Domain Analyst → Data Quality Team | 8 hours (P3) |
| Missing Partitions (expected partitions not present) | Data Platform Team | Platform Lead → Domain Teams | 4 hours (P2) |
| Runtime Anomalies (unusually long execution time) | Data Platform Team | Platform Lead → Infrastructure Team | 4 hours (P2) |
| SLA Breach (data freshness, availability) | Domain Teams / Business | Domain Analyst / Finance Controller → Platform Team | 1 hour (P1) |

5.4 Operational Responsibilities Matrix

| Responsibility Type | Bronze Layer | Silver Layer | Gold Layer |
| --- | --- | --- | --- |
| Ownership (who decides) | Data Platform Team | Domain Teams | Business (Finance) |
| Stewardship (who maintains) | Ingestion Lead | Domain Analyst | Finance Controller |
| Execution (who implements) | Data Platform Team | Data Platform Team | Data Platform Team |
| Consumption (who uses) | Platform engineers, audit | Analysts, data scientists | Finance, BI, stakeholders |
| Change Approval | Platform Lead | Domain Analyst + Platform review | Finance Controller + Platform review |
| Quality Monitoring | Platform Team (ingestion reliability) | Data Quality Team (validation rules) | Business (reporting accuracy) |

5.5 Governance Workflows

Schema Change Workflow

Domain Teams (Silver) or Business/Finance (Gold) approve the proposed change; the Platform Team implements it under a new schema_v version, with quality validation and backfill as needed.

Data Quality Issue Resolution Workflow

The Data Quality Team triages quarantine alerts and quality metrics, coordinates source data correction with the Domain Teams, and approves reprocessing of the affected rows.

Backfill Approval Workflow

Domain Teams (Silver) or Business (Gold) approve the reprocessing request; the Platform Team executes the backfill to a new run_id path, and promotion follows explicit validation and approval.

5.6 Governance Rules

Infrastructure & Platform Rules

  • Platform team owns all pipeline reliability, deployment automation, and infrastructure provisioning.
  • All infrastructure changes must go through Terraform IaC and CI/CD pipeline.
  • Failed runs never update _LATEST.json or current/ (explicit promotion only).
  • Run isolation via run_id is mandatory for all ETL executions.
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).
  • Human approval required before deleting condemned data (see HUMAN_VALIDATION_POLICY.md).

Data Quality Rules

  • Data Quality Team owns quarantine review and quality metric interpretation.
  • Quarantine rate thresholds are configurable per dataset (default: 1%).
  • All invalid rows must be preserved in quarantine (never silently dropped).
  • Quality alerts require Data Quality Team triage before escalation.

Schema Governance Rules

  • Silver layer schema changes require Domain Team approval + Platform Team implementation.
  • Gold layer schema changes require Business/Finance approval + Platform Team implementation.
  • All schema changes must be versioned via schema_v for backward compatibility.
  • Schema changes require quality validation and backfill if needed.

Backfill & Reprocessing Rules

  • Technical execution of backfills is owned by Platform Team.
  • Business approval for backfills is required from Domain Teams (Silver) or Business (Gold).
  • All backfills write to new run_id paths (no overwrites).
  • Promotion to current/ requires explicit validation and approval (Gold layer only; see the Task 2 architecture design for Gold layer structure, governance, and ownership).
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).

Monitoring & Alerting Rules

  • Infrastructure alerts (P1) route to Platform Team for immediate response.
  • Data quality alerts (P2) route to Data Quality Team for investigation.
  • Business metric alerts (P3) route to Domain Teams for review.
  • Alert ownership is clearly defined with escalation paths documented.

Alignment with Task 2 Architecture

  • Bronze layer ownership aligns with Task 2: Platform Team (immutability, ingestion).
  • Silver layer ownership aligns with Task 2: Domain Teams (validation, schema).
  • Gold layer ownership aligns with Task 2: Business/Finance (contracts, reporting); the Gold layer structure, governance, and ownership model are described in Task 2.
  • Governance workflows are consistent across Task 2 and Task 4 documentation.
