Overview
Full Time
Skills
Scheduling
FOCUS
Build Automation
Data Analysis
GitHub
GitLab
Jenkins
Dashboard
Grafana
SLA
Slack
Concurrent Computing
Incident Management
TLS
PostgreSQL
MySQL
Celery
Redis
RabbitMQ
Machine Learning (ML)
Documentation
Mentorship
Testing
Sensors
Mapping
Kubernetes
Network
Terraform
Continuous Delivery
Oracle Policy Automation
Python
Bash
Java
Modeling
Performance Tuning
Microsoft Azure
Google Cloud Platform
Snowflake Schema
Amazon Redshift
Databricks
Apache Spark
Electronic Health Record (EHR)
Virtual Private Cloud
Computer Networking
Storage
Amazon S3
Regulatory Compliance
SSO
OIDC
RBAC
Auditing
Change Control
Leadership
Roadmaps
Communication
Workflow
Amazon Web Services
Step Functions
Migration
Data Quality
Cloud Computing
Orchestration
Management
Continuous Integration
Optimization
Capacity Management
High Availability
Metadata Management
Database
Backup
Recovery
Apache Airflow
Job Details
We're seeking a seasoned engineer to design, operate, and scale our workflow orchestration platform with a primary focus on Apache Airflow. You'll own the Airflow control plane and developer experience end to end (architecture, automation, security, observability, and reliability) while also evaluating and running complementary schedulers where they make sense (e.g., Prefect, Kubernetes CronJobs). You'll build automation infrastructure (IaC, Helm, GitOps, CI/CD) and partner with data, analytics, and ML teams to deliver fast, reliable pipelines.
What you'll do
- Architect, deploy, and operate production-grade Airflow (self-managed or a managed offering such as MWAA, Cloud Composer, or Astronomer), including upgrades, capacity planning, HA, and performance tuning.
- Run Airflow on Kubernetes using Helm and GitOps; configure executors (KubernetesExecutor, or Celery on K8s with CeleryKubernetesExecutor), autoscaling (e.g., KEDA), resource quotas, PDBs, and rolling strategies.
- Build and maintain automation infrastructure: Terraform/Helm modules, GitOps (Argo CD/Flux), CI/CD pipelines (GitHub Actions/GitLab/Jenkins) for environment creation, upgrades, and zero/low-downtime rollouts.
- Standardize the developer experience: DAG repo templates, shared operator/hook libraries, connection/secrets management, packaging/constraints, code owners, linting (ruff/flake8), unit tests/pytest, and pre-commit checks (see the DAG-test sketch after this list).
- Implement observability: metrics (StatsD/Prometheus), dashboards (Grafana), structured logs (ELK/OpenSearch), tracing (OpenTelemetry), SLA/latency tracking, alerting (PagerDuty/Opsgenie/Slack), and automated remediation.
- Drive reliability: pools/queues/concurrency policies, retries/backoff, idempotency patterns, deferrable operators/sensors, backfills, datasets and cross-DAG dependencies, runbooks, and incident response/postmortems (see the reliability sketch after this list).
- Secure the platform: SSO/OIDC, RBAC, least-privilege connections, network policies, TLS, secrets management (Vault/Secrets Manager/Kubernetes Secrets), audit logging, and compliance automation/policy-as-code.
- Manage platform components: metadata DB (Postgres/MySQL), Celery brokers/backends (Redis/RabbitMQ), provider packages, and controlled plugin lifecycle; plan and execute Airflow 2.x upgrades/migrations.
- Integrate data quality and lineage: Great Expectations/dbt tests, OpenLineage/Marquez; enforce quality gates in CI/CD and at runtime.
- Orchestrate across the data/ML ecosystem: Snowflake/BigQuery/Redshift, Databricks/Spark/EMR/Dataproc, dbt Core/Cloud, object storage (S3/GCS/ADLS), event and batch workloads.
- Evaluate and, where appropriate, operate complementary schedulers (Prefect, Dagster, Argo Workflows, Kubernetes CronJobs, AWS Step Functions) and lead migrations from legacy orchestrators.
- Partner closely with platform, data, and ML teams; provide enablement, documentation, and self-service tooling. Mentor engineers and contribute to roadmap and standards.
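A minimal sketch of the DAG-integrity tests described above, assuming DAG files live in a local dags/ folder; the fixture name and house rules are illustrative, not prescribed by this role:

    import pytest
    from airflow.models import DagBag

    @pytest.fixture(scope="session")
    def dag_bag():
        # Parse every DAG file once; include_examples=False skips Airflow's demos.
        return DagBag(dag_folder="dags/", include_examples=False)

    def test_no_import_errors(dag_bag):
        # Any syntax error or missing dependency in a DAG file surfaces here.
        assert dag_bag.import_errors == {}

    def test_dags_set_retries_and_owner(dag_bag):
        # Enforce team conventions before a DAG ever reaches the scheduler.
        for dag_id, dag in dag_bag.dags.items():
            assert dag.default_args.get("retries", 0) >= 1, dag_id
            assert dag.default_args.get("owner"), dag_id

Wired into pre-commit or CI, tests like these gate every DAG change the same way application code is gated.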
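And a companion sketch of the reliability patterns above (retries with exponential backoff, a deferrable sensor, an idempotent partition-scoped load, and a dataset outlet), assuming Airflow 2.4+; every DAG id, URI, and callable here is a placeholder:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.python import PythonOperator
    from airflow.sensors.time_delta import TimeDeltaSensorAsync

    orders = Dataset("s3://example-bucket/orders/")  # illustrative dataset URI

    with DAG(
        dag_id="orders_daily",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={
            "retries": 3,
            "retry_delay": timedelta(minutes=5),
            "retry_exponential_backoff": True,
        },
    ):
        # Deferrable sensor: frees its worker slot while it waits.
        wait = TimeDeltaSensorAsync(task_id="wait_for_window", delta=timedelta(minutes=10))

        def load(ds: str) -> None:
            # Overwrite a single partition keyed on the logical date, so
            # retries and backfills can re-run safely (idempotent writes).
            print(f"overwriting partition dt={ds}")

        load_partition = PythonOperator(
            task_id="load_partition",
            python_callable=load,
            op_kwargs={"ds": "{{ ds }}"},
            outlets=[orders],  # downstream DAGs can schedule on this dataset
        )

        wait >> load_partition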
Required qualifications
- 5-8+ years building/operating data or platform systems; 3+ years running Airflow in production at scale (hundreds to thousands of DAGs and high task throughput).
- Deep Airflow expertise: DAG design and testing, idempotency, deferrable operators/sensors, dynamic task mapping (see the task-mapping sketch after this list), task groups, datasets, pools/queues, SLAs, retries/backfills, cross-DAG dependencies.
- Strong Kubernetes experience running Airflow and supporting services: Helm, autoscaling, node/pod tuning, topology spread, network policies, PDBs, and blue/green or canary strategies.
- Automation-first mindset: Terraform, Helm, GitOps (Argo CD/Flux), and CI/CD for platform lifecycle; policy-as-code (OPA/Gatekeeper/Conftest) for DAG, connection, and secrets changes (see the cluster-policy sketch after this list).
- Proficiency in Python for authoring operators/hooks/utilities; solid Bash; familiarity with Go or Java is a plus.
- Observability and SRE practices: Prometheus/Grafana/StatsD, centralized logging, alert design, capacity/throughput modeling, performance tuning.
- Data platform experience with at least one major cloud (AWS/Azure/Google Cloud Platform) and systems like Snowflake/BigQuery/Redshift, Databricks/Spark, EMR/Dataproc; strong grasp of IAM, VPC networking, and storage (S3/GCS/ADLS).
- Security/compliance: SSO/OIDC, RBAC, secrets management (Vault/Secrets Manager), auditing, least-privilege connection management, and change control.
- Proven incident leadership, runbook creation, and platform roadmap execution; excellent cross-functional communication.
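A short sketch of dynamic task mapping as referenced above, assuming Airflow 2.4+ and Python 3.9+; the DAG id and file names are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task

    with DAG(
        dag_id="mapped_files",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ):
        @task
        def list_files() -> list[str]:
            # In practice this might list objects under an S3/GCS prefix.
            return ["a.csv", "b.csv", "c.csv"]

        @task
        def process(path: str) -> None:
            # One mapped task instance per file; each retries independently.
            print(f"processing {path}")

        # expand() fans the task out at runtime to match however many files exist.
        process.expand(path=list_files())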
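OPA/Gatekeeper/Conftest gate manifests in CI; inside Airflow itself, cluster policies can enforce the same standards at parse time. A minimal sketch, assuming it lives in airflow_local_settings.py; the tag requirement and retry cap are invented house rules:

    from airflow.exceptions import AirflowClusterPolicyViolation
    from airflow.models import DAG
    from airflow.models.baseoperator import BaseOperator

    def dag_policy(dag: DAG) -> None:
        # Reject untagged DAGs so ownership is always traceable.
        if not dag.tags:
            raise AirflowClusterPolicyViolation(f"{dag.dag_id} has no tags")

    def task_policy(task: BaseOperator) -> None:
        # Cap runaway retries platform-wide, whatever the DAG author set.
        if task.retries and task.retries > 10:
            task.retries = 10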
Nice to have
- Experience operating alternative orchestrators (Prefect 2.x, Dagster, Argo Workflows, AWS Step Functions) and leading migrations to/from Airflow.
- OpenLineage/Marquez adoption; Great Expectations or other data quality frameworks; data contracts.
- dbt Core/Cloud orchestration patterns (state management, artifacts, slim CI).
- Cost optimization and capacity planning for schedulers and workers; spot instance strategies.
- Multi-region HA/DR for Airflow metadata DB; backup/restore and disaster drills.
- Building internal developer platforms/portals (e.g., Backstage) for self-service pipelines.
- Contributions to Apache Airflow or provider packages; familiarity with recent AIPs/Airflow 2.7+ features.