Job Details
We are seeking an experienced, hands-on Databricks Platform Administrator to lead the operational management, governance, and resilience of our Databricks Lakehouse environment. This role blends platform architecture with automation, monitoring, and support responsibilities. You will ensure that the Databricks platform, including MLflow, MLOps pipelines, Mosaic AI, and other critical capabilities, is stable, secure, scalable, cost-effective, and resilient. The ideal candidate is an expert in operating complex Databricks environments, with a strong focus on disaster recovery, high availability, and ML/AI platform readiness.
What You'll Do
- Own Databricks Platform Operations: Act as the primary administrator for Databricks workspaces, managing user provisioning, cluster governance, workspace configuration, job orchestration, and usage policies.
- Administer AI/ML Capabilities: Support and maintain the operational use of MLflow, MLOps pipelines, and Mosaic AI, ensuring enterprise-grade readiness for AI/ML experimentation, deployment, and observability.
- Ensure Resilience: Design, implement, and validate disaster recovery and high availability strategies for the Databricks platform, including multi-region backups, failover planning, and infrastructure redundancy.
- Automate Infrastructure: Use Terraform and Python to fully automate platform provisioning, updates, and decommissioning, ensuring repeatability, compliance, and configuration consistency (see the sketch after this list).
- Govern Access and Security: Manage enterprise-grade access control through Unity Catalog, SCIM-based identity management, and robust workspace isolation, including audit and compliance readiness.
- Monitor and Optimize Usage: Oversee platform performance and cost, enforce cluster policies, optimize job and resource usage, and implement observability pipelines for operational insight.
- Standardize Platform Practices: Establish and enforce reusable patterns, operational runbooks, cluster templates, ML model lifecycle standards, and AI agent deployment policies.
- Support and Enable Users: Serve as a trusted partner to data engineering, data science, and analytics teams, offering hands-on operational support and platform onboarding.
- Coordinate Feature Rollouts: Lead rollout and adoption of new features (e.g., Mosaic AI, Unity Catalog, Delta Live Tables, MLflow integrations), including documentation, testing, and change control.
- Train and Evangelize: Create and deliver training to promote responsible and efficient platform use, with a focus on reliability, automation, and AI/ML lifecycle operations.
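To make the automation expectation concrete, here is a minimal sketch, assuming the Databricks SDK for Python, of the kind of provisioning step referenced above: it creates a team group and a baseline cluster policy. The group name, policy name, and policy limits are hypothetical; in practice these resources would typically be declared in Terraform (using the Databricks provider) and applied through CI rather than run ad hoc.

```python
# Minimal sketch: scripted workspace provisioning with the Databricks SDK for
# Python. All names and limits below are hypothetical placeholders.
import json

from databricks.sdk import WorkspaceClient

# Authenticates from the environment (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN)
# or a local configuration profile.
w = WorkspaceClient()

# Provision a workspace group for a new data engineering team.
group = w.groups.create(display_name="data-eng-platform-users")

# Baseline cluster policy: cap autoscaling and enforce auto-termination.
policy_definition = {
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
    "autotermination_minutes": {"type": "fixed", "value": 60},
}
policy = w.cluster_policies.create(
    name="baseline-team-policy",
    definition=json.dumps(policy_definition),
)

print(f"Created group {group.id} and cluster policy {policy.policy_id}")
```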
Core Qualifications:
- 10+ years in cloud infrastructure, platform operations, or data platform administration roles.
- Proven track record managing Databricks or cloud data platforms at scale, including security, cost governance, resilience, and AI/ML enablement.
- Experience administering or architecting a data lake or data lakehouse environment.
- Strong cross-functional communication and collaboration skills.
- A mindset focused on platform stability, automation, disaster recovery, and enablement of data and ML workflows.
Technical Expertise:
Databricks Platform Operations:
- Unity Catalog: governance, access control, and lineage (see the grant sketch after this list)
- MLflow: model tracking, registry, and lifecycle management
- Mosaic AI: AI agent orchestration and observability
- Delta Live Tables: operational pipeline orchestration
- Workspace Management: multi-tenant configurations and role isolation
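As a small illustration of the governance work this implies, the snippet below grants a group read access to a schema and reviews the resulting grants. It assumes a Databricks notebook context (where `spark` and `display` are predefined), and the catalog, schema, and group names are hypothetical; the same grants can also be managed through Terraform or Catalog Explorer.

```python
# Hypothetical catalog, schema, and group names; run from a Databricks
# notebook where `spark` and `display` are available.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA main.sales TO `data-analysts`")

# Review effective grants for audit and compliance checks.
display(spark.sql("SHOW GRANTS ON SCHEMA main.sales"))
```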
Disaster Recovery & High Availability:
- Design and maintenance of DR plans, multi-region backups, failover testing, and HA architecture to ensure business continuity
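On the backup side, one hedged sketch of what a recovery copy can look like: the statement below deep-clones a critical Delta table into a catalog assumed to be bound to secondary-region storage (both names are hypothetical). A complete DR plan also covers workspace objects, jobs, and configuration, typically recreated from Terraform and Git rather than backed up in place.

```python
# Hypothetical names: `dr_catalog` is assumed to sit on secondary-region
# storage. DEEP CLONE copies data and metadata; re-running the statement
# refreshes the copy. Run from a Databricks notebook.
spark.sql("""
    CREATE OR REPLACE TABLE dr_catalog.sales.orders
    DEEP CLONE main.sales.orders
""")
```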
Infrastructure as Code & Automation:
- Terraform + Python automation for all provisioning and lifecycle changes
Security & Compliance:
- Role-based access, SCIM provisioning, audit logging, and data governance enforcement
Cost and Performance Optimization:
- Cluster policy tuning, tagging, monitoring, and platform usage analytics
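A minimal sketch of the tagging side, assuming the Databricks SDK for Python: the script below flags clusters that are missing a required cost-allocation tag. The tag key is a hypothetical chargeback convention.

```python
# Report clusters missing a required cost-allocation tag. The tag key
# "cost-center" is a hypothetical chargeback convention.
from databricks.sdk import WorkspaceClient

REQUIRED_TAG = "cost-center"

w = WorkspaceClient()

untagged = [
    c.cluster_name or c.cluster_id
    for c in w.clusters.list()
    if REQUIRED_TAG not in (c.custom_tags or {})
]

if untagged:
    print(f"Clusters missing '{REQUIRED_TAG}': {', '.join(untagged)}")
else:
    print("All clusters carry the required cost-allocation tag.")
```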
CI/CD for Data & ML:
- Automated deployment pipelines using GitHub Actions or equivalent
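As one hedged example of such a pipeline step, the script below could run inside a GitHub Actions job (with workspace credentials supplied as secrets) to trigger an existing smoke-test job via the Databricks SDK for Python and fail the build if the run does not succeed. The job ID is a hypothetical placeholder.

```python
# CI step sketch: trigger a Databricks smoke-test job and fail the pipeline
# if it does not succeed. The job ID is a hypothetical placeholder; credentials
# come from DATABRICKS_HOST / DATABRICKS_TOKEN in the CI environment.
import sys

from databricks.sdk import WorkspaceClient

SMOKE_TEST_JOB_ID = 123456789  # hypothetical job that validates a deployment

w = WorkspaceClient()

# run_now() returns a waiter; result() blocks until the run reaches a terminal state.
run = w.jobs.run_now(job_id=SMOKE_TEST_JOB_ID).result()

result_state = run.state.result_state if run.state else None
if result_state is None or result_state.value != "SUCCESS":
    message = run.state.state_message if run.state else "no run state returned"
    sys.exit(f"Smoke-test job did not succeed: {message}")

print("Smoke-test job succeeded.")
```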
Cloud & Integration:
- AWS core services (S3, IAM, networking) and integrations with Databricks
Observability:
- Health monitoring, alerting, logging, and platform metrics dashboards
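A minimal sketch of the alerting side, assuming the Databricks SDK for Python: the loop below surfaces recently completed job runs that did not succeed so they can be forwarded to an alert channel. The lookback size and the alert hook are placeholders.

```python
# Surface recently completed job runs that did not succeed. The lookback size
# and the alert hook (print) are placeholders for a real alerting integration.
from itertools import islice

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Inspect the 50 most recently completed runs across all jobs.
for run in islice(w.jobs.list_runs(completed_only=True), 50):
    state = run.state
    if state and state.result_state and state.result_state.value != "SUCCESS":
        # Replace with a real alert hook (Slack webhook, PagerDuty, etc.).
        print(f"ALERT: run {run.run_id} finished with state {state.result_state.value}")
```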