Job Details
We are seeking an experienced, hands-on Databricks Platform Administrator to lead the operational management, governance, and resilience of our Databricks Lakehouse environment. This role blends platform architecture with automation, monitoring, and support responsibilities. You will ensure that the Databricks platform, including MLflow, MLOps pipelines, Mosaic AI, and other critical capabilities, is stable, secure, scalable, cost-effective, and resilient. The ideal candidate is an expert in operating complex Databricks environments, with a strong focus on disaster recovery, high availability, and ML/AI platform readiness.
What You'll Do
- Own Databricks Platform Operations: Act as the primary administrator for Databricks workspaces, managing user provisioning, cluster governance, workspace configuration, job orchestration, and usage policies.
- Administer AI/ML Capabilities: Support and maintain the operational use of MLflow, MLOps pipelines, and Mosaic AI, ensuring enterprise-grade readiness for AI/ML experimentation, deployment, and observability.
- Ensure Resilience: Design, implement, and validate disaster recovery and high availability strategies for the Databricks platform, including multi-region backups, failover planning, and infrastructure redundancy.
- Automate Infrastructure: Use Terraform and Python to fully automate platform provisioning, updates, and decommissioning, ensuring repeatability, compliance, and configuration consistency (see the sketch after this list).
- Govern Access and Security: Manage enterprise-grade access control through Unity Catalog, SCIM-based identity management, and robust workspace isolation, including audit and compliance readiness.
- Monitor and Optimize Usage: Oversee platform performance and cost, enforce cluster policies, optimize job and resource usage, and implement observability pipelines for operational insight.
- Standardize Platform Practices: Establish and enforce reusable patterns, operational runbooks, cluster templates, ML model lifecycle standards, and AI agent deployment policies.
- Support and Enable Users: Serve as a trusted partner to data engineering, data science, and analytics teams, offering hands-on operational support and platform onboarding.
- Coordinate Feature Rollouts: Lead rollout and adoption of new features (e.g., Mosaic AI, Unity Catalog, Delta Live Tables, MLflow integrations), including documentation, testing, and change control.
- Train and Evangelize: Create and deliver training to promote responsible and efficient platform use, with a focus on reliability, automation, and AI/ML lifecycle operations.
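To make the automation expectation concrete, here is a minimal sketch, assuming the Databricks SDK for Python, of the kind of provisioning step referenced above: it creates a team group and a baseline cluster policy. The group name, policy name, and policy limits are hypothetical; in practice these resources would typically be declared in Terraform (using the Databricks provider) and applied through CI rather than run ad hoc.

```python
# Minimal sketch: scripted workspace provisioning with the Databricks SDK for
# Python. All names and limits below are hypothetical placeholders.
import json

from databricks.sdk import WorkspaceClient

# Authenticates from the environment (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN)
# or a local configuration profile.
w = WorkspaceClient()

# Provision a workspace group for a new data engineering team.
group = w.groups.create(display_name="data-eng-platform-users")

# Baseline cluster policy: cap autoscaling and enforce auto-termination.
policy_definition = {
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
    "autotermination_minutes": {"type": "fixed", "value": 60},
}
policy = w.cluster_policies.create(
    name="baseline-team-policy",
    definition=json.dumps(policy_definition),
)

print(f"Created group {group.id} and cluster policy {policy.policy_id}")
```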
Core Qualifications:
- 10+ years in cloud infrastructure, platform operations, or data platform administration roles.
- Proven track record managing Databricks or cloud data platforms at scale, including security, cost governance, resilience, and AI/ML enablement.
- Experience administering or architecting a data lake or data lakehouse environment.
- Strong cross-functional communication and collaboration skills.
- A mindset focused on platform stability, automation, disaster recovery, and enablement of data and ML workflows.
Technical Expertise:
Databricks Platform Operations:
- Unity Catalog: governance, access control, and lineage (see the grant sketch after this list)
- MLflow: model tracking, registry, and lifecycle management
- Mosaic AI: AI agent orchestration and observability
- Delta Live Tables: operational pipeline orchestration
- Workspace Management: multi-tenant configurations and role isolation
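As a small illustration of the governance work this implies, the snippet below grants a group read access to a schema and reviews the resulting grants. It assumes a Databricks notebook context (where `spark` and `display` are predefined), and the catalog, schema, and group names are hypothetical; the same grants can also be managed through Terraform or Catalog Explorer.

```python
# Hypothetical catalog, schema, and group names; run from a Databricks
# notebook where `spark` and `display` are available.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA main.sales TO `data-analysts`")

# Review effective grants for audit and compliance checks.
display(spark.sql("SHOW GRANTS ON SCHEMA main.sales"))
```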
Disaster Recovery & High Availability:
- Design and maintenance of DR plans, multi-region backups, failover testing, and HA architecture to ensure business continuity
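On the backup side, one hedged sketch of what a recovery copy can look like: the statement below deep-clones a critical Delta table into a catalog assumed to be bound to secondary-region storage (both names are hypothetical). A complete DR plan also covers workspace objects, jobs, and configuration, typically recreated from Terraform and Git rather than backed up in place.

```python
# Hypothetical names: `dr_catalog` is assumed to sit on secondary-region
# storage. DEEP CLONE copies data and metadata; re-running the statement
# refreshes the copy. Run from a Databricks notebook.
spark.sql("""
    CREATE OR REPLACE TABLE dr_catalog.sales.orders
    DEEP CLONE main.sales.orders
""")
```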
Infrastructure as Code & Automation:
- Terraform + Python automation for all provisioning and lifecycle changes
Security & Compliance:
- Role-based access, SCIM provisioning, audit logging, and data governance enforcement
Cost and Performance Optimization:
- Cluster policy tuning, tagging, monitoring, and platform usage analytics
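A minimal sketch of the tagging side, assuming the Databricks SDK for Python: the script below flags clusters that are missing a required cost-allocation tag. The tag key is a hypothetical chargeback convention.

```python
# Report clusters missing a required cost-allocation tag. The tag key
# "cost-center" is a hypothetical chargeback convention.
from databricks.sdk import WorkspaceClient

REQUIRED_TAG = "cost-center"

w = WorkspaceClient()

untagged = [
    c.cluster_name or c.cluster_id
    for c in w.clusters.list()
    if REQUIRED_TAG not in (c.custom_tags or {})
]

if untagged:
    print(f"Clusters missing '{REQUIRED_TAG}': {', '.join(untagged)}")
else:
    print("All clusters carry the required cost-allocation tag.")
```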
CI/CD for Data & ML:
- Automated deployment pipelines using GitHub Actions or equivalent
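As one hedged example of such a pipeline step, the script below could run inside a GitHub Actions job (with workspace credentials supplied as secrets) to trigger an existing smoke-test job via the Databricks SDK for Python and fail the build if the run does not succeed. The job ID is a hypothetical placeholder.

```python
# CI step sketch: trigger a Databricks smoke-test job and fail the pipeline
# if it does not succeed. The job ID is a hypothetical placeholder; credentials
# come from DATABRICKS_HOST / DATABRICKS_TOKEN in the CI environment.
import sys

from databricks.sdk import WorkspaceClient

SMOKE_TEST_JOB_ID = 123456789  # hypothetical job that validates a deployment

w = WorkspaceClient()

# run_now() returns a waiter; result() blocks until the run reaches a terminal state.
run = w.jobs.run_now(job_id=SMOKE_TEST_JOB_ID).result()

result_state = run.state.result_state if run.state else None
if result_state is None or result_state.value != "SUCCESS":
    message = run.state.state_message if run.state else "no run state returned"
    sys.exit(f"Smoke-test job did not succeed: {message}")

print("Smoke-test job succeeded.")
```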
Cloud & Integration:
- AWS core services (S3, IAM, networking) and integrations with Databricks
Observability:
- Health monitoring, alerting, logging, and platform metrics dashboards
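A minimal sketch of the alerting side, assuming the Databricks SDK for Python: the loop below surfaces recently completed job runs that did not succeed so they can be forwarded to an alert channel. The lookback size and the alert hook are placeholders.

```python
# Surface recently completed job runs that did not succeed. The lookback size
# and the alert hook (print) are placeholders for a real alerting integration.
from itertools import islice

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Inspect the 50 most recently completed runs across all jobs.
for run in islice(w.jobs.list_runs(completed_only=True), 50):
    state = run.state
    if state and state.result_state and state.result_state.value != "SUCCESS":
        # Replace with a real alert hook (Slack webhook, PagerDuty, etc.).
        print(f"ALERT: run {run.run_id} finished with state {state.result_state.value}")
```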