SRE Kafka Lead

Orlando, Florida, US • Posted 5 hours ago • Updated 1 hour ago
Contract W2
On-site
DOE
Fitment

Job Details

Skills

  • DevOps
  • Service Level
  • High Availability
  • Disaster Recovery
  • Failover
  • Security Controls
  • Access Control
  • Regulatory Compliance
  • Standard Operating Procedure
  • Knowledge Transfer
  • Collaboration
  • Roadmaps
  • Continuous Improvement
  • Scalability
  • Operational Excellence
  • IT Management
  • Performance Tuning
  • Replication
  • Optimization
  • Authentication
  • Authorization
  • Encryption
  • Scripting
  • Python
  • Bash
  • Terraform
  • Continuous Integration
  • Continuous Delivery
  • Reliability Engineering
  • Management
  • Incident Management
  • Root Cause Analysis
  • Computer Networking
  • DNS
  • Load Balancing
  • Linux Administration
  • Systems Architecture
  • Communication
  • Documentation
  • Leadership
  • Stakeholder Management
  • Supervision
  • Dynatrace
  • Mentorship
  • Cloud Computing
  • Streaming
  • Apache Kafka
  • Budget

Summary

Job Summary

The SRE (Kafka Lead) is responsible for leading the architecture, engineering, reliability, automation, security, and operational excellence of enterprise Kafka platforms. This role serves as the technical leader for Kafka and distributed streaming solutions, driving platform design, operational readiness, observability, automation, and reliability engineering initiatives. The ideal candidate combines deep Kafka expertise, Site Reliability Engineering (SRE) practices, DevOps automation, and strong leadership capabilities to deliver highly available, secure, and scalable streaming platforms.

Key Responsibilities

  • Serve as the technical lead for Kafka platform architecture, engineering, and operational strategy.
  • Define and implement Kafka architectures aligned with industry best practices for distributed systems and event-driven platforms.
  • Provide technical leadership and direction to engineering teams, operations teams, and managed service providers.
  • Identify technical risks, architectural gaps, scalability concerns, and improvement opportunities.
  • Own platform outcomes and ensure successful delivery of reliability, scalability, and operational objectives.
  • Design, deploy, configure, scale, and optimize Kafka clusters in production environments.
  • Design and implement topic strategies, partitioning models, replication configurations, and consumer/producer optimization techniques.
  • Configure and support Kafka ecosystem components including Kafka Connect, Kafka Streams, Schema Registry, and related technologies.
  • Troubleshoot and resolve complex production incidents affecting Kafka platforms and distributed systems.
  • Develop and maintain automation solutions using scripting languages such as Python and Bash.
  • Implement Infrastructure-as-Code solutions using Terraform and similar technologies.
  • Design, develop, and support CI/CD pipelines for Kafka platform deployments and operational processes.
  • Eliminate manual operational dependencies through automation and self-service capabilities.
  • Design and implement end-to-end observability solutions including metrics, logging, tracing, and monitoring.
  • Establish Kafka-specific monitoring for consumer lag, throughput, broker health, cluster performance, and system availability.
  • Integrate Kafka monitoring and observability solutions with enterprise monitoring platforms.
  • Define and implement alerting strategies aligned to SLAs, SLOs, and operational requirements.
  • Establish and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  • Design high-availability, resiliency, disaster recovery, and failover strategies for Kafka platforms.
  • Lead incident management, troubleshooting, post-incident reviews, and Root Cause Analysis (RCA) activities.
  • Implement and enforce enterprise security controls including encryption, authentication, authorization, access controls, and auditability.
  • Ensure compliance with organizational security policies and governance standards.
  • Develop operational documentation, runbooks, standard operating procedures, and knowledge transfer materials.
  • Facilitate knowledge transfer and operational readiness activities for internal engineering and support teams.
  • Collaborate with leadership, engineering teams, and stakeholders to communicate platform status, risks, roadmaps, and strategic recommendations.
  • Participate in critical troubleshooting sessions and ensure responsiveness during production incidents.
  • Drive continuous improvement initiatives focused on platform reliability, scalability, automation, and operational excellence.

Required Qualifications

  • Strong experience serving as a technical lead for Kafka, distributed systems, or streaming platform environments.
  • Deep hands-on experience with Apache Kafka administration, architecture, and operations.
  • Experience with Kafka cluster setup, scaling, tuning, and performance optimization.
  • Strong knowledge of topic design, partitioning strategies, replication, and producer/consumer optimization.
  • Experience with Kafka security including ACLs, authentication, authorization, and encryption.
  • Experience working with Kafka ecosystem technologies such as Kafka Connect, Kafka Streams, and Schema Registry.
  • Ability to independently diagnose and resolve complex production Kafka issues.
  • Strong scripting experience using Python, Bash, or equivalent technologies.
  • Experience implementing Infrastructure-as-Code using Terraform or similar tools.
  • Experience building and supporting CI/CD pipelines and automation frameworks.
  • Strong knowledge of observability, monitoring, logging, and distributed tracing solutions.
  • Experience implementing monitoring strategies for Kafka platforms and distributed systems.
  • Strong understanding of Site Reliability Engineering (SRE) principles and practices.
  • Experience defining and managing SLIs, SLOs, and error budgets.
  • Experience leading incident response, root cause analysis, and reliability improvement initiatives.
  • Strong understanding of networking fundamentals including DNS, ports, latency, and load balancing.
  • Strong Linux administration and troubleshooting experience.
  • Knowledge of security principles, distributed systems architecture, and enterprise platform operations.
  • Excellent communication, documentation, leadership, and stakeholder management skills.
  • Ability to work independently and drive technical initiatives with minimal supervision.

Preferred Qualifications

  • Experience integrating Kafka platforms with enterprise observability tools such as Dynatrace.
  • Experience supporting large-scale enterprise streaming and event-driven architectures.
  • Experience mentoring engineering teams and leading operational readiness initiatives.
  • Experience working with managed service providers and cross-functional infrastructure teams.
  • Experience designing highly available and resilient cloud-based data streaming platforms.

Must-Have Skills

  • Apache Kafka
  • SRE Discipline (SLI / SLO / Error Budgets)
  • Automation & Infrastructure as Code
  • Observability & Monitoring Excellence

Education: Bachelor's Degree
  • Dice Id: compun
  • Position Id: PATDC5832690