SRE Kafka Lead

Orlando, FLORIDA, US • Posted 5 hours ago • Updated 1 hour ago

Contract W2

On-site

DOE

Job Details

Skills

DevOps
Service Level
High Availability
Disaster Recovery
Failover
Security Controls
Access Control
Regulatory Compliance
Standard Operating Procedure
Knowledge Transfer
Collaboration
Roadmaps
Continuous Improvement
Scalability
Operational Excellence
IT Management
Performance Tuning
Replication
Optimization
Authentication
Authorization
Encryption
Scripting
Python
Bash
Terraform
Continuous Integration
Continuous Delivery
Reliability Engineering
Management
Incident Management
Root Cause Analysis
Computer Networking
DNS
Load Balancing
Linux Administration
Systems Architecture
Communication
Documentation
Leadership
Stakeholder Management
Supervision
Dynatrace
Mentorship
Cloud Computing
Streaming
Apache Kafka
Budget

Summary

Job Summary

The SRE (Kafka Lead) is responsible for leading the architecture, engineering, reliability, automation, security, and operational excellence of enterprise Kafka platforms. This role serves as the technical leader for Kafka and distributed streaming solutions, driving platform design, operational readiness, observability, automation, and reliability engineering initiatives. The ideal candidate combines deep Kafka expertise, Site Reliability Engineering (SRE) practices, DevOps automation, and strong leadership capabilities to deliver highly available, secure, and scalable streaming platforms.

Key Responsibilities

Serve as the technical lead for Kafka platform architecture, engineering, and operational strategy.
Define and implement Kafka architectures aligned with industry best practices for distributed systems and event-driven platforms.
Provide technical leadership and direction to engineering teams, operations teams, and managed service providers.
Identify technical risks, architectural gaps, scalability concerns, and improvement opportunities.
Own platform outcomes and ensure successful delivery of reliability, scalability, and operational objectives.
Design, deploy, configure, scale, and optimize Kafka clusters in production environments.
Design and implement topic strategies, partitioning models, replication configurations, and consumer/producer optimization techniques.
Configure and support Kafka ecosystem components including Kafka Connect, Kafka Streams, Schema Registry, and related technologies.
Troubleshoot and resolve complex production incidents affecting Kafka platforms and distributed systems.
Develop and maintain automation solutions using scripting languages such as Python and Bash.
Implement Infrastructure-as-Code solutions using Terraform and similar technologies.
Design, develop, and support CI/CD pipelines for Kafka platform deployments and operational processes.
Eliminate manual operational dependencies through automation and self-service capabilities.
Design and implement end-to-end observability solutions including metrics, logging, tracing, and monitoring.
Establish Kafka-specific monitoring for consumer lag, throughput, broker health, cluster performance, and system availability.
Integrate Kafka monitoring and observability solutions with enterprise monitoring platforms.
Define and implement alerting strategies aligned to SLAs, SLOs, and operational requirements.
Establish and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Design high-availability, resiliency, disaster recovery, and failover strategies for Kafka platforms.
Lead incident management, troubleshooting, post-incident reviews, and Root Cause Analysis (RCA) activities.
Implement and enforce enterprise security controls including encryption, authentication, authorization, access controls, and auditability.
Ensure compliance with organizational security policies and governance standards.
Develop operational documentation, runbooks, standard operating procedures, and knowledge transfer materials.
Facilitate knowledge transfer and operational readiness activities for internal engineering and support teams.
Collaborate with leadership, engineering teams, and stakeholders to communicate platform status, risks, roadmaps, and strategic recommendations.
Participate in critical troubleshooting sessions and ensure responsiveness during production incidents.
Drive continuous improvement initiatives focused on platform reliability, scalability, automation, and operational excellence.

Required Qualifications

Strong experience serving as a technical lead for Kafka, distributed systems, or streaming platform environments.
Deep hands-on experience with Apache Kafka administration, architecture, and operations.
Experience with Kafka cluster setup, scaling, tuning, and performance optimization.
Strong knowledge of topic design, partitioning strategies, replication, and producer/consumer optimization.
Experience with Kafka security including ACLs, authentication, authorization, and encryption.
Experience working with Kafka ecosystem technologies such as Kafka Connect, Kafka Streams, and Schema Registry.
Ability to independently diagnose and resolve complex production Kafka issues.
Strong scripting experience using Python, Bash, or equivalent technologies.
Experience implementing Infrastructure-as-Code using Terraform or similar tools.
Experience building and supporting CI/CD pipelines and automation frameworks.
Strong knowledge of observability, monitoring, logging, and distributed tracing solutions.
Experience implementing monitoring strategies for Kafka platforms and distributed systems.
Strong understanding of Site Reliability Engineering (SRE) principles and practices.
Experience defining and managing SLIs, SLOs, and error budgets.
Experience leading incident response, root cause analysis, and reliability improvement initiatives.
Strong understanding of networking fundamentals including DNS, ports, latency, and load balancing.
Strong Linux administration and troubleshooting experience.
Knowledge of security principles, distributed systems architecture, and enterprise platform operations.
Excellent communication, documentation, leadership, and stakeholder management skills.
Ability to work independently and drive technical initiatives with minimal supervision.

Preferred Qualifications

Experience integrating Kafka platforms with enterprise observability tools such as Dynatrace.
Experience supporting large-scale enterprise streaming and event-driven architectures.
Experience mentoring engineering teams and leading operational readiness initiatives.
Experience working with managed service providers and cross-functional infrastructure teams.
Experience designing highly available and resilient cloud-based data streaming platforms.

Must-Have Skills

Apache Kafka
SRE Discipline (SLI / SLO / Error Budgets)
Automation & Infrastructure as Code
Observability & Monitoring Excellence

Education: Bachelor's Degree

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: compun
Position Id: PATDC5832690


More jobs at Compunnel Inc. in Orlando, FLORIDA