Locations: Seattle, WA / St. Louis, MO / TX / any nearby client location
Must Have Technical/Functional Skills
• Production-grade Apache Kafka operations experience: managing, maintaining and upgrading Kafka clusters in production environments with a focus on high availability, disaster recovery, failover and overall reliability
• Proficiency in installing and configuring monitoring systems using Grafana (building dashboards), Prometheus, Splunk and JMX metrics.
• Automation and orchestration experience: Terraform, Ansible, Helm, Kubernetes (EKS/AKS/GKE).
• Strong Linux system administration experience, including troubleshooting, automation and scripting for efficient infrastructure management.
• Experience in production support (following ITIL processes), participating in 24x7 on-call rotations and documenting incidents/postmortems.
• Experience supporting JVM tuning and analysis, and network and disk I/O diagnostics
• Experience with TCP/IP, routing, switching and firewall configurations relevant to Kafka operations
Good to Have:
• Deep Kafka performance tuning and capacity planning experience
• Knowledge of message delivery semantics and guarantees (at-least-once, exactly-once)
• Cloud-native security/compliance experience (IAM, VPC, KMS, Security Groups)
• Certifications: Confluent Certified Administrator, AWS/Azure/Google Cloud Platform certifications
• Experience with Apache Kafka in KRaft mode, including set up, configuration, troubleshooting and cluster management
• Containerization and Container Orchestration Tools experience: Docker, Kubernetes
• Experience with CI/CD pipelines and Git-based workflows
• Experience building custom Kafka Connect libraries and understanding of data serialization formats (e.g., Avro, JSON)
• Knowledge of networking concepts across on-prem VMs and cloud environments, ensuring seamless integration and communication between services.
• Strong understanding of topic management and security best practices for streaming platforms: TLS, ACLs, RBAC, encryption at rest/in transit
• Kafka ecosystem tooling experience: Kafka Connect, Schema Registry
Role and Responsibilities
• Deploy, configure and manage Kafka clusters and related services to meet SLA requirements
• Participate in 24x7 on-call rotation to respond to incidents, alerts, and escalations
• Triage, diagnose, and remediate production incidents; coordinate with stakeholders, developers and infrastructure teams
• Implement automation for provisioning, scaling, server/data backups, and disaster recovery
• Maintain monitoring, alerting thresholds, dashboards, and Kafka ecosystem health
• Harden Kafka deployments: configure TLS, ACLs, RBAC and encryption, and remediate vulnerabilities
• Perform routine maintenance: Kafka ecosystem upgrades (controllers, brokers, Connect, and Schema Registry), rolling restarts, etc.
• Create and maintain runbooks, runbook automation, and post-incident reports
• Optimize performance and resource utilization; benchmark and tune clusters
• Support Kafka Connect/Schema Registry services and troubleshoot connector issues
• Contribute to CI/CD pipeline improvements for infrastructure and deployment automation