Position: Kafka Operations Administrator
Location: Seattle, WA / St. Louis, MO / Plano, TX
Duration: Full-time
Job Description:
Must-Have Technical/Functional Skills:
- Production-grade Apache Kafka operations experience: managing, maintaining, and upgrading Kafka clusters in production environments with a focus on high availability, disaster recovery, failover, and overall reliability
- Proficiency in installing and configuring monitoring systems using Grafana (building dashboards), Prometheus, Splunk, and JMX metrics.
- Automation and orchestration experience: Terraform, Ansible, Helm, Kubernetes (EKS/AKS/GKE).
- Strong Linux system administration experience, including troubleshooting, automation and scripting for efficient infrastructure management.
- Experience in production support following ITIL processes, including participating in 24x7 on-call rotations and documenting incidents and postmortems.
- Experience supporting JVM tuning and analysis, and network and disk I/O diagnostics
- Experience with TCP/IP, routing, switching, and firewall configurations relevant to Kafka operations
Good to Have:
- Deep Kafka performance tuning and capacity planning experience
- Knowledge of message delivery semantics and guarantees (at-least-once, exactly-once)
- Cloud-native security/compliance experience (IAM, VPC, KMS, Security Groups)
- Certifications: Confluent Certified Administrator, AWS/Azure/Google Cloud Platform certifications
- Experience with Apache Kafka in KRaft mode, including setup, configuration, troubleshooting, and cluster management
- Containerization and Container Orchestration Tools experience: Docker, Kubernetes
- Experience with CI/CD pipelines and Git-based workflows
- Experience building custom Kafka Connect libraries and understanding of data serialization formats (e.g., Avro, JSON)
- Knowledge of networking concepts across on-prem VMs and cloud environments, ensuring seamless integration and communication between services.
- Strong understanding of topic management and security best practices for streaming platforms: TLS, ACLs, RBAC, encryption at rest/in transit
- Kafka ecosystem tooling experience: Kafka Connect, Schema Registry
Role and Responsibilities:
- Deploy, configure, and manage Kafka clusters and related services to meet SLA requirements
- Participate in 24x7 on-call rotation to respond to incidents, alerts, and escalations
- Triage, diagnose, and remediate production incidents; coordinate with stakeholders, developers and infrastructure teams
- Implement automation for provisioning, scaling, server/data backups, and disaster recovery
- Maintain monitoring, alerting thresholds, dashboards, and Kafka ecosystem health
- Harden Kafka deployments: configure TLS, ACLs, RBAC, and encryption, and remediate vulnerabilities
- Perform routine maintenance: Kafka ecosystem upgrades (controllers, brokers, Kafka Connect, and Schema Registry), rolling restarts, etc.
- Create and maintain runbooks, runbook automation, and post-incident reports
- Optimize performance and resource utilization; benchmark and tune clusters
- Support Kafka Connect and Schema Registry services and troubleshoot connector issues
- Contribute to CI/CD pipeline improvements for infrastructure and deployment automation
Tekshapers is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information, or any characteristic protected by law.