Principal Kafka Support & Reliability Engineer

Canton, MA, US • Posted 6 days ago • Updated 12 minutes ago
Full Time
Part Time
On-site

Job Details

Skills

  • Mergers and Acquisitions
  • Incident Management
  • ISR
  • Scalability
  • ROOT
  • IO
  • Network
  • Replication
  • Onboarding
  • Apache ZooKeeper
  • Recovery
  • Failover
  • Testing
  • Documentation
  • Corrective And Preventive Action
  • Continuous Improvement
  • Knowledge Base
  • Durable Skills
  • Tier 3
  • Performance Engineering
  • Accountability
  • Root Cause Analysis
  • Tier 2
  • Cloud Computing
  • Streaming
  • Backbone.js
  • SANS
  • Apache Kafka
  • Amazon Web Services
  • Kubernetes

Summary

Role: Principal Kafka Support & Reliability Engineer

Location: Canton, MA

Role Descriptions:

1. Tier 3 Incident Management & Escalation Support

Act as the highest technical escalation point for Kafka production incidents (Sev 1/Sev 2).

Lead deep troubleshooting across:

  • Broker instability, controller elections, ISR shrinkage
  • Under-replicated partitions and leader imbalance
  • Producer/consumer failures, lag spikes, and rebalance storms
  • Disk, network, JVM, and request handler saturation

Provide hands-on remediation for complex issues, including:

  • Partition reassignment and leader rebalance
  • Broker configuration tuning
  • Throttle/quota strategies for noisy producers or consumers

Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.

Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization

Analyze Kafka workloads for performance and scalability risks:

  • Partition skew and hot partitions
  • Inefficient producer batching/compression
  • Consumer lag root cause analysis
  • Thread pool, IO, and network bottlenecks

Recommend and validate:

  • Topic design (partition count, replication factor, retention, compaction)
  • Producer and consumer configuration best practices
  • Quotas, quota enforcement, and multi-tenant controls

Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience

Diagnose and resolve systemic Kafka stability issues:

  • Repeated broker failures or flapping
  • Metadata/controller instability (ZooKeeper or KRaft)
  • Recovery issues following failovers or maintenance events

Support resilience initiatives:

  • Multi-AZ cluster health validation
  • Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
  • Failover testing and validation

Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership

Lead medium- to high-risk Kafka changes, including:

  • Broker and cluster configuration changes
  • Partition expansion or large-scale reassignment
  • Topic policy changes impacting durability or performance

Support and plan Kafka version upgrades:

  • MSK/Confluent upgrade cycles
  • Client compatibility and rollout strategies

Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement

Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).

Identify recurring failure patterns and architectural gaps.

Recommend platform-level improvements:

  • Automation opportunities
  • Guardrails and standards
  • Monitoring and alerting enhancements

Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.

Essential Skills:

Role Overview: The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments. This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK/Confluent Cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.

Desirable Skills:

Keyword:

Skills: Digital: Kafka, Digital: Amazon Connect, Digital: Kubernetes

Experience Required: 10 years & above

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91018020
  • Position Id: PDT - 10829-11964-1775572460

Company Info

About Purple Drive Technologies LLC

Founded in 2007, Purple Drive started as a tech solutions firm and has grown into a full-service consulting and talent partner. We help businesses navigate complex technology challenges while connecting top professionals with career-defining opportunities.

We believe in transforming businesses through smart IT solutions and empowering technologists to grow their expertise through challenging projects and meaningful partnerships. Built on over 20 years of trusted relationships, we create success stories for both our clients and the talented professionals who drive innovation forward.
