Overview
Remote
$65+
Contract - Independent
Contract - W2
Contract - 12 Months
Skills
API
Apache Cassandra
Apache Kafka
Collaboration
Communication
Continuous Delivery
Continuous Improvement
Continuous Integration
Dashboard
Dynatrace
High Availability
Innovation
Java
Kubernetes
IaaS
Log Analysis
Microservices
Microsoft Azure
MuleSoft
Performance Tuning
Production Support
ROOT
Recovery
Reliability Engineering
Root Cause Analysis
Job Details
Role Overview:
We are looking for an experienced Site Reliability / Triage Engineer with a strong background in production monitoring, incident triage, and performance optimization for enterprise-scale systems. The ideal candidate will be hands-on with observability tools, application logs, and cloud infrastructure to ensure system reliability and high availability.
Key Responsibilities:
- Monitor production commerce applications to proactively identify issues and maintain uptime.
- Perform first-level triage and validation of production incidents, assessing impact and urgency.
- Analyze logs and telemetry from ELK, Dynatrace, and Kubernetes to isolate and resolve issues.
- Collaborate with development and platform teams to escalate and resolve incidents quickly.
- Maintain and fine-tune observability dashboards and alerts to keep the signal-to-noise ratio high.
- Contribute to Root Cause Analysis (RCA) and post-incident reviews for continuous improvement.
- Document runbooks, SOPs, and known issues to speed up recovery.
- Support performance tuning and reliability improvement initiatives.
Required Skills & Experience:
- 10+ years of experience in system reliability, production support, or application monitoring.
- Strong understanding of microservices, API ecosystems, and Java-based architectures.
- Expertise in ELK Stack, Dynatrace, Kubernetes, and Azure monitoring tools.
- Experience monitoring Cassandra and Kafka, and working with CI/CD pipelines.
- MuleSoft monitoring experience (preferred but not required).
- Proven track record in triaging, log analysis, and root cause identification.
- Excellent communication and collaboration skills across global teams.
Why Join Us:
- Long-term engagement with a leading enterprise project.
- Collaborative, tech-driven environment focused on reliability and innovation.
- Opportunity to work on modern observability and automation frameworks.