Job Details
Role: SRE / Triage
Job Description:
Monitor production commerce applications to proactively identify issues and ensure high availability.
Perform first-level triage and validation of production incidents, assessing impact and urgency.
Analyze and interpret application and infrastructure logs (ELK, Dynatrace, Kubernetes) to isolate and diagnose problems.
Collaborate closely with development and platform teams to escalate and resolve issues efficiently.
Maintain observability dashboards and alerts; tune alert thresholds to improve the signal-to-noise ratio.
Contribute to root cause analysis (RCA) and post-incident reviews to improve system resiliency.
Document triage runbooks, known issues, and SOPs for faster recovery cycles.
Support performance tuning, service availability metrics, and reliability improvement initiatives.
Required Skills and Experience:
Experience in system reliability, production support, or application monitoring for large-scale enterprise systems.
Familiarity with microservices and API-driven ecosystems.
Strong proficiency with the ELK Stack, Dynatrace, and Kubernetes observability tools.
Working knowledge of Java-based application architectures and Cassandra database operations.
Experience with Azure monitoring tools and Kafka monitoring for distributed systems.
MuleSoft monitoring experience is optional but valued.
Familiarity with CI/CD pipelines, automated alerting, and reliability testing frameworks.
Demonstrated experience with production triage, log analysis, and root cause identification.
Excellent communication skills and ability to collaborate across teams.