Job Details
Site Reliability Engineering
Kansas City, KS - Onsite for 4 days
Long-term Contract
Flexibility to work in a 24x7 environment
Skills
Production support expertise with SRE Observability experience:
Proactive issue identification using observability tools.
Skill in using a range of monitoring and observability tools to track system performance
Production support activities, including proactive identification of issues using observability tools and correlating inputs from various dashboards and tools to drive resolution
Experience in swiftly identifying probable failure points by analyzing multiple inputs: logs, observability dashboards, recent application changes, infrastructure and network changes, etc. Basic troubleshooting ability at every layer of the tech stack (application, database, infrastructure (container platforms), and network)
Experience in setting up observability dashboards based on Splunk logs
Communication:
Excellent communication skills. Candidates are expected to actively lead and triage proactively identified issues and incidents, including on calls where VPs/SVPs are present.
Leadership in triage calls: directing teams on the actions to be taken during the call
Automation:
Experience in identifying toil and automating it away
Technical expertise:
Analysis of issues via Splunk (including Splunk APM and Splunk O11y), AppDynamics, Grafana, RedMetrics, ThousandEyes
Debugging of issues in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix
Debugging of issues in containerized and cloud environments: Docker, Kubernetes, AWS, PCF, Azure
Analysis of issues using APM tools, NMON, and Wireshark
Database performance monitoring and analysis
Experience in UEM and synthetic monitoring setup
Experience in heap dump analysis, memory leak analysis and resource optimization
Optional skills:
ServiceNow (including AIOps, tools for Self-Heal and automated playbooks)
Development experience in some of the following technologies: Java, Python, AWS, Azure, Oracle, Cassandra, SQL Server, MySQL, and MongoDB