Job Details
Role: Senior Site Reliability Engineer
Locations: Richardson, TX / Raleigh, NC / Phoenix, AZ / Hartford, CT / Indianapolis, IN
Type of Hiring: FTE
Job Description:
Bachelor's degree or foreign equivalent required from an accredited institution. Will also consider three years of progressive experience in the specialty in lieu of every year of education
At least 11 years of Information Technology experience
At least 6 years of site reliability engineering (SRE) experience in large programs, with a focus on architecting and implementing observability and automation across the entire operations lifecycle.
Observability & Monitoring: Implement logging, monitoring, and alerting using any one of Dynatrace, Datadog, Splunk, Nagios, Prometheus, Grafana, ELK stack, or New Relic.
Analyze monitoring data and golden signals to identify trends and patterns and proactively address potential problems.
Debug and optimize code, and automate routine operational tasks.
Improve automation and increase the system's self-healing capability.
Incident Management: participate in production incidents, perform root cause analysis (RCA), and drive post-mortem improvements.
Develop and maintain dashboards and reports to visualize system health and performance.
Use technologies such as Ansible, Python, Terraform, PowerShell/Shell, and JSON to create automation that reduces operational toil.
Develop automation solutions for repeated incidents and service tasks (provisioning, scaling, backup, performance management, security, capacity management, etc.) in infrastructure operations, or develop automation and optimization solutions for repeated tickets and signals in application operations.
Preferred Qualifications:
Working Knowledge of:
Troubleshooting database failures and providing speedy resolutions
SLIs, SLOs, and error budgets
Event correlation and AIOps, with a deep understanding of ITSM tools
Microservices architecture with APIs and REST APIs
CI/CD tooling and best practices
Cloud platforms such as AWS, Azure, and Google Cloud
Container orchestration and practices, including Kubernetes and Docker Swarm
Infrastructure automation tools like Terraform, CloudFormation, Ansible, and Puppet (any one)
Scripting languages, any of the following: Python, JSON, Java, Node.js, PHP, PowerShell(M), or Bash/Shell/Perl
ITSM tools such as ServiceNow
Excellent communication and client interaction skills, along with exceptional written, verbal, and technical documentation skills
Extraordinary planning, project management, coordination, and analytical skills
Hands-on experience working in a Global Delivery Model with onsite and offshore resources
Exceptional organizational skills
Ability to manage and prioritize tasks efficiently
Readiness to demonstrate a proactive attitude
Solid attention to detail and excellent written and verbal communication skills are required
Ability to work in a team in a diverse, multi-stakeholder environment