Role: Site Reliability Engineer (SRE)
Location: Palo Alto, CA (Onsite from Day 1)
Job Type: Contract (W2)
Skill Matrix:
| Name | Required |
| Programming | Yes |
| SRE | Yes |
| Grafana | Yes |
| Prometheus | Yes |
| AWS | Yes |
| Cloud Infrastructure | Yes |
| Linux | Yes |
| UNIX | Yes |
Top skills required for this role:
Programming: Proficiency in languages like Python, Java, or Go.
System Administration: Strong understanding of Linux/Unix systems.
Cloud Infrastructure: Experience with AWS.
Infrastructure as Code (IaC): Knowledge of tools like Terraform or Ansible.
Monitoring Tools: Proficiency with tools such as Prometheus, Grafana, or Datadog.
Job Description/Responsibilities:
Automation and Tooling: Writing code to automate operational tasks such as provisioning, configuration changes, and system updates, reducing manual work and human error.
System Monitoring and Alerting: Developing and maintaining observability stacks (logs, metrics, tracing) to proactively detect issues before they impact users.
Incident Response and On-Call: Managing a 24/7 on-call rotation to respond to, troubleshoot, and resolve production incidents.
Post-Incident Reviews (Postmortems): Conducting blameless, in-depth reviews of incidents to identify root causes and implement preventive measures.
Capacity Planning: Analyzing system resource utilization to ensure infrastructure can scale to handle future load requirements.
Performance Optimization: Identifying and fixing bottlenecks in software and infrastructure to improve system efficiency and responsiveness.
Error Budget Management: Defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and tracking the resulting error budget to determine whether a service is reliable enough to allow new feature deployments.
Chaos Engineering: Testing system resilience by intentionally introducing failures to ensure systems are fault-tolerant.
Years of Experience: 8+ years