Overview
On Site
Depends on Experience
Full Time
No Travel Required
Unable to Provide Sponsorship
Skills
Performance Monitoring
DevOps
Incident Management
MySQL
Kubernetes
Grafana
Job Details
Role Overview:
Avanciers is seeking skilled Site Reliability Engineers (SREs) with hands-on experience managing native servers and infrastructure in data center environments. The ideal candidate will have strong expertise in automation, observability, and performance optimization to keep infrastructure operations reliable and scalable.
Key Responsibilities:
- Uphold SLAs: Monitor, maintain, and enforce Service Level Agreements (SLAs) for critical engineering services. Implement robust alerting, monitoring, and incident response workflows to meet defined performance metrics.
- Incident Management: Conduct root cause analysis, post-incident reviews, and drive continuous improvements to prevent recurrence of failures.
- Observability:
  - Configure and manage Prometheus, Grafana, and ELK Stack for performance monitoring and log analysis.
  - Develop KPI dashboards and automation pipelines using Jenkins, Python, and ELK.
  - Enhance monitoring systems with custom alerts aligned with business objectives.
- Automation & Optimization:
  - Design and maintain automation scripts in Python, Go, or Bash.
  - Support capacity planning and optimize infrastructure utilization.
- Operational Support:
  - Respond to and resolve production incidents.
  - Participate in war rooms during critical system outages or performance degradation events.
- Collaboration & Documentation:
  - Maintain clear and up-to-date documentation for operational procedures, configurations, and troubleshooting workflows.
  - Collaborate closely with engineering, DevOps, and IT infrastructure teams.
Technical Skills Required:
- Bare-metal Data Center Tools: IPMI, Redfish, KVM, etc.
- Automation: Jenkins, Python, Go, Bash
- Infrastructure & Monitoring Tools: Kubernetes, MySQL, Prometheus, Grafana, ELK
- Nice to Have: Familiarity with NVIDIA hardware, GPUs, or Tegra platforms