Job Details
Job Title: SRE Consultant (cloud data center experience is mandatory)
Location: Santa Clara, CA (onsite 5 days a week); local candidates preferred
Onsite Requirement: Yes
Number of days onsite: 5
Mandatory Areas
Must Have Skills
Skill 1: Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.
Skill 2: Maintain KPI pipelines using Jenkins, Python, and ELK.
Skill 3: Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.
Cloud data center experience is mandatory.
o Familiarity with Nvidia hardware such as GPUs and Tegra is a plus.
Good to Have Skills
Skill 1: Automation using Jenkins, Python, Go, and Bash.
Requirements/Skills:
On-prem infrastructure management
o Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.
Guard SLAs
o Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches.
Observability
o Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK.
o Improve monitoring systems by adding custom alerts based on business needs.
Automation & Optimization
o Help with capacity planning, optimization, and efforts to improve utilization.
Day-to-Day Support
o Support user-reported issues. Monitor alerts and take necessary action.
o Actively participate in war rooms for critical issues.
Collaboration & Documentation
o Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.
Tech stack
o Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.
o Automation using Jenkins, Python, Go, and Bash.
o Infrastructure tools such as Kubernetes, MySQL, Prometheus, Grafana, and ELK.