Hi,
I hope you are doing well.
I have an urgent opening. Please review the job description below and let me know if it would be of interest to you.
Title: Site Reliability Engineer (Hybrid)
Duration: 6 Months
Location: San Jose, CA
About the job
Responsibilities & Required Skills/Experience:
- NVIDIA (DGX) A100/H100/H200
- Cisco UCS-C885A
- Docker
- NVIDIA-certified professionals preferred
- Infrastructure knowledge of the above technologies
- DevOps Automation
- CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
- Terraform, Ansible
- Python
- Enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos
The AI Infrastructure SRE is responsible for the following:
- Apply technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
- Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional system requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
- Deliver automation through CI/CD pipelines, chatbots, and similar channels.
- Implement metrics-driven processes to ensure service quality targets are met.
If you are interested, please share your updated resume along with the best number and time to reach you.
Thanks & Regards