Sr. Reliability Engineer (26861)

San Jose, CA, US • Posted 60+ days ago • Updated 10 hours ago

Full Time

On-site

USD $145,000.00 - 165,000.00 per year

Job Details

Skills

Computer Networking
Apache Hadoop
Big Data
HPC
IoT
Embedded Systems
High Availability
Scalability
SAN
IaaS
Terraform
Ansible
Cloud Computing
Data Storage
Ceph
Weka
Capacity Management
Forecasting
Root Cause Analysis
Service Level
DevOps
Machine Learning Operations (ML Ops)
GitLab
Continuous Integration
Continuous Delivery
Regulatory Compliance
Access Control
RBAC
LDAP
SSO
TLS
Network
Documentation
Incident Management
Knowledge Transfer
Onboarding
Computer Science
Linux
Ubuntu
Red Hat Enterprise Linux
CentOS
Docker
Orchestration
Kubernetes
Management
CUDA
Grafana
Scripting
Bash
Python
Network Protocols
DNS
DHCP
Border Gateway Protocol
InfiniBand
Ethernet
Collaboration
Communication
Machine Learning (ML)
Workflow
Storage
Artificial Intelligence
Provisioning
PXE
GPU
Testing
Benchmarking
ITIL
Change Management
Linux+
Training
Forms

Summary

Job Req ID: 26861

About Supermicro:

Supermicro is a top-tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest-growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary:

As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties and Responsibilities:

Includes the following essential duties and responsibilities (other duties may also be assigned):

Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance.
Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.
Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference.
Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.
Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access through Role-Based Access Control (RBAC), LDAP-integrated SSO, TLS encryption, and network segmentation policies.
Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience, plus 8 years of experience in the areas below.
Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
Strong scripting and coding skills (Bash, Python, or Go).
Exposure to secure multi-tenant environments and zero-trust architectures.
Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

Preferred Qualifications:

Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow.
Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS).
Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman.
Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking.
Familiarity with ITIL processes or structured change management in production systems is a plus.
Certifications: CKA, CKAD, Linux+, or related credentials

Salary Range

$145,000 - $165,000

The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.

EEO Statement

Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10172660
Position Id: f05e7a698b7486d8408310862bad2939
Posted 30+ days ago
