The Role
You will be a core member of the SRE team, responsible for the reliability, automation, and observability of the infrastructure that the company runs on. You will work across colocation, on-premises lab environments, and cloud platforms, and you will own your systems end-to-end, from initial provisioning through live incident response.
You will partner with hardware and software development teams to support their workload needs, including the associated CI/CD pipelines and automation layer for software tooling. You will also support customer-facing environments where d-Matrix partners collaborate on hardware and software deployments.
What You Will Do
Infrastructure Operations
Own reliability and availability of assigned infrastructure domains: colo server fleets, on-premises lab clusters, cloud environments (AWS, Azure, Google Cloud Platform), and customer-facing platform services.
Perform hands-on infrastructure work: server provisioning, OS configuration, network setup, storage management, and hardware troubleshooting from bare metal up.
Conduct capacity planning and hardware lifecycle management for assigned infrastructure domains; track and report cloud spend for your domains to support FinOps and workload placement decisions.
Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible) for your infrastructure domains: all provisioning and changes go through code, not manual steps.
Build, deploy and document automation to eliminate toil: host lifecycle management, fleet health checks, auto-remediation workflows, and self-service tooling for engineering teams.
Develop networking automations for cluster interconnects, VLAN management, and lab network configurations.
Contribute to shared IaC modules and automation libraries used across the global SRE and data center services teams.
Observability & Incident Response
Design and maintain monitoring dashboards, alerting, and SLIs (Prometheus/Grafana, DataDog) for your infrastructure domains, ensuring signal quality and actionable alerts, and contributing to AIOps-driven detection workflows that reduce time to detect and respond.
Participate in on-call rotation; triage and resolve incidents from bare metal to application layer using structured, AI-assisted workflows, distinguishing infrastructure faults from software or hardware product issues and escalating with clear context.
Produce high-quality RCA reports for P0/P1 incidents with root cause analysis and tracked action items.
Detect performance issues, recommend solutions, and implement fixes that permanently improve system reliability.
Customer & Platform Development Services
Support and operate platform services used by both internal teams and external customers for hardware and software deployment collaboration with d-Matrix.
Ensure QoS and uptime commitments for customer-facing environments; escalate reliability risks proactively.
Document platform configurations, access procedures, and operational runbooks for customer environments.
Documentation & Collaboration
Maintain high-quality runbooks, architecture diagrams, and troubleshooting guides; documentation is part of the job, not an afterthought.
Partner with the DevOps team to ensure infrastructure reliability supports CI/CD pipeline performance and developer experience.
Serve as a technical resource for engineering teams, sharing operational knowledge and raising infrastructure risks early.
What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field (or equivalent experience); 5+ years in SRE, infrastructure engineering, or systems administration.
Strong Linux systems knowledge: networking, storage, systemd, package management, kernel parameters, and performance diagnostics.
Hands-on experience with colocation or on-premises server infrastructure: physical hardware, rack networking, and bare-metal provisioning.
Hands-on experience deploying and operating AI-driven infrastructure tools (AIOps platforms, intelligent alerting, anomaly detection, or LLM-assisted diagnostics) in production environments.
IaC experience with Terraform and/or Ansible: writing and maintaining production configurations, not just running existing playbooks.
Kubernetes operational experience: cluster troubleshooting, workload management, storage, and networking.
Prometheus + Grafana or DataDog: building dashboards, writing alert rules, and understanding signal quality.
Python and/or Bash scripting: production-quality automation, not just one-off scripts.
Incident response experience: structured triage, RCA production, and follow-through on action items.
Comfort operating in fast-moving startups: you own your systems, document what you build, and iterate without waiting for perfect requirements.
Strongly Preferred
Experience operating customer-facing infrastructure or platform services with external reliability expectations.
Cloud infrastructure operations across AWS, Azure, or Google Cloud Platform, including hybrid environments spanning cloud and on-prem.
HPC job scheduler experience (Slurm, LSF, or equivalent): operations and troubleshooting.
Knowledge of high-speed interconnect fabrics (InfiniBand, RoCE, or NVLink): configuration and troubleshooting.
Experience with large-scale infrastructure automation: host lifecycle management, fleet auto-healing, or AIOps-driven operations; building tooling that reduces manual intervention, not just running it.