Sr. Site Reliability Engineer

Overview

On Site
USD120,000 - USD140,000
Full Time

Job Details

job summary:

As Senior Site Reliability Engineer, you'll power our breakthrough cancer research by delivering reliable, scalable, and cost-effective cloud infrastructure. Working in our fast-paced startup environment, you'll ensure our scientists and engineers can innovate without constraints, keeping our critical systems running smoothly while continuously improving our technology platform.



Main Objectives




  • Optimize our AWS cloud infrastructure for research applications, focusing on performance, reliability, and cost while managing infrastructure through code


  • Implement monitoring, alerting, and disaster recovery systems to ensure high availability


  • Collaborate with data science teams on ML pipelines and infrastructure


  • Drive technical excellence in cloud operations, MLOps, deployment pipelines, containerization, and security



Responsibilities




  • Build and manage our AWS cloud infrastructure (compute, storage, networking, databases) with appropriate security controls and IAM policies


  • Deploy and manage Kubernetes clusters utilizing tools like Karpenter, Helm, and Prometheus


  • Implement infrastructure as code using Terraform, with expertise in modules, state management, and custom providers


  • Set up monitoring, logging, and alerting systems using tools like Prometheus, Grafana, and CloudWatch


  • Actively track and optimize cloud costs across all environments


  • Support MLOps infrastructure by maintaining data pipelines using tools such as Argo and Metaflow


  • Troubleshoot infrastructure issues and execute routine maintenance tasks including patching and backups


  • Document critical infrastructure configurations and operational procedures



Requirements




  • 5+ years of experience in software engineering with a focus on DevOps and Site Reliability Engineering


  • Expert-level knowledge of AWS services and architecture with strong coding skills


  • Deep expertise with Kubernetes ecosystem (Karpenter, Helm, Istio, etc.)


  • Advanced skills with Terraform and building modern CI/CD pipelines in an agile environment


  • Strong understanding of networking (VPCs, VPNs, Direct Connect) and security best practices


  • Fully proficient with Git and agile development practices


  • Exceptional troubleshooting skills and cost optimization experience


  • Demonstrated ability to lead projects and initiatives in a fast-paced environment



Bonus Points




  • Experience with MLOps infrastructure (Argo, Metaflow, Airflow)


  • Knowledge of serverless Kubernetes platforms (Knative, KubeVirt, OpenFaaS)


  • Strong Python backend engineering skills with ability to contribute to application development tasks


  • Prior experience in fast-moving startups


  • Domain experience in life sciences or biotechnology


  • Enterprise networking experience (Palo Alto firewalls, Juniper switches/APs)





location: Durham, North Carolina

job type: Permanent

salary: $120,000 - $140,000 per year

work hours: 9am to 6pm

education: Bachelor's


