Job Details
As Senior Site Reliability Engineer, you'll power our breakthrough cancer research by delivering reliable, scalable, and cost-effective cloud infrastructure. Working in our fast-paced startup environment, you'll ensure our scientists and engineers can innovate without constraints, keeping our critical systems running smoothly while continuously improving our technology platform.
Main Objectives
- Optimize our AWS cloud infrastructure for research applications, focusing on performance, reliability, and cost while managing infrastructure through code
- Implement monitoring, alerting, and disaster recovery systems to ensure high availability
- Collaborate with data science teams on ML pipelines and infrastructure
- Drive technical excellence in cloud operations, MLOps, deployment pipelines, containerization, and security
Responsibilities
- Build and manage our AWS cloud infrastructure (compute, storage, networking, databases) with appropriate security controls and IAM policies
- Deploy and manage Kubernetes clusters utilizing tools like Karpenter, Helm, and Prometheus
- Implement infrastructure as code using Terraform, with expertise in modules, state management, and custom providers
- Set up monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and CloudWatch
- Actively track and optimize cloud costs across all environments
- Support MLOps infrastructure by maintaining data pipelines using tools such as Argo and Metaflow
- Troubleshoot infrastructure issues and execute routine maintenance tasks including patching and backups
- Document critical infrastructure configurations and operational procedures
Requirements
- 5+ years of experience in software engineering with a focus on DevOps and Site Reliability Engineering
- Expert-level knowledge of AWS services and architecture with strong coding skills
- Deep expertise with Kubernetes ecosystem (Karpenter, Helm, Istio, etc.)
- Advanced skills with Terraform and building modern CI/CD pipelines in an agile environment
- Strong understanding of networking (VPCs, VPNs, Direct Connect) and security best practices
- Fully proficient with Git and agile development practices
- Exceptional troubleshooting skills and cost optimization experience
- Demonstrated ability to lead projects and initiatives in a fast-paced environment
Bonus Points
- Experience with MLOps infrastructure (Argo, Metaflow, Airflow)
- Knowledge of Kubernetes-native serverless and virtualization platforms (Knative, OpenFaaS, KubeVirt)
- Strong Python backend engineering skills with ability to contribute to application development tasks
- Prior experience in fast-moving startups
- Domain experience in life sciences or biotechnology
- Enterprise networking experience (Palo Alto firewalls, Juniper switches/APs)
location: Durham, North Carolina
job type: Permanent
salary: $120,000 - $140,000 per year
work hours: 9am to 6pm
education: Bachelor's