Site Reliability Engineer (USC)

Overview

Remote
$140,000 - $160,000
Full Time
10% Travel

Skills

SRE
DevOps
AWS
Kubernetes
Azure
GCP

Job Details

Title: Site Reliability Engineer

Start: September 2025

Duration: FTE

Location: REMOTE

***Active Top Secret with SCI eligibility Required***

Role Overview

As a Site Reliability Engineer at Karthik, you will be responsible for designing, building, and maintaining scalable, reliable, and secure platforms for hybrid cloud environments and modern containerized applications.

This role emphasizes seamless deployment, efficient operations, and continuous platform improvement using your expertise in hybrid cloud, Kubernetes, site reliability engineering, and DevOps practices.

Responsibilities

  • Design and manage hybrid cloud architectures, integrating on-premises and cloud environments securely and efficiently.
  • Deploy and maintain Kubernetes clusters to support containerized workloads with scalability, reliability, and security.
  • Develop and manage infrastructure-as-code (IaC) solutions (e.g., Terraform, Ansible) to automate and standardize infrastructure deployments.
  • Build and optimize CI/CD pipelines to streamline application deployment and testing workflows.
  • Implement robust monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, ELK) to enhance platform reliability and visibility.
  • Automate incident management processes, including root cause analysis and self-healing mechanisms, to improve platform stability.
  • Optimize cloud usage and costs while maintaining high performance and security standards.
  • Ensure compliance with security best practices for hybrid cloud and containerized environments.
  • Collaborate with development, security, and operations teams to align infrastructure with application requirements.
  • Stay updated on emerging technologies and trends to incorporate relevant advancements into the platform.
  • Proactively identify problems, performance bottlenecks, and areas for improvement.
  • Plan for future capacity needs to ensure that systems can handle expected workloads.
  • Automate repetitive tasks and infrastructure management to reduce manual effort and improve efficiency.
  • Document work to turn findings into repeatable actions.

SRE Soft Skills:

  • Problem-Solving: The ability to quickly identify, diagnose, and resolve issues, especially during incidents, is critical.
  • Communication: Effective communication skills are essential for collaborating with developers, operations teams, and other stakeholders.
  • Collaboration: Our SREs often work in cross-functional teams, so strong collaboration skills are essential.
  • Analytical Skills: Analyzing data, identifying trends, and making data-driven decisions are important for improving system reliability.
  • Attention to Detail: Our SREs must be meticulous to ensure that systems are configured and maintained correctly.
  • Continuous Learning: The field of SRE is constantly evolving, so it is important to commit to continuous learning and stay up to date with the latest technologies.

Required Skills and Qualifications:

  • Bachelor's degree in a STEM-related field.
  • Certifications in Kubernetes (CKA, CKAD) or cloud platforms (AWS, Azure, Google Cloud Platform).
  • Minimum 4 years of experience in platform engineering or site reliability roles.
  • You are a U.S. citizen (USC) and possess a current Secret clearance.
  • Expertise in hybrid cloud platforms (AWS, Azure, Google Cloud Platform, or private cloud solutions).
  • Proficiency in Kubernetes, including cluster deployment, management, and scaling.
  • Strong knowledge of automation and scripting tools (e.g., Python, Bash).
  • Active Top Secret clearance with SCI eligibility.