Site Reliability Engineer

Irvine, CA, US • Posted 21 hours ago • Updated 21 hours ago
Contract (Independent, W2, or Corp-to-Corp) • No Travel Required • On-site • Compensation: Depends on Experience

Job Details

Skills

  • AWS, Kubernetes, Terraform, CI/CD, Site Reliability Engineer (SRE)

Summary

Role: Site Reliability Engineer (SRE) – Product Platform Team
Duration: Initial 3–6 months, budgeted for up to 1 year, with potential extension or conversion
Location: Irvine, CA (on-site, 5 days per week)
 
 
Role Overview:
 
As a Site Reliability Engineer, you will join a high-performing team responsible for the reliability, scalability, and operational excellence of the client’s product platform. You will work closely with engineering and product teams to design, build, and operate resilient platform services, implement automation and reliability frameworks, and ensure the availability and performance of shared platform capabilities. You will also own the ongoing operation, monitoring, and continuous improvement of the platform infrastructure and services that support all product teams.
 
Responsibilities:
Must Have:
 
Cloud Architecture & Infrastructure Engineering

•    Architect and maintain multi-cloud infrastructure (AWS, Google Cloud Platform, and Azure) to support enterprise-scale healthcare operations.
•    Implement Infrastructure-as-Code (IaC) using Terraform, Helm charts, and CloudFormation to automate resource provisioning and ensure consistency across environments.
•    Configure, manage, and deploy Kubernetes clusters and cloud native tooling according to best practices for scaling, resiliency, and reliability.
•    Manage and optimize multi-cluster Kubernetes environments, utilizing Istio service mesh for advanced traffic management, service discovery, and observability.
•    Enforce security standards, including SAST, DAST, and code-quality scans, on Kubernetes clusters, application pipelines, and containers, and remediate any platform-related security findings.
•    Design solutions aligned with business value in terms of total cost of ownership (TCO) and return on investment (ROI).
•    Design cross-region disaster recovery strategies and rolling deployment architectures to ensure high availability and business continuity for mission-critical applications.
 
DevOps & Automation

•    Develop and maintain CI/CD pipelines using Bitbucket Pipelines and ArgoCD, leveraging GitOps practices for automated testing and deployment.
•    Develop and implement database deployment pipelines using Liquibase.
•    Create automation for infrastructure and application configuration using Ansible.
•    Implement automated health validation and failover capabilities to facilitate zero-downtime updates.
•    Provide technical mentorship to engineering teams on event-driven architecture, streaming best practices, and DevOps methodologies.
 
Security, Compliance & Observability

•    Design comprehensive monitoring and observability solutions using Prometheus, Grafana, and OpenTelemetry for distributed tracing and system performance visibility.
•    Implement robust security architectures using Conjur Cloud for automated credential rotation, along with default-deny network policies.
•    Configure advanced authentication mechanisms, including SASL/SSL with SCRAM-SHA-512, and manage RBAC permissions for database clusters.
•    Ensure infrastructure adherence to regulatory compliance frameworks (HIPAA, SOC 2, ISO 27001) through automated policy enforcement and encrypted communication channels.
•    Building "Golden Signal" dashboards (Latency, Traffic, Errors, Saturation) that automatically populate for every new microservice.
•    Create automation to reduce TOIL and auto healing for remediations within monitoring
•    Track application and platform costs from all layers and optimize costs on the platform.
 
Nice to Have:

Event Streaming & Data Processing
•    Architect real-time stream processing applications using Apache Flink with high-availability configurations, incorporating RocksDB state management and persistent storage.
•    Build and manage event streaming infrastructure using Apache Kafka (Amazon MSK) and Kafka Connect, ensuring seamless data ingestion and integration.
•    Implement Debezium CDC (Change Data Capture) for real-time synchronization of MongoDB data streams.
•    Manage Confluent Schema Registry to support Avro, JSON, and Protobuf schemas, ensuring backward compatibility across data pipelines.
•    Automate data pipelines and orchestration using Dagster and cloud-native data services.
MLOps & AI Infrastructure
•    Design and implement AI platform architecture to support MCP and agents using cloud-native tools and security best practices.
•    Establish and maintain MLOps pipelines using MLflow and Kubeflow to support the training, deployment, and monitoring of machine learning models.
•    Design infrastructure to support Generative AI (Gen-AI) and Large Language Models (LLM), specifically regarding Retrieval-Augmented Generation (RAG) implementations.
•    Create monitoring systems around AI applications to ensure performance and accuracy.
•    Configure and deploy vector databases to maintain scalability and indexing performance for RAG-based AI workloads.
•    Implement security and guardrails around AI workloads.
 
Qualifications:
Required:
 
•    BS degree in Computer Science or a related field plus 4 years of relevant technology experience, or an equivalent combination of education and experience in lieu of a degree; 4+ years of relevant experience required.
•    Hands-on experience supporting production-grade cloud infrastructure in at least one major cloud provider (AWS, Google Cloud Platform, or Azure).
•    Practical experience operating and maintaining Kubernetes-based platforms in production environments.
•    Experience with Infrastructure as Code (IaC) tools such as Terraform, Helm, or CloudFormation.
•    Working knowledge of CI/CD and GitOps practices, including automated testing and deployment pipelines.
•    Experience implementing or supporting monitoring, alerting, and observability solutions (metrics, logs, traces).
•    Strong troubleshooting skills across distributed systems, including performance, availability, and reliability issues.
•    Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash).
•    Experience participating in on-call rotations, incident response, and root cause analysis.
 
Preferred:

•    Experience operating multi-cloud environments (AWS, Google Cloud Platform, Azure).
•    Experience with event streaming platforms such as Apache Kafka, Kafka Connect, or managed services (e.g., Amazon MSK).
•    Familiarity with service mesh technologies (e.g., Istio) and advanced traffic management patterns.
•    Exposure to stream processing frameworks (e.g., Apache Flink) and CDC tools such as Debezium.
•    Experience supporting MLOps or AI infrastructure, including ML pipelines, model deployment, or GenAI workloads.
•    Familiarity with observability standards such as OpenTelemetry and Golden Signals (Latency, Traffic, Errors, Saturation).
•    Experience working in regulated environments and supporting compliance frameworks (HIPAA, SOC 2, ISO 27001).
•    Experience implementing security best practices for cloud-native platforms (IAM, secrets management, RBAC).
•    Prior experience in platform engineering or internal developer platforms.
•    Exposure to cost optimization and FinOps practices in cloud environments.
 
Knowledge/Skills/Abilities:

•    Ability to multi-task effectively without compromising the quality of the work.
•    Excellent interpersonal, oral and written communication skills.
•    Detail-oriented, organized, process-focused problem solver; proactive, ambitious, and customer-service oriented.
•    Ability to draw conclusions and make independent decisions with limited information.
•    Ability to respond to common inquiries from customers, staff, regulatory agencies, vendors and other members of the business community.
•    Self-motivated, reliable individual capable of working independently as well as part of the team.
•    Motivated to drive improvement in a challenging environment.
•    Strong background in data structures, algorithms and debugging.
•    Demonstrated technical leadership and successful participation in projects involving multiple engineers.
•    Ability to learn quickly, understand complex systems and to work closely with others across multiple teams.
•    Ability to handle uncertainty, time pressure and large technical challenges.
•    Ability to deliver high-quality work on time.
•    Strong attention to detail, highly organized, and computer literate.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10113037
  • Position Id: 528-14401-
