Site Reliability Engineer

Irvine, CA, US • Posted 21 hours ago • Updated 21 hours ago
Contract (Independent, W2, or Corp-to-Corp) • No Travel Required • On-site • Compensation: Depends on Experience

Job Details

Skills

  • AWS, Kubernetes, Terraform, CI/CD, Site Reliability Engineer (SRE)

Summary

Role: Site Reliability Engineer (SRE) – Product Platform Team
Duration: Initial 3–6 months, budgeted for up to 1 year, with potential extension or conversion
Location: Irvine, CA (on-site, 5 days per week)
 
 
Role Overview:
 
As a Site Reliability Engineer, you will join a high-performing team responsible for the reliability, scalability, and operational excellence of the client’s product platform. You will work closely with engineering and product teams to design, build, and operate resilient platform services, implement automation and reliability frameworks, and ensure the availability and performance of shared platform capabilities. You will also own the ongoing operation, monitoring, and continuous improvement of the platform infrastructure and services that support all product teams.
 
Responsibilities:
Must Have:
 
Cloud Architecture & Infrastructure Engineering

•    Architect and maintain multi-cloud infrastructure (AWS, Google Cloud Platform, and Azure) to support enterprise-scale healthcare operations.
•    Implement Infrastructure-as-Code (IaC) using Terraform, Helm charts, and CloudFormation to automate resource provisioning and ensure consistency across environments.
•    Configure, manage, and deploy Kubernetes clusters and cloud native tooling according to best practices for scaling, resiliency, and reliability.
•    Manage and optimize multi-cluster Kubernetes environments, utilizing Istio service mesh for advanced traffic management, service discovery, and observability.
•    Enforce security standards, including SAST, DAST, and code-quality scans, on Kubernetes clusters, application pipelines, and containers, and remediate any platform-related security findings.
•    Design solutions aligned with business value in terms of total cost of ownership (TCO) and return on investment (ROI).
•    Design cross-region disaster recovery strategies and rolling deployment architectures to ensure high availability and business continuity for mission-critical applications.
 
DevOps & Automation

•    Develop and maintain CI/CD pipelines using Bitbucket Pipelines and ArgoCD, leveraging GitOps practices for automated testing and deployment.
•    Develop and implement database deployment pipelines using Liquibase.
•    Create automation for infrastructure and application configuration using Ansible.
•    Implement automated health validation and failover capabilities to facilitate zero-downtime updates.
•    Provide technical mentorship to engineering teams on event-driven architecture, streaming best practices, and DevOps methodologies.
 
Security, Compliance & Observability

•    Design comprehensive monitoring and observability solutions using Prometheus, Grafana, and OpenTelemetry for distributed tracing and system performance visibility.
•    Implement robust security architectures using Conjur Cloud for automated credential rotation, along with default-deny network policies.
•    Configure advanced authentication mechanisms, including SASL/SSL with SCRAM-SHA-512, and manage RBAC permissions for database clusters.
•    Ensure infrastructure adherence to regulatory compliance frameworks (HIPAA, SOC 2, ISO 27001) through automated policy enforcement and encrypted communication channels.
•    Building "Golden Signal" dashboards (Latency, Traffic, Errors, Saturation) that automatically populate for every new microservice.
•    Create automation to reduce TOIL and auto healing for remediations within monitoring
•    Track application and platform costs from all layers and optimize costs on the platform.
 
Nice to Have:

Event Streaming & Data Processing
•    Architect real-time stream processing applications using Apache Flink with high-availability configurations, incorporating RocksDB state management and persistent storage.
•    Build and manage event streaming infrastructure using Apache Kafka (Amazon MSK) and Kafka Connect, ensuring seamless data ingestion and integration.
•    Implement Debezium CDC (Change Data Capture) for real-time synchronization of MongoDB data streams.
•    Manage Confluent Schema Registry to support Avro, JSON, and Protobuf schemas, ensuring backward compatibility across data pipelines.
•    Automate data pipelines and orchestration using Dagster and cloud-native data services.
MLOps & AI Infrastructure
•    Design and implement AI platform architecture to support MCP and agents using cloud-native tools and security best practices.
•    Establish and maintain MLOps pipelines using MLflow and Kubeflow to support the training, deployment, and monitoring of machine learning models.
•    Design infrastructure to support Generative AI (Gen-AI) and Large Language Models (LLM), specifically regarding Retrieval-Augmented Generation (RAG) implementations.
•    Create monitoring systems around AI applications to ensure performance and accuracy.
•    Configure and deploy vector databases to maintain scalability and indexing performance for RAG-based AI workloads.
•    Implement security and guardrails around AI workloads.
 
Qualifications:
Required:
 
•    BS degree in Computer Science or a related field plus 4 years of relevant technology experience, or an equivalent combination of education and experience in lieu of a degree; 4+ years of relevant experience required.
•    Hands-on experience supporting production-grade cloud infrastructure in at least one major cloud provider (AWS, Google Cloud Platform, or Azure).
•    Practical experience operating and maintaining Kubernetes-based platforms in production environments.
•    Experience with Infrastructure as Code (IaC) tools such as Terraform, Helm, or CloudFormation.
•    Working knowledge of CI/CD and GitOps practices, including automated testing and deployment pipelines.
•    Experience implementing or supporting monitoring, alerting, and observability solutions (metrics, logs, traces).
•    Strong troubleshooting skills across distributed systems, including performance, availability, and reliability issues.
•    Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash).
•    Experience participating in on-call rotations, incident response, and root cause analysis.
 
Preferred:

•    Experience operating multi-cloud environments (AWS, Google Cloud Platform, Azure).
•    Experience with event streaming platforms such as Apache Kafka, Kafka Connect, or managed services (e.g., Amazon MSK).
•    Familiarity with service mesh technologies (e.g., Istio) and advanced traffic management patterns.
•    Exposure to stream processing frameworks (e.g., Apache Flink) and CDC tools such as Debezium.
•    Experience supporting MLOps or AI infrastructure, including ML pipelines, model deployment, or GenAI workloads.
•    Familiarity with observability standards such as OpenTelemetry and Golden Signals (Latency, Traffic, Errors, Saturation).
•    Experience working in regulated environments and supporting compliance frameworks (HIPAA, SOC 2, ISO 27001).
•    Experience implementing security best practices for cloud-native platforms (IAM, secrets management, RBAC).
•    Prior experience in platform engineering or internal developer platforms.
•    Exposure to cost optimization and FinOps practices in cloud environments.
 
Knowledge/Skills/Abilities:

•    Ability to multi-task effectively without compromising the quality of the work.
•    Excellent interpersonal, oral and written communication skills.
•    Detail-oriented, organized, process-focused problem solver; proactive, ambitious, and customer-service oriented.
•    Ability to draw conclusions and make independent decisions with limited information.
•    Ability to respond to common inquiries from customers, staff, regulatory agencies, vendors and other members of the business community.
•    Self-motivated, reliable individual capable of working independently as well as part of the team.
•    Motivated to drive improvement in a challenging environment.
•    Strong background in data structures, algorithms and debugging.
•    Demonstrated technical leadership and successful participation in projects involving multiple engineers.
•    Ability to learn quickly, understand complex systems and to work closely with others across multiple teams.
•    Ability to handle uncertainty, time pressure and large technical challenges.
•    Ability to deliver high-quality work on time.
•    Strong attention to detail, highly organized, and computer literate.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10113037
  • Position Id: 528-14401-
