Site Reliability Engineer (SRE) - Regional Multi Project Platform

• Posted 30+ days ago • Updated 2 days ago

Full Time

Job Details

Skills

IaaS
FOCUS
High Availability
Failover
Disaster Recovery
Capacity Management
Performance Metrics
Quality Assurance
Reliability Engineering
Scalability
Continuous Improvement
Microservices
Distributed Computing
Provisioning
Lifecycle Management
Build Automation
Decision-making
SDK
Research
Incident Management
Root Cause Analysis
Debugging
Systems Design
SaaS
Operational Excellence
Cloud Computing
Terraform
Ansible
Jenkins
Streaming
Apache Kafka
Apache NiFi
Elasticsearch
MySQL
Vertica
Apache ZooKeeper
Grafana
Nginx
Linux
Artificial Intelligence
Workflow
Linux Administration
Amazon Web Services
OpenStack
Management
Kubernetes
Continuous Integration
Continuous Delivery
Software Engineering
Information Systems
Computer Science
C
C++
Java
Python
Policies and Procedures
Law
Recruiting

Summary

Company:
QUALCOMM SEMICONDUCTORES Y SISTEMAS AVANZADOS DE BAJA CALIFORNIA

Job Area:
Engineering Group > Software Engineering

General Summary:

Cloud Infrastructure & Infrastructure as Code

Design, build, and manage cloud infrastructure with a primary focus on AWS, integrated with OpenStack environments
Build and maintain Infrastructure as Code using:
- Terraform
- Ansible
- Kubernetes (manifests / Helm)

Design infrastructure solutions for:
- Scalability
- High availability
- Performance
- Reliability
- Cost efficiency

Implement redundancy, failover, and disaster-recovery patterns across services and regions
Perform capacity planning based on performance metrics, usage trends, and utilization data

Kubernetes & Platform Reliability

Operate and scale production Kubernetes clusters in large-scale environments
Partner with development and QA teams to:
- Improve system reliability and resiliency
- Automate scalability and availability mechanisms

Apply SRE principles including:
- Service reliability ownership
- Proactive failure prevention
- Continuous improvement of operational processes

Support microservices-based and distributed system architectures

CI/CD, Automation & Operational Excellence

Manage and evolve CI/CD pipelines (e.g., Jenkins)
Automate infrastructure provisioning, configuration, and lifecycle management
Write, maintain, and improve runbooks for operational processes
Build automation to reduce manual intervention and operational toil
Plan and execute infrastructure upgrades and maintenance activities
Proactively identify and address technical and infrastructure debt

Data Platforms & Streaming Systems

Operate, tune, and scale data and streaming platforms, including:
- Kafka, ZooKeeper
- NiFi
- Elasticsearch
- MySQL, Vertica

Diagnose and resolve performance and stability issues across data pipelines
Ensure data platform reliability, throughput, and resilience at scale

AI-Assisted SRE & Intelligent Automation

Design and maintain knowledge-driven automated runbooks and operational bots
Develop AI-assisted operational workflows, including:
- Incident analysis and summarization
- Intelligent diagnostics and remediation suggestions
- Automation of repetitive operational decision-making

Work with LLM-based agent frameworks (e.g., Claude Agent SDK or similar):
- Integrate agents with logs, metrics, monitoring, and internal tools
- Implement guard-railed, controlled-action automation for production use

Research and propose new concepts, tools, and AI-driven approaches to improve reliability and efficiency

Monitoring, Reliability & Incident Management

Design and operate monitoring and observability systems using:
- Prometheus
- Grafana
- ELK stack

Improve alert quality, signal-to-noise ratio, and troubleshooting efficiency
Lead incident response activities, root cause analysis, and post-incident reviews
Support software engineers in debugging complex production issues across distributed systems
Embed reliability, automation, and operational readiness into system design

Experience Required

Extensive experience operating large-scale distributed cloud systems
Hands-on experience with AWS in production environments
Direct experience working with OpenStack
Strong Linux background in large-scale SaaS or production systems
Ability to:
- Maintain and improve existing mission-critical systems
- Prioritize and systematically reduce technical and infrastructure debt

Strong understanding of designing for operational excellence, not just greenfield solutions

Required Skills

Programming: Strong experience with Python and/or Go
Cloud & IaC: Terraform, Ansible, CloudFormation or equivalent
Containers: Kubernetes (production experience)
CI/CD: Jenkins and modern CI/CD practices
Data & Streaming: Kafka, NiFi, Elasticsearch, MySQL, Vertica, ZooKeeper
Observability: Prometheus, Grafana, ELK
Infrastructure: Nginx, Linux internals
AI / Automation (advantage):
- Experience integrating AI or LLMs into operational workflows
- Familiarity with agent-based automation concepts

Experience Guidelines

3+ years in:
- Overall experience managing infrastructure
- Linux administration in large-scale environments
- Operating production systems on AWS and/or OpenStack
- Managing Kubernetes in production
- Using infrastructure as code
- Working with CI/CD systems

Minimum Qualifications:
Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field.
2+ years of academic or work experience with a programming language such as C, C++, Java, or Python.

Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations so that individuals with disabilities are able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

If you would like more information about this role, please contact Qualcomm Careers.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: RTX171842
Position Id: dc1e29ed77c477ea3b403f43d1b9b052