Site Reliability Engineer (SRE) - Regional Multi Project Platform

• Posted 4 hours ago • Updated 4 hours ago
Full Time

Job Details

Skills

  • IaaS
  • FOCUS
  • High Availability
  • Failover
  • Disaster Recovery
  • Capacity Management
  • Performance Metrics
  • Quality Assurance
  • Reliability Engineering
  • Scalability
  • Continuous Improvement
  • Microservices
  • Distributed Computing
  • Provisioning
  • Lifecycle Management
  • Build Automation
  • Decision-making
  • SDK
  • Research
  • Incident Management
  • Root Cause Analysis
  • Debugging
  • Systems Design
  • SaaS
  • Operational Excellence
  • Cloud Computing
  • Terraform
  • Ansible
  • Jenkins
  • Streaming
  • Apache Kafka
  • Apache NiFi
  • Elasticsearch
  • MySQL
  • Vertica
  • Apache ZooKeeper
  • Grafana
  • Nginx
  • Linux
  • Artificial Intelligence
  • Workflow
  • Linux Administration
  • Amazon Web Services
  • OpenStack
  • Management
  • Kubernetes
  • Continuous Integration
  • Continuous Delivery
  • Software Engineering
  • Information Systems
  • Computer Science
  • C
  • C++
  • Java
  • Python
  • Policies and Procedures
  • Law
  • Recruiting

Summary

Company:
QUALCOMM SEMICONDUCTORES Y SISTEMAS AVANZADOS DE BAJA CALIFORNIA

Job Area:
Engineering Group > Software Engineering

General Summary:

Cloud Infrastructure & Infrastructure as Code
  • Design, build, and manage cloud infrastructure with a primary focus on AWS, integrated with OpenStack environments
  • Build and maintain Infrastructure as Code using:
    • Terraform
    • Ansible
    • Kubernetes (manifests / Helm)
  • Design infrastructure solutions for:
    • Scalability
    • High availability
    • Performance
    • Reliability
    • Cost efficiency
  • Implement redundancy, failover, and disaster-recovery patterns across services and regions
  • Perform capacity planning based on performance metrics, usage trends, and utilization data

Kubernetes & Platform Reliability
  • Operate and scale production Kubernetes clusters in large-scale environments
  • Partner with development and QA teams to:
    • Improve system reliability and resiliency
    • Automate scalability and availability mechanisms
  • Apply SRE principles including:
    • Service reliability ownership
    • Proactive failure prevention
    • Continuous improvement of operational processes
  • Support microservices-based and distributed system architectures

CI/CD, Automation & Operational Excellence
  • Manage and evolve CI/CD pipelines (e.g., Jenkins)
  • Automate infrastructure provisioning, configuration, and lifecycle management
  • Write, maintain, and improve runbooks for operational processes
  • Build automation to reduce manual intervention and operational toil
  • Plan and execute infrastructure upgrades and maintenance activities
  • Proactively identify and address technical and infrastructure debt

Data Platforms & Streaming Systems
  • Operate, tune, and scale data and streaming platforms, including:
    • Kafka, ZooKeeper
    • NiFi
    • Elasticsearch
    • MySQL, Vertica
  • Diagnose and resolve performance and stability issues across data pipelines
  • Ensure data platform reliability, throughput, and resilience at scale

AI-Assisted SRE & Intelligent Automation
  • Design and maintain knowledge-driven automated runbooks and operational bots
  • Develop AI-assisted operational workflows, including:
    • Incident analysis and summarization
    • Intelligent diagnostics and remediation suggestions
    • Automation of repetitive operational decision-making
  • Work with LLM-based agent frameworks (e.g., Claude Agent SDK or similar):
    • Integrate agents with logs, metrics, monitoring, and internal tools
    • Implement guard-railed, controlled-action automation for production use
  • Research and propose new concepts, tools, and AI-driven approaches to improve reliability and efficiency

Monitoring, Reliability & Incident Management
  • Design and operate monitoring and observability systems using:
    • Prometheus
    • Grafana
    • ELK stack
  • Improve alert quality, signal-to-noise ratio, and troubleshooting efficiency
  • Lead incident response activities, root cause analysis, and post-incident reviews
  • Support software engineers in debugging complex production issues across distributed systems
  • Embed reliability, automation, and operational readiness into system design

Experience Required
  • Extensive experience operating large-scale distributed cloud systems
  • Hands-on experience with AWS in production environments
  • Direct experience working with OpenStack
  • Strong Linux background in large-scale SaaS or production systems
  • Ability to:
    • Maintain and improve existing mission-critical systems
    • Prioritize and systematically reduce technical and infrastructure debt
  • Strong understanding of designing for operational excellence, not just greenfield solutions

Required Skills
  • Programming: Strong experience with Python and/or Go
  • Cloud & IaC: Terraform, Ansible, CloudFormation or equivalent
  • Containers: Kubernetes (production experience)
  • CI/CD: Jenkins and modern CI/CD practices
  • Data & Streaming: Kafka, NiFi, Elasticsearch, MySQL, Vertica, ZooKeeper
  • Observability: Prometheus, Grafana, ELK
  • Infrastructure: Nginx, Linux internals
  • AI / Automation (advantage):
    • Experience integrating AI or LLMs into operational workflows
    • Familiarity with agent-based automation concepts

Experience Guidelines

3+ years of experience in each of the following:
  • managing infrastructure
  • Linux administration in large-scale environments
  • operating production systems on AWS and/or OpenStack
  • managing Kubernetes in production
  • using infrastructure as code
  • working with CI/CD systems

Minimum Qualifications:
Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field.
2+ years of academic or work experience with a programming language such as C, C++, Java, Python, etc.

Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

If you would like more information about this role, please contact Qualcomm Careers.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: RTX171842
  • Position Id: dc1e29ed77c477ea3b403f43d1b9b052