Sr. Infrastructure Site Reliability Engineer

Southlake, TX, US • Posted 1 day ago • Updated 11 minutes ago

Full Time

On-site

USD $139,000.00 - 161,000.00 per year

Job Details

Skills

Creative Problem Solving
Finance
Financial Planning
Operational Excellence
Accountability
Operating Systems
Middleware
SAN
Software Engineering
Roadmaps
Stakeholder Engagement
Leadership
SAFE
Instrumentation
Continuous Improvement
Provisioning
Change Management
Storage
Optimization
Failover
Scalability
Incident Management
Root Cause Analysis
Documentation
Cyber Security
Computer Science
Science
IT Management
Management Information Systems
Reliability Engineering
Hosting
Microservices
Message Queues
Caching
API
VMware
Linux
Microsoft Operating Systems
Microsoft Windows Server
Configuration Management
Amazon Web Services
Microsoft Azure
Productivity
Scripting
Python
Windows PowerShell
Bash
Ansible
Terraform
Splunk
Grafana
AppDynamics
Dynatrace
Software Development Methodology
Analytical Skill
Conflict Resolution
Problem Solving
Service Level
Budget
High Availability
Disaster Recovery
Regulatory Compliance
Capacity Management
Forecasting
Google Cloud Platform
Google Cloud
Cloud Computing
Software Development
Continuous Integration and Development
Bitbucket
GitHub
Qualys
Progress Chef
Management
Database
Oracle Db
PostgreSQL
MongoDB
Computer Networking
Wireshark
Nmap
Tcpdump
Nagios
Apache JMeter

Summary

Your Opportunity

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together.

Schwab Technology Services enables the future of how clients manage their money by providing innovative and reliable technology products and services as part of our ongoing commitment to democratize access to investing and financial planning.

A Manager for Advisor Services Technology (AST) Infrastructure Operations SRE will lead the strategy, execution, and operational excellence of the application infrastructure ecosystem supporting AST platforms. This role is accountable for ensuring high availability, scalability, reliability, and performance through disciplined operational practices, life cycle management, and modern SRE principles. It requires oversight of all routine and strategic infrastructure initiatives, including operating system upgrades, patching, EOL remediation, infrastructure changes, middleware and database activities, cloud technologies and readiness, tooling modernization, and automation at scale. You will drive holistic capacity management, ensuring that compute, storage, network, and application-tier resources are designed and maintained to meet current and future business demand.

You will partner closely with architecture and application engineering teams to ensure infrastructure and platform components align with solution designs and support the long-term technical roadmap. The role also governs the organization's observability platforms - defining the telemetry strategy, metrics, SLOs, and alerting posture necessary to maintain operational health and reduce toil. You will lead ongoing improvements in automation, resilience engineering, disaster recovery readiness, and operational maturity, creating repeatable, well-engineered processes that support rapid change with minimal risk.

This role requires a deep understanding of enterprise infrastructure and security principles, excellent analytical skills, and the ability to communicate effectively with technical and non-technical stakeholders.

What you're good at
Strategic thinker who is passionate about application infrastructure reliability and efficiency.
Strong stakeholder engagement - able to work with application teams, I&O, and senior leadership. Drive consensus, negotiate priorities, and resolve conflicts.
Effective decision-maker driving solutions and leadership updates during high-pressure incidents.
Leads with integrity and sound judgment, showing the courage to uphold what's right in all situations.
Maintain a high standard of change management quality by enforcing rigor, reducing operational risk, and ensuring predictable, safe deployments.
Practice a Site Reliability Engineering mindset and solve problems through automation and instrumentation.
Identify opportunities to build innovative tools and solve unique operations problems on large enterprise and mission critical applications.
Drive continuous improvement via automation across infrastructure provisioning, configuration management, compliance, system health, and operational activities.
Monitor the current state of infrastructure to identify deficiencies caused by aging technologies used by the application or by misalignment with business requirements.
Analyze the business-IT environment (run, grow and transform the business) to detect critical deficiencies, and recommend solutions for improvement.
Govern change management practice, ensuring minimal service impact of infrastructure changes and activities.
Lead capacity planning across compute, storage and application tiers to ensure scalability and optimization.
Implement proactive monitoring and forecasting to prevent performance degradation across all supported platforms (on-prem and cloud technologies).
Partner with architecture teams to improve system resiliency, failover design, and scalability patterns.
Establish standards for tooling around runbooks, incident response, and environment configuration.
Lead complex incident triage and root-cause analysis, and drive action plans to eliminate recurrences.
Coordinate DR exercises, ensuring process and documentation accuracy and cross-team alignment.
Oversee cybersecurity risk, threat, and vulnerability programs.

What you have

Required Qualifications

Master's degree (such as a Master of Science) in Computer Science, Information Technology Management, Management Information Systems, or a related field.
10+ years of experience in Site Reliability Engineering and Production Operations.
Deep knowledge of application hosting patterns: distributed systems, microservices, message queues, caching, API gateways.
Expertise in managing infrastructure (VMware, Linux, Windows Server, SAN/NAS, load balancers, containers such as PCF) and configuration management.
Knowledge of cloud platforms (Google Cloud Platform, AWS, Azure) and cloud-native SRE practices.
Proven experience using automation and scripting for observability metrics and productivity enhancements, with languages and tooling such as Python, PowerShell, Bash, Ansible, SaltStack, Chef, and Terraform.
Strong working experience with observability platforms (Splunk, Grafana, AppDynamics, ITRS, Dynatrace, etc.).
Familiarity with secure coding practices and software development methodologies.
Excellent analytical and problem-solving skills to identify, assess, and prioritize production outage resolution effectively.
Strong understanding of service-level objectives (SLOs), error budgets, resilience patterns, and failure-mode analysis.
Solid working knowledge of Schwab resiliency policy, including the design of high availability and disaster recovery architectures.
Experience in security compliance and threat remediation.
Hands-on capacity management experience, including analyzing and forecasting resource utilization.

Preferred Qualifications

Google Cloud Certification - Associate Cloud Engineer.

Experience in software development and CI/CD pipelines (Bitbucket, GitHub) is beneficial.
Familiarity with security standards and frameworks. Knowledge of Veracode and Qualys scans, Chef InSpec, certificate management, and vulnerability remediation.
Knowledge of database platforms: Oracle DB, MS SQL Server, PostgreSQL, MongoDB.
Understanding of networking, monitoring, and load-testing tools such as Wireshark, Nmap, tcpdump, Nagios, and JMeter.

In addition to the salary range, this role is also eligible for bonus or incentive opportunities.

Dice Id: 90989465
Position Id: ee5446d492d72f69c12b4691abd83849