Manager, Software Development & Engineering

Southlake, TX, US • Posted 4 hours ago • Updated 4 hours ago
Full Time
On-site

Job Details

Skills

  • Creative Problem Solving
  • Finance
  • Financial Planning
  • Risk Management
  • Stakeholder Communications
  • Dashboard
  • Batch File
  • Documentation
  • Version Control
  • Collaboration
  • Mentorship
  • Reliability Engineering
  • Production Support
  • Root Cause Analysis
  • Operational Risk
  • SLA
  • Splunk
  • Computer Networking
  • Database
  • Scripting
  • Python
  • Shell
  • Bash
  • Windows PowerShell
  • GitHub
  • Software Configuration
  • Software Release Life Cycle
  • Continuous Integration
  • Continuous Delivery
  • Failover
  • RPO
  • Testing
  • SQL
  • Data Validation
  • JIRA
  • Scrum
  • BMC Remedy
  • Authorization
  • Communication
  • Docker
  • Linux
  • Microsoft Windows
  • High Availability
  • AppDynamics
  • Grafana
  • ExtraHop
  • Performance Monitoring
  • Enterprise Storage
  • Access Control
  • Encryption
  • Storage
  • Disaster Recovery
  • Recovery
  • Oracle
  • Workflow
  • Offshoring
  • Financial Services
  • Software Maintenance
  • IT Service Management
  • Incident Management
  • Software Design
  • Agile
  • Software Development
  • Management
  • Regulatory Compliance
  • Strategic Thinking

Summary

Your Opportunity

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together.

Schwab Technology Services enables the future of how clients manage their money by providing innovative and reliable technology products and services as part of our ongoing commitment to democratize access to investing and financial planning.

This is a senior technical role focused on Site Reliability Engineering for critical enterprise applications and platforms. The role combines hands-on production support, observability, incident prevention, release reliability, automation, operational resilience, and support for compliance and regulatory expectations in enterprise environments.
The position supports high-impact incident response, improves operational standards, mentors onshore and offshore engineers, and communicates clearly with both technical and business stakeholders. It is a strong fit for someone who wants to improve reliability, reduce operational risk, and scale support through automation and better engineering practices.

Key Responsibilities

Lead production support, operational readiness, and reliability risk management for critical services and dependencies. Manage major incident triage, escalation, recovery, stakeholder communications, and closure activities, including coordination through Remedy or similar enterprise ticketing and incident management tools, with execution aligned to SLAs.
Work closely with Development and Business Product Owner teams to align reliability priorities, release readiness, and incident communication; identify SLIs, determine SLOs, and plan remediations aligned to business outcomes.
Improve observability through dashboards, alerting, event correlation, and trend-based early warning. Support release reliability through deployment validation, rollback preparedness, readiness checks, and post-release verification.
Build and maintain automation using Python, Bash, Windows Batch scripting, and PowerShell to standardize support processes, improve recovery actions, create reusable solutions, and reduce toil.
Develop automation for monitoring, deployment validation, routine operational tasks, recovery procedures, incident response workflows, and process efficiency improvements.
Support disaster recovery planning, zonal isolation planning and execution, recovery testing, certificate-related operational needs, and secure production readiness.
Support compliance and regulatory requirements through disciplined operational controls, documentation, and reliable execution.
Use GitHub and other software configuration management tools for source control, collaboration, workflow support, and governance.
Apply security knowledge and access grouping concepts to support secure operations, platform access controls, and operational readiness.
Mentor engineers on troubleshooting, automation, SRE and observability disciplines, and cross-time-zone handoffs, and contribute to architecture reviews to improve operability, resilience, and maintainability.

What you have

Required Qualifications

Strong experience in Site Reliability Engineering, observability, production support, and enterprise platform operations.
Proven experience managing major incidents, root cause analysis, service account or password restoration, and operational risk reduction in complex production environments with strong SLA-driven execution.
Strong hands-on experience with Splunk or similar monitoring and observability platforms.
Strong troubleshooting skills across applications, infrastructure, platforms, networking, databases, storage, and integrated service dependencies.
Strong scripting and automation skills using Python, Shell/Bash, Windows Batch, and PowerShell to improve operational support, monitoring, deployment validation, recovery procedures, and repetitive task reduction.
One year of Schwab technology domain experience gained as a current or recent contractor or employee.
Experience building reusable automation solutions that improve consistency and reduce manual effort and toil.
Experience with GitHub and other software configuration management tools. Experience in build and release management, CI/CD practices, deployment controls, and release reliability processes.
Experience supporting applications on PCF and operating in distributed production environments.
Working knowledge of resiliency and recovery, including HA patterns, zonal isolation, failover/failback, RTO/RPO, recovery testing, and post-recovery validation.
Ability to provide operational support, including reading and writing SQL queries for troubleshooting and data validation.
Familiarity with Jira and Scrum concepts, along with experience using Remedy or similar enterprise incident and ticket management platforms.
Understanding of security concepts and grouping models, including access controls, security groups, role-based access, or similar enterprise authorization practices.
Strong written and verbal communication skills, including the ability to explain technical issues, risks, and remediation plans to technical and business audiences.

Preferred Qualifications

Familiarity with Docker, Linux and Windows production environments, and high-availability distributed systems.
Experience with AppDynamics, Grafana, ThousandEyes, ExtraHop, or similar observability and performance monitoring tools.
Experience designing automation for alert correlation, deployment validation, recovery actions, and operational handoff workflows.
Familiarity with enterprise storage models, including NAS, covering access control and permissions, encryption, storage quotas, retention/lifecycle controls, and operational troubleshooting.
Experience supporting disaster recovery exercises, zonal resilience strategies, and post-recovery validation.
Familiarity with MSSQL or Oracle, certificate lifecycle processes, secure transport, and enterprise operational controls.
Understanding of business workflow concepts, including upstream/downstream dependencies, client-request SLAs, and failure impact, and the ability to map reliability issues to end-to-end business outcomes.
Experience working across onshore and offshore support teams and in financial services or other highly regulated environments.

Job Sub-Family Specific Competencies
  • Application Maintenance and Support - Delivering effective management and technical services to address technical issues and minimize disruption to application users
  • Incident Response - Resolving reported incidents through streamlined processes, minimizing disruptions, and promptly restoring services
  • Software Design and Specifications - Developing software solutions that meet requirements using established design principles and standards, employing predictive or adaptive design techniques, including plan-driven or iterative/agile approaches
  • Software Development - Implementing standards, processes, and methods to create, test, and verify software components, ensuring reliability and resolving operational problems and bugs
  • Software Release and Deployment - Managing the deployment of software updates while ensuring compliance with safety, security, and quality standards
  • Strategic Thinking - Analyzing an organization's competitive position and developing a clear and compelling vision of what the organization needs for success in the future

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

  • Dice Id: 90989465
  • Position Id: c005a17b3c100adab3563747ee8817eb