Senior Site Reliability Engineer

Overview

On Site
Full Time

Skills

Finance
ICE
Mortgage
Regulatory Compliance
Reliability Engineering
Recovery
Trend Analysis
Cloud Computing
KPI
Collaboration
Functional Requirements
Incident Management
Computer Science
Computer Engineering
Mathematics
Amazon Web Services
Microsoft Windows Server
Microsoft Operating Systems
Linux
File Systems
Client/server
Network Protocols
Terraform
Progress Chef
Puppet
Ansible
DSC
Cloud Architecture
Fluency
Scripting
DevOps
Python
Windows PowerShell
Ruby
Perl
Java
.NET
Problem Solving
Conflict Resolution
Splunk
Grafana
Microsoft Exchange

Job Details

Job Purpose
ICE Mortgage Technology (IMT) is the leading cloud-based platform provider for the mortgage finance industry. ICE Mortgage Technology solutions enable lenders to originate more loans, reduce origination costs, and reduce the time to close, all while ensuring the highest levels of compliance, quality and efficiency.

This is an exciting opportunity for a Senior Engineer on the Site Reliability Engineering team to provide resilient and secure services, design reliable, scalable, and stable systems, and build actionable alerts and automation that prevent incidents and detect performance bottlenecks. A Senior Engineer will also troubleshoot issues quickly to restore service.

Responsibilities
  • Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
  • Work closely with development teams to ensure services are resilient and highly available.
  • Implement proactive monitoring, alerting, trend analysis, and self-healing systems.
  • Code and automate applications on the cloud platform.
  • Define and measure KPIs and SLOs.
  • Collaborate with Product and Support teams to plan and deploy product releases.
  • Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
  • Partner with other SREs and lead by example: be a contributor more than a delegator.
  • Manage incidents during high-stress issues and under tight timelines.
  • Follow the incident management lifecycle. Ensure issues are well documented and fixes are implemented so that incidents do not repeat.

Knowledge and Experience
  • 7+ years of Systems/Applications automation and incident response in 24x7 Production Services environments.
  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience.
  • Experience supporting large scale services running in AWS.
  • Knowledge of Windows Server and/or Linux systems internals (system libraries, file systems, kernel) and client-server network protocols.
  • Experience with IaC, utilizing tools like Terraform, CloudFormation, Spacelift, Chef, SaltStack, Puppet, Ansible, and DSC.
  • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems. Experience with elastic scaling, fault tolerance, and other cloud architecture patterns.
  • Fluency with one or more current-generation scripting languages used by DevOps professionals (Python, PowerShell, Ruby, Perl) or Java/.NET development.
  • Excellent troubleshooter, utilizing a systematic problem-solving approach.
  • Experience with observability tooling such as CloudWatch, Splunk, OpenTelemetry, Prometheus, and Grafana.

Intercontinental Exchange, Inc. is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to legally protected characteristics.