Lead Site Reliability Engineer

Remote • Posted 30+ days ago • Updated 1 hour ago

Full Time

Job Details

Skills

Performance Monitoring
High Availability
Continuous Improvement
Innovation
Trading
Network
Technical Direction
Budget
Root Cause Analysis
Incident Management
Operational Excellence
Scalability
Operational Efficiency
Reliability Engineering
DevOps
Production Engineering
Leadership
IT Management
Team Leadership
Management
Mentorship
SLA
Reporting
Microsoft Azure
IaaS
PaaS
Network Monitoring
Dynatrace
Dashboard
Performance Analysis
Software Performance Management

Summary

Join our team as a Lead Site Reliability Engineer to drive system reliability, observability, and performance monitoring for mission-critical digital trading products. You will lead monitoring initiatives in a high-availability trading environment, ensuring stable connectivity to external partners while proactively identifying opportunities for continuous improvement.

At EPAM, you'll work on cutting-edge technologies, solve complex challenges, and shape the future of digital innovation. With access to continuous learning, mentorship, and global projects, your expertise will drive meaningful change.

Req# 968473077

Responsibilities

Define and implement a strategic reliability vision for the trading portfolio, covering infrastructure, network connectivity, application performance, and throughput
Lead and oversee a team of SRE engineers, providing technical direction, mentorship, and performance guidance
Own and evolve the SLA/SLO/SLI framework, including error budgets and service health reporting
Configure and optimize comprehensive monitoring and alerting systems across infrastructure and applications
Drive observability best practices using APM and monitoring platforms (e.g., Dynatrace)
Analyze application and infrastructure performance to isolate fault domains and determine root causes of critical incidents
Lead major incident management, coordinate resolution efforts, and conduct blameless postmortems
Participate in 24x7x365 support rotation and ensure operational excellence across the team
Identify automation opportunities to improve reliability, scalability, and operational efficiency

Requirements

8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
Proven leadership experience (technical lead or team lead), with ability to oversee and mentor engineers
Strong hands-on experience with SLA/SLO/SLI definition, governance, and reporting
Solid experience working in Microsoft Azure environments (IaaS, PaaS, networking, monitoring)
Hands-on experience with Dynatrace (configuration, alerting, dashboards, performance analysis)
Experience with observability, monitoring, and APM tools in production environments
Ability to operate effectively under pressure in time-sensitive, high-impact environments

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10330481
Position Id: c969960b474d92264a87241b6d69f977