SRE Lead Platform Engineer Dynatrace & Azure - Fully remote

Remote • Posted 1 hour ago • Updated 1 hour ago
Contract W2
$70+

Job Details

Skills

  • Dynatrace
  • Azure

Summary

Job Title: SRE Lead Platform Engineer - Remote
Duration: 6 Months to Hire
Location: Fully remote, EST

The key skills for this Lead SRE Platform Engineer role are observability and monitoring (MELT data) using tools like Dynatrace, Datadog, and SCOM, strong Azure cloud and hybrid infrastructure knowledge, and DevOps automation with CI/CD, GitHub, and Terraform. The role also requires programming for automation (Python, C#, SQL) and strong experience with incident management, root cause analysis, and reliability engineering practices. At a lead level, the focus is on defining monitoring standards, improving system reliability, and guiding cross-team efforts to reduce outages and improve platform performance.

A typical day for this engineer would be a mix of monitoring system health, investigating reliability issues, improving observability, and leading automation and infrastructure improvements.

Role Summary
As a Lead SRE Platform Engineer, you will drive reliability engineering strategy and execution across critical IT Business Solutions platforms at Wegmans. This role focuses on improving uptime, performance, and operational efficiency through software enhancements, observability, automation, and data-driven root cause analysis (RCA).
You will serve as the technical lead for SRE practices: establishing monitoring standards, improving MELT (Metrics, Events, Logs, Traces) strategy, influencing tooling decisions, and partnering across infrastructure, development, operations, and vendor teams. This is a high-impact opportunity to build and mature reliability engineering capabilities from the ground up.

What You'll Do
Reliability & Observability Leadership
Define and mature SRE best practices across cloud and on-prem environments.
Design and implement comprehensive monitoring strategies using tools such as:
o Dynatrace
o Datadog
o Microsoft SCOM
Develop dashboards, alerts, synthetic testing, and proactive monitoring capabilities.
Establish and evolve a MELT data strategy to improve service reliability.
Provide data-driven RCA investigations and implement preventative solutions.

Platform & Application Reliability
Support and enhance reliability across:
Cloud & Infrastructure
o Microsoft Azure (software, storage, Azure Local)
o Hyper-V and legacy VMware environments
o NetApp and Pure storage platforms
o Azure Log Analytics
o Infrastructure as Code using Terraform
o Migration from Azure DevOps to GitHub (strong GitHub experience required)
Order Management Systems
o Azure-based, internally developed .NET/C# applications
o Internal message queuing systems
o Logging, analytics, and synthetic testing post-patching
o API-based integrations
Workforce & Payroll Platforms
o Workday (Payroll)
o ADP Vantage (Timekeeping)
Warehouse & Distribution Systems
o Blue Yonder Warehouse Management System (WMS)
o Vocollect handheld voice picking devices
o Network analytics for identifying dead zones and connectivity issues
o Barcode scanners and device connectivity troubleshooting

DevSecOps & Automation
Lead CI/CD reliability improvements (the Azure DevOps to GitHub transition is critical).
Enhance pipeline automation with embedded security controls.
Advance Infrastructure-as-Code standards (Terraform).
Improve configuration management and change governance.
Drive automation to reduce manual intervention and operational risk.

ITSM & Incident Management
Work within BMC ecosystem including:
o BMC Helix
o BMC Remedy
o BMC Server Automation
Optimize automated incident generation (SCOM-to-BMC workflows).
Improve triage, escalation, and impact modeling across services.
Monitor vendor performance and escalate appropriately.
Participate in off-hour escalation support when required.

Strategic Impact
Develop predictive reliability models using statistical techniques.
Identify systemic risk across production systems.
Guide tooling decisions (e.g., Dynatrace vs. Datadog or other observability platforms).
Ensure regulatory and operational compliance standards are met.
Facilitate cross-functional collaboration and document SRE procedures and planning artifacts.

Required Qualifications
5-7+ years of Software Engineering and Infrastructure/Database Engineering experience.
Deep expertise in:
o DevSecOps practices
o Observability platforms
o API integrations
o Performance management tools
o ITIL principles
o ITSM data analytics
o MELT data collection and analysis
Experience in Azure cloud environments.
Strong analytical and problem-solving skills.
Demonstrated ability to influence technical direction.
Excellent communication and cross-team collaboration skills.
Continuous improvement mindset focused on reliability engineering.

Preferred Qualifications
Strong programming experience in:
o .NET / C#
o Python
o SQL
Experience with MSSQL (primary) and Oracle (limited).
Experience with GitHub (critical for upcoming transition).
Agile/Scrum experience.
Knowledge of Reliability-Centered Engineering and maintenance strategies.
Experience with synthetic testing and proactive validation post-deployment.
Bachelor's degree in a related technical field.

Thank you,
Shiva Mittal

  • Dice Id: cxbcsi
  • Position Id: 2