Location: USA / REMOTE
Key Responsibilities
1. SRE Fundamentals & Reliability Engineering
Apply core SRE principles, including:
Definition and governance of SLIs, SLOs, and SLAs
Error budgets and reliability trade-offs
Incident management and postmortems
Partner with SRE L2/L3 teams to improve system reliability and performance
2. Observability Strategy & Tool Recommendation (Core Responsibility)
Act as the central point of expertise for Splunk and Dynatrace capabilities
Analyze requirements provided by:
Application developers
SRE L2/L3 engineers
Research and determine:
Whether requirements can be fulfilled using Splunk, Dynatrace, or both
The most efficient, scalable, and cost-effective implementation approach
Translate business and technical requirements into tool-specific solutions
Recommend best practices, design patterns, and architecture for observability
Continuously evaluate new features and enhancements in Splunk and Dynatrace
3. Splunk Engineering
Design and optimize Splunk-based logging and monitoring solutions
Develop advanced SPL queries, dashboards, and alerts
Define log onboarding strategies and data models
Ensure data quality, governance, and cost efficiency
Provide guidance on when and how to use Splunk effectively
4. Dynatrace Expertise
Configure and optimize Dynatrace for APM, RUM, and synthetic monitoring
Leverage AI-driven anomaly detection and root cause analysis
Map business transactions and critical user journeys
Guide teams on making the best use of Dynatrace capabilities
5. Azure Observability
Implement and integrate monitoring solutions within Microsoft Azure
Work with services such as:
Azure App Services, AKS, Azure Functions
Azure Monitor, Log Analytics, Application Insights
Ensure seamless integration between Azure, Splunk, and Dynatrace
6. Automation & Enablement
Develop automation scripts using Python, PowerShell, or Bash
Enable self-service observability for engineering teams
Integrate monitoring tools with ServiceNow, Jira, or similar platforms
Provide documentation, standards, and reusable templates
7. Collaboration & Advisory
Act as a trusted advisor to developers and SRE teams
Conduct requirements intake sessions and translate the gathered requirements into solutions
Provide training and guidance on observability best practices
Drive adoption of standardized monitoring approaches across teams
Required Qualifications
5+ years of experience in SRE, DevOps, or Observability Engineering
Strong understanding of SRE fundamentals (SLIs, SLOs, error budgets, incident management)
Deep hands-on experience with:
Splunk (log ingestion, SPL, dashboards, alerting)
Dynatrace (APM, RUM, synthetic monitoring)
Strong expertise in Microsoft Azure
Experience supporting large-scale, customer-facing platforms
Proficiency in scripting (Python, PowerShell, or Bash)
Strong analytical and problem-solving skills
Preferred Qualifications
Experience in retail/e-commerce environments
Knowledge of microservices and distributed systems
Experience with AKS, Docker, and containerized environments
Familiarity with additional observability tools (Prometheus, Grafana, ELK)
Certifications in Splunk, Dynatrace, or Azure