We are seeking an IT Monitoring / Full-Stack Observability Engineer to design, implement, and operate enterprise monitoring solutions across on-premises infrastructure and Microsoft Azure cloud environments. This role is responsible for delivering end-to-end observability (infrastructure, applications, services, logs, metrics, traces, and alerting), with a primary focus on SolarWinds (self-hosted) for on-prem monitoring and Azure Monitor for cloud monitoring.
The ideal candidate has strong technical depth in monitoring architecture, alert strategy, dashboarding, and operational response, and can partner with infrastructure, application, and security teams to ensure high availability and performance.
Monitoring / Observability Platform Ownership
Own and enhance enterprise monitoring using:
SolarWinds (Self-Hosted) for on-prem infrastructure monitoring (network, server, storage, virtualization).
Azure Monitor for Azure cloud monitoring (metrics, logs, alerts, workbooks).
Define and maintain standards for instrumentation, telemetry collection, alerting, and dashboards across hybrid environments.
On-Prem Monitoring (SolarWinds)
Administer and optimize SolarWinds platform components (e.g., Orion modules where applicable).
Configure monitoring for:
Network devices such as routers, switches, and firewalls (SNMP/ICMP)
Windows/Linux servers (WMI, SNMP, or agent, as applicable)
VMware/Hyper-V and storage platforms (as applicable)
Build actionable alerts with escalation policies, suppressions, dependencies, and maintenance windows.
Troubleshoot polling, credential, discovery, and performance issues for SolarWinds services and, as needed, the SQL back-end database.
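As an illustration of the suppression, dependency, and maintenance-window logic this role designs, here is a minimal platform-agnostic sketch. It does not call any SolarWinds API; the `Alert` class, the `should_notify` function, and the node names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Set, Tuple

Window = Tuple[datetime, datetime]  # (start, end), end exclusive


@dataclass
class Alert:
    node: str
    message: str
    raised_at: datetime


def should_notify(
    alert: Alert,
    down_nodes: Set[str],
    parents: Dict[str, str],
    maintenance: Dict[str, List[Window]],
) -> Tuple[bool, str]:
    """Decide whether an alert should page someone, and say why."""
    # 1. Suppress alerts raised during a scheduled maintenance window.
    for start, end in maintenance.get(alert.node, []):
        if start <= alert.raised_at < end:
            return False, "maintenance window"

    # 2. Suppress alerts whose upstream dependency is already down, so that
    #    one failed router does not page once per device behind it.
    seen: Set[str] = set()
    parent: Optional[str] = parents.get(alert.node)
    while parent is not None and parent not in seen:
        if parent in down_nodes:
            return False, f"parent {parent} is down"
        seen.add(parent)  # guards against accidental dependency cycles
        parent = parents.get(parent)

    return True, "actionable"


if __name__ == "__main__":
    parents = {"app01": "sw01", "sw01": "core-rtr"}
    down = {"core-rtr", "app01"}
    now = datetime(2024, 1, 1, 12, 0)
    print(should_notify(Alert("app01", "node down", now), down, parents, {}))
    print(should_notify(Alert("core-rtr", "node down", now), down, parents, {}))
```

In this sketch only the root-cause node produces a notification; everything behind it is suppressed with a stated reason, which keeps the decision auditable during post-incident review.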
Azure Monitoring (Azure Monitor / Log Analytics / Application Insights)
Implement an Azure-native monitoring strategy using:
Azure Monitor Metrics + Alerts
Log Analytics Workspaces
Application Insights (where applicable)
Workbooks for visualization and reporting
Create and maintain KQL queries for logs/insights and operational analytics.
Establish alert rules for Azure resources (VMs, AKS, App Services, Functions, SQL, Storage, networking, etc.).
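For a sense of the KQL work involved, the sketch below builds a query that lists machines whose most recent heartbeat is older than a threshold. It assumes the workspace collects the standard `Heartbeat` table; the Python wrapper `stale_heartbeat_query` is a hypothetical helper that only returns the query text and does not connect to Azure.

```python
def stale_heartbeat_query(threshold_minutes: int = 5) -> str:
    """Return KQL text listing computers with no recent heartbeat.

    Assumes the Log Analytics workspace has the standard Heartbeat table
    with its TimeGenerated and Computer columns.
    """
    if threshold_minutes <= 0:
        raise ValueError("threshold_minutes must be positive")
    return (
        "Heartbeat\n"
        "| summarize LastHeartbeat = max(TimeGenerated) by Computer\n"
        f"| where LastHeartbeat < ago({threshold_minutes}m)\n"
        "| order by LastHeartbeat asc"
    )


if __name__ == "__main__":
    # The printed text can be pasted into a Logs query window or used as
    # the basis for a log search alert rule.
    print(stale_heartbeat_query(10))
```

Keeping queries in version-controlled code like this is one way to make alert definitions reviewable and reusable across workspaces.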
Full-Stack Observability Practices
Drive adoption of observability best practices:
Golden signals (latency, traffic, errors, saturation)
SLOs/SLIs (where applicable)
Noise reduction and alert fatigue prevention
Ensure dashboards tell an operational story (health, performance, capacity, and trends).
Support incident response by correlating signals across on-prem and cloud environments.
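To make the SLO/SLI practice concrete, here is a minimal sketch of an availability SLI and error-budget calculation over a request count. The names `SloReport` and `evaluate_slo` and the sample figures are illustrative assumptions, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # measured success ratio, 0.0 to 1.0
    allowed_failures: float # failures the SLO tolerates in this window
    budget_consumed: float  # fraction of the error budget used (1.0 = exhausted)


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a measured success ratio against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures
    return SloReport(sli, allowed_failures, budget_consumed)


if __name__ == "__main__":
    # Hypothetical window: one million requests, 500 failures, 99.9% target.
    report = evaluate_slo(1_000_000, 500, 0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Error budget consumed: {report.budget_consumed:.0%}")
```

Alerting on how quickly the budget is being consumed, rather than on every individual error, is one common way to reduce noise while still protecting the objective.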
Automation, Integration & ITSM
Automate monitoring configuration and reporting through scripting (PowerShell, Python) and IaC (Terraform/Bicep as appropriate).
Integrate monitoring alerts with ITSM tools (e.g., ServiceNow/Jira/Remedy) and collaboration channels (Teams/Email).
Support continuous improvement through post-incident reviews and monitoring enhancements.
Documentation & Governance
Maintain runbooks, SOPs, monitoring standards, and service maps.
Ensure monitoring adheres to security and compliance requirements (access controls, log retention, least privilege).