This position is responsible for implementing and maintaining an enterprise-wide monitoring solution within a 24/7 production environment.
Required Skills:
Enterprise monitoring and observability, including application performance monitoring (APM), infrastructure monitoring, and event management.
Experience with AppDynamics (preferred); experience with at least three of the following server management products will also be considered: Prometheus, Azure Monitor, SCOM, BMC AIOps, or WhatsUpGold.
Networking experience in an enterprise environment.
Experience working with both Windows and UNIX/Linux-based systems in an enterprise environment, including advanced shell scripting.
Strong understanding of cloud-native monitoring principles (especially Azure).
Strong grasp of enterprise application architectures and monitoring challenges across distributed systems.
Self-motivated, detail-oriented, and capable of working independently or in cross-functional teams, with the ability to operate effectively in fast-paced, incident-driven environments.
Roles and Responsibilities:
-Design, deploy, and manage end-to-end monitoring and observability solutions across infrastructure, applications, databases, and cloud services.
-Administer, maintain, and support enterprise monitoring tools, including routine upgrades, license management, and system tuning.
-Manage and maintain existing monitors across platforms, ensuring they remain relevant, accurate, and aligned with evolving system and business requirements.
-Review and recommend alert thresholds, KPIs, and escalation criteria to ensure actionable and meaningful alerting.
-Collaborate with application, infrastructure, database, and network teams to define monitoring strategies for performance, availability, and reliability.
-Integrate tools such as AppDynamics, Prometheus, Azure Monitor, SCOM, BMC AIOps, WhatsUpGold, and xMatters into a cohesive observability platform.
-Build and maintain dashboards, metrics visualizations, and alert rules using Azure Monitor, Application Insights, and Grafana.
-Support synthetic and real-user monitoring to measure application performance and user experience.
-Streamline alerting workflows and escalation paths using tools like xMatters to support 24/7 incident response.
-Help teams identify monitoring gaps, reduce noise and false positives, and enhance correlation using tools like BMC AIOps.
-Oversee monitoring of SSL certificates, URLs, and Key Vault secrets, ensuring timely alerting and proactive renewal before expiration.
-Analyze telemetry and diagnostic data to aid troubleshooting and root cause analysis of incidents.
-Provide monitoring expertise and implementation support for IT projects and new technology rollouts.
-Document tool configurations, monitoring policies, operational procedures, and troubleshooting guides.
-Continuously evaluate observability coverage to ensure system visibility, scalability, and security.
-Deliver periodic reports on system health, capacity trends, alert volumes, and KPIs to leadership and stakeholders.