Job Title: Senior Monitoring and Observability Lead (17+ years of experience)
Location: New York, NY (Hybrid)
We are currently seeking candidates who meet the following qualifications.
Mandatory Qualifications
Enterprise Platform Evaluation & Implementation: Ability to evaluate tools such as Datadog, Splunk, Dynatrace, and SolarWinds, define selection criteria, and deliver a hands-on implementation plan and migration approach.
Telemetry Fundamentals: Strong understanding of logs/metrics/traces, event correlation, time-series data, and dashboard construction; familiarity with modern instrumentation patterns (OpenTelemetry preferred).
Infrastructure & Network Monitoring: Practical knowledge of SNMP, syslog, WMI, APIs, and agent-based data collection; comfort monitoring WAN/LAN/Wi-Fi performance, firewall/load balancer signals, and critical service dependencies.
Cloud Monitoring: Experience monitoring workloads and services in at least one major cloud (Azure/AWS/Google Cloud Platform), including identity, networking, and compute telemetry.
ITSM / Workflow Integration: Experience integrating monitoring with ticketing, routing, escalation, and knowledge workflows; ability to design severity and ownership models.
Documentation & Governance: Ability to write clear technical documentation, standards, and runbooks suitable for institutional and audit needs.
AIOps capabilities such as anomaly detection, dynamic baselining, event deduplication, correlation, and predictive insights.
Service topology mapping, dependency analysis, and service health models (SLIs/SLOs preferred).
Datadog, Splunk Observability, Dynatrace, SolarWinds, or comparable enterprise observability platforms.
Centralized logging and analytics approaches; understanding of retention, indexing/cost management, and governance.
Windows/Linux monitoring, virtualization platforms (VMware/Hyper-V), storage and backup monitoring, network performance and configuration monitoring.
Operational alignment with CIS Benchmarks and secure monitoring practices (least privilege, secrets handling, encryption in transit, RBAC, auditability).
Relevant certifications (preferred, not required): ITIL Foundation, Security+, cloud certifications (Azure/AWS/Google Cloud Platform), vendor observability certifications.
Experience producing executive dashboards and institutional KPI reporting (availability, performance, incident trends, capacity, risk posture).
Ability to analyze complex systems, identify root causes, and implement durable fixes.
Ability to communicate clearly with both technical and administrative audiences.
Strong organizational skills and ability to prioritize competing needs.
Service-oriented mindset aligned to the institution's mission and stakeholder support expectations.
Experience with Cisco enterprise operations tooling and integrations, such as TACACS+/RADIUS, SSO, certificate lifecycle, device compliance/drift detection, and automated configuration deployment workflows.
Familiarity with campus-scale operational needs (change windows tied to academic schedules, distributed support models, and stakeholder communication).
Duties/Responsibilities:
Provide support for SolarWinds alerting through current integrations, implement upgrades and enhancements, and enable features.
Design and implement an end-to-end observability approach spanning metrics, logs, traces, and events across on-prem and cloud environments.
Lead hands-on evaluation and implementation efforts for enterprise platforms including Datadog, Splunk Observability, Dynatrace, and SolarWinds, aligning tool capabilities to institutional requirements (availability, performance, security, scalability, cost).
Build and maintain telemetry collection standards (agent-based and agentless), tagging/metadata conventions, and service dependency views to improve root-cause isolation and service health reporting.
Establish durable operating practices for instrumentation, onboarding, configuration management, lifecycle upgrades, and platform reliability.
Implement alerting strategies that prioritize actionable notifications, reduce noise, and improve time-to-detect (MTTD) and time-to-resolve (MTTR).
Develop and tune thresholds, dynamic baselines, anomaly detection, and intelligent event correlation (AIOps) to support 24x7 service reliability.
Support other infrastructure teams in creating runbooks, escalation standards, and response procedures. The role may require occasional support to fix issues hampering alerting and monitoring systems.
Contribute to post-incident reviews with measurable improvement outcomes such as alert tuning, automation, capacity adjustments, and resilience enhancements.
Build automation using APIs and scripting (PowerShell/Python) to standardize onboarding, reduce repetitive operations, and support self-service dashboards for campus IT teams.
Integrate monitoring and alerting with enterprise workflows such as ITSM ticketing and routing through ServiceNow.
Implement observability-as-code practices where feasible for repeatable deployment, configuration drift reduction, and consistent governance.
Partner with CUNY Infrastructure and Security teams to strengthen configuration practices aligned to CIS Benchmarks and other institutional hardening standards.
Build and maintain executive dashboards and reporting that highlight configuration drift, operational risks, and compliance posture relevant to servers, endpoints, network devices, and cloud resources.
Ensure observability agents, collectors, and integrations follow least-privilege access, secure credential handling, and approved data-handling practices.
Translate technical telemetry into practical insights for infrastructure teams and leadership (service health, risk trends, capacity indicators, etc.).
Collaborate with application owners and campus IT teams to improve visibility into service dependencies and user-impacting issues.
Provide hands-on systems administration for campus and data center network management platforms, including Cisco Catalyst switching environments and Cisco Nexus Dashboard.
Implement and maintain configuration management practices: backups, version control, golden configurations, drift detection, and standardized deployment patterns for Catalyst and Nexus environments.
Enable observability outcomes by integrating network telemetry with the enterprise monitoring/observability platform(s) (e.g., Datadog, Splunk Observability, Dynatrace, SolarWinds), including SNMP polling/traps, syslog, NetFlow/IPFIX (where applicable), and streaming telemetry.
Normalize naming/tagging conventions for campus and data center devices to support accurate service maps, dashboards, and incident triage.
Support high availability and resilience by managing platform health, capacity planning, backups/restore testing, and continuity procedures for infrastructure management, monitoring, alerting, and observability services.
Administer lifecycle operations for network infrastructure and management tooling, including software/firmware upgrades, image standardization, patching, and coordinated maintenance windows aligned with institutional change management practices.
Produce clear documentation and training materials to support adoption and consistent operational practices.
If you meet these qualifications, please submit your application via the link provided on LinkedIn.
Kindly do not call the general line to submit your application.