Senior Monitoring and Observability Lead 17+ years of experience

Hybrid in NY, NY, US • Posted 4 hours ago • Updated 4 hours ago
Full Time
On-site

Job Details

Skills

  • Migration
  • Time Series
  • Network Monitoring
  • WMI
  • Data Collection
  • WAN
  • LAN
  • Wireless Communication
  • Firewall
  • Load Balancing
  • Computer Networking
  • Technical Writing
  • Auditing
  • Data Deduplication
  • Mapping
  • Analytics
  • Cost Management
  • Linux
  • Virtualization
  • VMware
  • Hyper-V
  • Storage
  • Backup
  • Encryption
  • RBAC
  • ITIL
  • Security+
  • Microsoft Azure
  • Amazon Web Services
  • Google Cloud
  • Google Cloud Platform
  • KPI
  • Organizational Skills
  • TACACS+
  • RADIUS
  • SSO
  • Communication
  • Evaluation
  • Scalability
  • Meta-data Management
  • ROOT
  • Instrumentation
  • Build Automation
  • Scripting
  • Windows PowerShell
  • Python
  • Onboarding
  • Workflow
  • IT Service Management
  • Routing
  • ServiceNow
  • Hardening
  • Reporting
  • Regulatory Compliance
  • Servers
  • Cloud Computing
  • Leadership
  • Collaboration
  • System Administration
  • Network Administration
  • Cisco
  • Switches
  • Cisco Nexus
  • Configuration Management
  • Version Control
  • Nexus
  • Network
  • Splunk
  • Dynatrace
  • SolarWinds
  • SNMP
  • NetFlow
  • Streaming
  • Normalization
  • Dashboard
  • High Availability
  • Capacity Management
  • Backup Administration
  • Recovery
  • Testing
  • Network Design
  • Management
  • Firmware
  • Microsoft Windows
  • Change Management
  • Documentation
  • Training
  • LinkedIn

Summary

Job Title: Senior Monitoring and Observability Lead 17+ years of experience

Location: New York, NY (Hybrid)

We are currently seeking candidates who meet the following qualifications.

Mandatory Qualifications

Enterprise Platform Evaluation & Implementation: Ability to evaluate tools such as Datadog, Splunk, Dynatrace, and SolarWinds, define selection criteria, and deliver a hands-on implementation plan and migration approach.
Telemetry Fundamentals: Strong understanding of logs/metrics/traces, event correlation, time-series data, and dashboard construction; familiarity with modern instrumentation patterns (OpenTelemetry preferred).
Infrastructure & Network Monitoring: Practical knowledge of SNMP, syslog, WMI, APIs, and agent-based data collection; comfort monitoring WAN/LAN/Wi-Fi performance, firewall/load balancer signals, and critical service dependencies.
Cloud Monitoring: Experience monitoring workloads and services in at least one major cloud (Azure/AWS/Google Cloud Platform), including identity, networking, and compute telemetry.
ITSM / Workflow Integration: Experience integrating monitoring with ticketing, routing, escalation, and knowledge workflows; ability to design severity and ownership models.
Documentation & Governance: Ability to write clear technical documentation, standards, and runbooks suitable for institutional and audit needs.
AIOps capabilities such as anomaly detection, dynamic baselining, event deduplication, correlation, and predictive insights.
Service topology mapping, dependency analysis, and service health models (SLIs/SLOs preferred).
Datadog, Splunk Observability, Dynatrace, SolarWinds, or comparable enterprise observability platforms.

Centralized logging and analytics approaches; understanding of retention, indexing/cost management, and governance.
Windows/Linux monitoring, virtualization platforms (VMware/Hyper-V), storage and backup monitoring, network performance and configuration monitoring.
Operational alignment with CIS Benchmarks and secure monitoring practices (least privilege, secrets handling, encryption in transit, RBAC, auditability).
Relevant certifications (preferred, not required): ITIL Foundation, Security+, cloud certifications (Azure/AWS/Google Cloud Platform), vendor observability certifications.
Experience producing executive dashboards and institutional KPI reporting (availability, performance, incident trends, capacity, risk posture).
Ability to analyze complex systems, identify root causes, and implement durable fixes.
Ability to communicate clearly with both technical and administrative audiences.
Strong organizational skills and ability to prioritize competing needs.
Service-oriented mindset aligned to the institution's mission and stakeholder support expectations.
Experience with Cisco enterprise operations tooling and integration, such as TACACS+/RADIUS, SSO, certificate lifecycle, device compliance/drift detection, and automated configuration deployment workflows.
Familiarity with campus-scale operational needs (change windows tied to academic schedules, distributed support models, and stakeholder communication).

Duties/Responsibilities:

Provide support for SolarWinds alerting through current integrations, implement upgrades and enhancements, and enable features.

Design and implement an end-to-end observability approach spanning metrics, logs, traces, and events across on-prem and cloud environments.

Lead hands-on evaluation and implementation efforts for enterprise platforms including Datadog, Splunk Observability, Dynatrace, and SolarWinds, aligning tool capabilities to institutional requirements (availability, performance, security, scalability, cost).

Build and maintain telemetry collection standards (agent-based and agentless), tagging/metadata conventions, and service dependency views to improve root-cause isolation and service health reporting.

Establish durable operating practices for instrumentation, onboarding, configuration management, lifecycle upgrades, and platform reliability.

Implement alerting strategies that prioritize actionable notifications, reduce noise, and improve mean time to detect (MTTD) and mean time to resolve (MTTR).

Develop and tune thresholds, dynamic baselines, anomaly detection, and intelligent event correlation (AIOps) to support 24x7 service reliability.

Support other infrastructure teams in creating runbooks, escalation standards, and response procedures. The role may require occasional support to fix issues hampering the alerting and monitoring systems.

Contribute to post-incident reviews with measurable improvement outcomes such as alert tuning, automation, capacity adjustments, and resilience enhancements.

Build automation using APIs and scripting (PowerShell/Python) to standardize onboarding, reduce repetitive operations, and support self-service dashboards for campus IT teams.

Integrate monitoring and alerting with enterprise workflows such as ITSM ticketing and routing through ServiceNow.

Implement observability-as-code practices where feasible for repeatable deployment, configuration drift reduction, and consistent governance.

Partner with CUNY Infrastructure and Security teams to strengthen configuration practices aligned to CIS Benchmarks and other institutional hardening standards.

Build and maintain executive dashboards and reporting that highlight configuration drift, operational risks, and compliance posture relevant to servers, endpoints, network devices, and cloud resources.

Ensure observability agents, collectors, and integrations follow least-privilege access, secure credential handling, and approved data-handling practices.

Translate technical telemetry into practical insights for infrastructure teams and leadership (service health, risk trends, capacity indicators, etc.).

Collaborate with application owners and campus IT teams to improve visibility into service dependencies and user-impacting issues.

Provide hands-on systems administration for campus and data center network management platforms, including Cisco Catalyst switching environments and Cisco Nexus Dashboard.

Implement and maintain configuration management practices: backups, version control, golden configurations, drift detection, and standardized deployment patterns for Catalyst and Nexus environments.

Enable observability outcomes by integrating network telemetry with the enterprise monitoring/observability platform(s) (e.g., Datadog, Splunk Observability, Dynatrace, SolarWinds), including SNMP polling/traps, syslog, NetFlow/IPFIX (where applicable), and streaming telemetry.

Normalize naming/tagging conventions for campus and data center devices to support accurate service maps, dashboards, and incident triage.

Support high availability and resilience by managing platform health, capacity planning, backups/restore testing, and continuity procedures for infrastructure management, monitoring, alerting, and observability services.

Administer lifecycle operations for network infrastructure and management tooling, including software/firmware upgrades, image standardization, patching, and coordinated maintenance windows aligned with institutional change management practices.

Produce clear documentation and training materials to support adoption and consistent operational practices.

If you meet these qualifications, please submit your application via the link provided on LinkedIn.
Kindly do not call the general line to submit your application.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91120711
  • Position Id: 2026-6255

Company Info

About ACS Consultancy Services, Inc.

ACS Consultancy Services Inc. (ACS) is a New York-based consulting firm that specializes in providing technology solutions. The company, which was established in 2011, has received several certifications, including Minority Business Enterprise, Woman Business Enterprise, WOSB, 8a, and NYS/NYC Women Owned (NYS WBE). The founder and President of ACS, Asha Ramrakhiani, has over 20 years of leadership experience working with various New York State agencies.

Leveraging the extensive experience of its leadership team in working with the US Government, ACS offers IT consulting and project-based services to state and federal agencies. The company has been recognized by the Center for Digital Government for its exceptional experience in collaborating with government agencies, having received the "Best Application Serving a department or Agency's Business Needs" award in the Project Excellence category as part of the Best of New York Awards.

ACS provides IT consulting and staff augmentation services to more than 50 clients in the state of New York, connecting them with over 100 local technology professionals with expertise in the latest technologies. The company focuses on providing best-in-class certified local talent for information technology job categories, providing extended local support to ensure that NYS clients receive relevant consulting services without the need for redundant recruitment stages.

ACS is committed to delivering professional consulting support on strategic initiatives and optimal technology solutions to local, state, and commercial customers. The company takes pride in delivering quality services that exceed customer expectations and drive business success. With its strong leadership team and commitment to excellence, ACS has established itself as a leader in the IT consulting industry and is rapidly expanding.
