POSITION PURPOSE:
The incumbent will be a subject matter expert on the monitoring tools and processes used by the Commonwealth and is responsible for collaborating with technical specialists, agency teams, and vendors to implement actionable monitoring and reporting. The position’s responsibilities also include coordinating efforts to transform person-centric processes into structured, repeatable, and well-documented automated workflows. Additionally, this position is responsible for the management and continuous improvement of key enterprise monitoring processes, including change management, incident reporting, and problem resolution. The incumbent will also evaluate, prepare, and implement technical solutions for on-premises and cloud-based applications and technology resources. The position will develop and maintain standard operating procedures (SOPs) and ensure consistent communication strategies to enhance operational efficiency and service delivery.
DESCRIPTION OF DUTIES:
· Drive process and tooling improvements - Identify gaps and implement automation-first practices to reduce manual effort and improve service quality.
· Maintain endpoint monitoring connectivity - Ensure reliable telemetry ingestion via agents, SNMP, WMI, and APIs; manage certificates and credentials across hybrid networks.
· Own documentation and knowledge management - Create and maintain runbooks, SOPs, service maps, and workflows in an organized, version-controlled repository; ensure accessibility and periodic review.
· Document incidents and problems with observability context - Capture monitoring data in ServiceNow tickets; produce post-incident reviews and maintain a Known Error Database.
· Collaborate on change, incident, and problem management - Work with Enterprise Change and Incident Management teams to ensure standardized processes, risk assessments, and communication plans are followed.
· Monitor resolution performance and service restoration - Track SLAs, MTTR, and root cause analysis quality; ensure corrective actions are implemented and validated.
· Standardize communication and stakeholder updates - Implement structured communication workflows for changes, incidents, and problems; manage distribution lists and enable self-service subscription options.
· Ensure compliance with Commonwealth IT policies - Align services with public and enterprise policy objectives; recommend updates to improve reliability, security, and cost efficiency.
· Utilize ServiceNow for change management - Create and track Requests for Change, link risk assessments, and validate post-change monitoring health.
· Provide SLA reporting and operational metrics - Submit accurate data on availability, incidents, and enhancements for monthly and quarterly SLA reports.
· Design and test disaster recovery plans - Define RTO/RPO for network and monitoring infrastructure; execute periodic DR exercises and update plans.
· Maintain technical currency - Stay current with emerging monitoring technologies and best practices; pursue relevant training and certifications.
· Fulfill Continuity of Government (CoG) obligations - Perform assigned duties during CoG activation, including relocation to alternate sites during catastrophic incidents.
· Adhere to IT service management processes - Operate within ITIL-aligned frameworks; contribute to process maturity and compliance audits.
Qualifications
Required
Education/Experience
- 5+ years of experience in IT infrastructure monitoring, automation, and observability in hybrid environments.
- Bachelor's degree in Information Technology, Computer Science, or a related field.
Technical Skills
- Strong proficiency in PowerShell and at least one other scripting language (e.g., Python, Bash, SQL).
- Hands-on experience with Azure Monitor, Log Analytics, Ansible, SQL, and KQL.
- Experience implementing automation using Azure Automation and CI/CD pipelines.
- Expertise in monitoring platforms such as SCOM, SquaredUp, or equivalent (e.g., Dynatrace, Datadog, Splunk).
- Knowledge of API integration and secure authentication.
Process & Frameworks
- Working knowledge of ITIL 4 practices (Change, Incident, Problem Management).
- Experience with ServiceNow or similar ITSM platforms.
Other
- Strong troubleshooting and root cause analysis skills.
- Excellent documentation and communication abilities.
Preferred
- Certifications:
- Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.
- ITIL 4 Foundation or higher.
- Experience with:
- SquaredUp or equivalent dashboarding tools.
- Disaster Recovery planning and testing.
- Performance tuning and capacity planning for monitoring platforms.
- Familiarity with:
- Security best practices for API and automation scripts.
- Hybrid cloud environments and networking fundamentals.