Key Skills:
Splunk, PowerShell or Python, Log Monitoring, Confluence, and SharePoint
Skill Requirements:
• Hands-on experience with IT monitoring tools (e.g., Nagios, Zabbix, Prometheus, Splunk, or similar).
• Understanding of scripting (PowerShell, Python, or Shell) for basic automation tasks.
• Understanding of AIOps concepts and automation frameworks.
• Proficiency in Confluence and SharePoint for status updates and documentation.
• Ability to interpret logs and detect anomalies proactively.
• Familiarity with ITIL processes for incident, problem, and change management.
• Experience using ticketing systems (e.g., ServiceNow, Jira, Remedy).
• Skilled in creating and updating runbooks and SOPs.
• Ability to follow documented procedures accurately.
• Strong attention to detail for maintaining health check reports and incident updates.
• Analytical thinking for quick problem identification and escalation.
• Excellent communication and documentation skills.
• Proactive mindset with a passion for reliability and automation.
• Strong problem-solving and debugging skills.
Preferred:
• ITIL Foundation Certification.
• Experience with anomaly detection, time-series forecasting, and log analysis.
• Basic certifications in monitoring tools or cloud platforms (AWS, Azure).
Key Responsibilities:
• Proactively monitor alerts and detect anomalies in logs.
• Perform daily health checks until full automation and application monitoring are implemented.
• Perform status checks as per existing runbooks.
• Create and update runbooks as needed to reflect current processes.
• Update system health status every 2 hours during the shift in Confluence or SharePoint.
• Acknowledge incidents promptly and route them to the correct team.
• Update incident status every 4 hours for P1/P2 tickets.
• Communicate with users and provide timely updates on their requests.
• Ensure timely acknowledgment, follow-up, and closure of incidents within SLA.
• Complete service tasks within SLA so that queues are cleared quickly.
• Work strictly as per SOPs documented by the team.
• Apply incident management processes and ITIL principles in day-to-day work.
• Follow documented procedures and create or update runbooks.
• Communicate and coordinate effectively across teams.
• Use Confluence, SharePoint, and ticketing systems for tracking and documentation.
• Implement best practices in ML operations and productionization.
• Ensure compliance with enterprise data security, governance, and regulatory requirements.
• Collaborate with data engineers, analysts, DevOps/SRE teams, and business teams to ensure reliability and security.
- Dice Id: 91098872
- Position Id: 5177-10115-