Senior Systems Monitoring Engineer, IT Event Management

Overview

On Site
Full Time

Skills

Adobe AIR
AIM
Optimization
Innovation
System Monitoring
Real-time
DevOps
Data Centers
IT Operations
IT Service Management
Management
IaaS
Network
Analytics
Root Cause Analysis
Machine Learning (ML)
Configuration Management Database
Mentorship
Research
Problem Management
Training
Leadership
Finance
Coaching
Partnership
FOCUS
Sustainability
Insurance
Legal
Dynatrace
ServiceNow
Event Management
Relational Databases
Cloud Computing
Amazon Web Services
Conflict Resolution
Problem Solving
Analytical Skill
Communication
Collaboration
Reporting
Dashboard
Grafana
Microsoft Power BI
Computer Science

Job Details

At Delta Air Lines, connection is at the heart of everything we do and guides our every action. We strive to welcome and care for all our customers during their travels with us and aim to deliver an elevated experience.

Delta is focused on sustaining a strong IT operation, growing our capabilities, and maximizing optimization across each of our tech hubs to elevate the travel experience for our customers and empower our 90,000 Delta people.

We're committed to fostering innovation, and we're excited to invite you to be part of our journey as we shape the future of technology at the world's best airline!

This role will serve as a Senior Systems Monitoring Engineer for the Enterprise Monitoring/Observability team. It requires a strong understanding of Event Management and of real-time monitoring and alerting frameworks that prevent issues before they impact services. The successful candidate will work closely with IT Operations, DevOps and Infrastructure teams to support a robust and efficient monitoring ecosystem. Experience with Event Management systems such as Moogsoft, PagerDuty AIOps or similar tools is required.

This position involves instrumenting and supporting the enterprise-level solutions the company uses to monitor systems and applications in both on-prem data centers and the AWS cloud environment. The successful candidate will be responsible for configuring dynamic alert correlations, alert mappings and rules, integrating with various monitoring and alerting tools, and ensuring that critical IT operations are proactively monitored and managed.

Responsibilities include, but are not limited to:

  • Regularly review existing alerting in place and make recommendations for improvements.
  • Use monitoring tools and the ITSM tool set to analyze detected monitoring gaps and major incident occurrences.
  • Improve and manage PagerDuty AIOps to capture and process events from various IT systems including Cloud infrastructure.
  • Develop alert correlation solutions utilizing network, server, application performance and log analytics alerts for faster root cause analysis.
  • Define and configure event rules, thresholds and correlation rules to aggregate alerts and prioritize critical events.
  • Define and configure global rules for standard processing of each alert stream.
  • Advise and assist IT teams to configure alert rules and notifications specific to their application or infrastructure alerts.
  • Optimize the alerting process to reduce noise and improve accuracy of alert correlation using AIOps and Machine Learning.
  • Integrate with ServiceNow Change and CMDB modules to enhance the alerting experience.
  • Advise and mentor coworkers on monitoring solutions and tool integrations.
  • Become familiar with Delta's Mission Critical and Mission Vital applications, including their functionality, purpose, and impact on and dependencies with other applications and systems.
  • Collaborate with cross-functional teams to implement Event Management correlation solutions that align with business needs.
  • Research and analyze related data (events, alerts, traps, incidents, logging) to identify trends and gaps, produce reports, and/or make recommendations.
  • Assist with the problem management process for incidents causing impact to the business.
  • Provide training for team members and other stakeholders on Event Management best practices.

Benefits and Perks to Help You Keep Climbing

Our culture is rooted in a shared dedication to living our values - Care, Integrity, Resilience and Servant Leadership - every day, in everything we do. At Delta, our people are our success. At the heart of what we offer is our focus on Sharing Success with Delta employees. Exploring a career at Delta gives you a chance to see the world while earning great compensation and benefits to help you keep climbing along the way:

  • Competitive salary, industry-leading profit sharing program, and performance incentives
  • 401(k) with generous company contributions up to 9%
  • New hires are eligible for up to 2 weeks of vacation. This is earned for use in the following vacation year (April 1 - March 31)
  • In addition to vacation, new hires are eligible for up to 56 hours of paid personal time within a 12-month period
  • 10 paid holidays per calendar year
  • Birthing parents are eligible for 12 weeks of paid maternity/parental leave
  • Non-birthing parents are eligible for 2 weeks of paid parental leave
  • Comprehensive health benefits including medical, dental, vision, short/long term disability and life insurance benefits
  • Family care assistance through fertility support, surrogacy and adoption assistance, lactation support, subsidized back-up care, and programs that help with loved ones in all stages
  • Holistic Wellbeing programs to support physical, emotional, social, and financial health, including access to an employee assistance program offering support for you and anyone in your household, free financial coaching, and extensive resources supporting mental health
  • Domestic and International space-available flight privileges for employees and eligible family members
  • Career development programs to achieve your long-term career goals
  • World-wide partnerships to engage in community service and innovative goals created to focus on sustainability and reducing our carbon footprint
  • Business Resource Groups created to connect employees with common interests to promote inclusion, provide perspective and help implement strategies
  • Recognition rewards and awards through the platform Unstoppable Together
  • Access to over 500 discounts, specialty savings and voluntary benefits through Deltaperks such as car and hotel rentals and auto, home, and pet insurance, legal services, and childcare

What you need to succeed (minimum qualifications)

  • A minimum of 5 years of experience in engineering monitoring solutions supporting the Event Management process.
  • Experience with monitoring tools: Dynatrace, Sumo Logic, Moogsoft, PagerDuty AIOps.
  • Working experience with ServiceNow and Event Management solutions.
  • Knowledge of relational databases and ability to write queries to support analysis and reporting functions.
  • 2 years of public cloud experience with AWS services, including AWS CloudWatch and Lambda.
  • Demonstrable troubleshooting, problem solving, and analytical skills.
  • Good communication and collaboration skills.
  • Reporting and dashboard skills using Grafana or Power BI to present telemetry data.

What will give you a competitive edge (preferred qualifications)

  • Bachelor's degree in computer science or a related field is preferred.