Job Details
Job type: Operations Monitoring Engineer
Location: On-site 4 days a week in Fort Worth, TX
Must haves:
- Experience with Dynatrace (preferred), CloudWatch, Datadog, or New Relic
- Change management and incident management experience would be great
Job Description:
The Operations Support Engineer will monitor and support our systems and manage alerting to ensure seamless operations. Ideal candidates will have 3-5 years of experience with Dynatrace, CloudWatch, or similar tools, and a solid understanding of cloud architecture and DevOps principles.
Key Responsibilities:
System Monitoring and Optimization: Monitor systems for faults, identify optimization opportunities, and implement tools and process changes to improve monitoring and alerting.
Incident Response and Root Cause Analysis: Work with major incident response teams on escalations and provide monitoring during major incidents.
Qualifications:
Self-Motivated: Ability to define, develop, and execute plans; manage system outages; and handle high-stress situations.
Availability: Able to work in a 24/7 environment and provide on-call support.
Experience: Proven experience interacting with people at all levels of an organization.
Technical Skills:
Bachelor's degree in Computer Science, Information Systems, or Engineering preferred.
Technical certifications or 5+ years of experience in event monitoring and alerting.
Experience with monitoring tools (Dynatrace, CloudWatch, Zabbix, SCOM).
Strong writing skills for documentation.
Proficient in distributed systems/administration (Windows, Unix, Linux, VMware, etc.).
Knowledge of ITIL best practices (certification is a plus).
Familiarity with the software development lifecycle (SDLC).
Experience in SLA/KPI-driven environments.
ServiceNow proficiency.
General scripting/programming skills (Python, Node.js, Ruby, Perl, Bash/sh).
Preferred Qualifications:
Cloud certifications (AWS, Azure, etc.).
Experience with infrastructure as code tools (Terraform, Ansible, etc.).
ITIL V3 or V4 certification.
Advanced technical skills in various operating systems and environments.
Proven ability to improve monitoring and alerting processes.