Senior Monitoring and Logging Specialist

Overview

Remote
On Site
USD85 - USD96
Contract - W2

Job Details

job summary:

We're looking for a Senior Monitoring & Observability Specialist to join our IT team. This role is all about building and managing cutting-edge monitoring solutions for our on-premises infrastructure, network devices, applications, and cloud services.


You'll lead the charge in implementing and optimizing our monitoring systems, creating real-time dashboards, and setting up automated alerts that not only notify but also provide rich context for swift incident resolution. A key focus will be on proactive monitoring and predictive analytics to spot potential issues before they impact services, significantly boosting our system reliability and reducing downtime.


While you'll need to understand how logging platforms support monitoring, your primary focus will be the "what's happening now and what's about to happen" side of things. You'll also play a crucial role in training other teams on monitoring tools, ensuring everyone can use our observability insights effectively. If you're a proactive leader passionate about keeping systems healthy and performing optimally, we want to hear from you.




location: Atlanta, Georgia

job type: Contract

salary: $85 - $96 per hour

work hours: 8am to 5pm

education: Bachelor's



responsibilities:

As a Senior Monitoring & Observability Specialist, you'll be at the forefront of ensuring our systems run smoothly and reliably. Your main responsibilities will revolve around:



  • Leading Monitoring Implementation: You'll design, deploy, and manage advanced monitoring systems (like Prometheus, Datadog, or Zabbix) to get real-time insights into our infrastructure, applications, and networks, creating clear dashboards and immediate alerts.
  • Enhancing with Logging: While monitoring is key, you'll also optimize our logging platforms (like Splunk or ELK Stack) to ensure they provide the granular data needed to troubleshoot effectively when an alert goes off.
  • Automating Alerts & Incident Response: A big part of your role will be setting up smart, automated alerts that integrate with our incident management tools, aiming to predict issues and get relevant teams the information they need fast.
  • Driving Proactive Insights: You'll implement cutting-edge proactive monitoring techniques and use predictive analytics to identify potential problems before they impact services, helping us reduce downtime significantly.
  • Fostering Collaboration: You'll be crucial in making sure all teams, from infrastructure to applications, can easily access and use monitoring data during incidents, shortening resolution times.
  • Training and Empowerment: A key objective is to train our teams on how to effectively use monitoring tools, helping them understand dashboards, set up alerts, and troubleshoot issues themselves, reducing reliance on specialists.
  • Optimizing System Performance: You'll regularly review and fine-tune our monitoring systems themselves to ensure they're efficient and don't contribute to any performance bottlenecks.
  • Integrating Cloud Monitoring: You'll seamlessly integrate monitoring for our cloud services (AWS, Azure) with our existing tools, providing a unified view of our entire environment.
  • Documenting Everything: Maintaining detailed documentation for all monitoring configurations, alerting processes, and system architectures will be vital for smooth operations and onboarding new team members.


qualifications:

To succeed in this Senior Monitoring & Observability Specialist role, you'll need a strong blend of technical expertise and practical experience. We're looking for candidates with:



  • Educational Background: A Bachelor's degree in Computer Science, IT, or a related field.
  • Extensive Experience: At least 7 years in IT operations, including 4+ years at a senior level dedicated to advanced monitoring, observability, and proactive incident detection.
  • Monitoring Mastery: Proven hands-on experience implementing and managing enterprise-grade monitoring solutions like Prometheus, Datadog, Dynatrace, Zabbix, or Nagios.
  • Logging Acumen: A solid understanding of how logging platforms (e.g., Splunk, ELK Stack) are used to support and enhance monitoring and troubleshooting efforts.
  • Technical Breadth: Strong knowledge of network protocols, operating systems (Linux, Windows), and crucial experience with cloud monitoring methodologies across platforms like AWS, Azure, or Google Cloud Platform.
  • Automation Skills: Proficiency in scripting languages such as Python, Bash, or PowerShell for automating tasks and integrating monitoring tools.
  • Operational Mindset: Familiarity with ITIL principles and incident management, with a clear focus on alert-driven processes.
  • Soft Skills: Excellent analytical, problem-solving, and communication skills, especially in clearly conveying system health and insights.
  • Work Ethic: The ability to work both independently and collaboratively in a dynamic, fast-paced environment.


Bonus points for: A Master's degree, certifications in monitoring tools or cloud platforms, experience with Infrastructure as Code (e.g., Terraform, Ansible) for deploying monitoring, or familiarity with AIOps concepts.




skills: As a Senior Monitoring & Observability Specialist, you'll need a robust set of skills to excel in this role. Here's what we're looking for:



  • Expertise in Monitoring Tools: You'll be highly proficient with leading monitoring solutions such as Prometheus, Datadog, Dynatrace, Zabbix, and Nagios, demonstrating a deep understanding of their implementation, configuration, and management.
  • Observability Acumen: You'll know how to leverage metrics, traces, and logs to gain comprehensive visibility into system performance and health, with a strong emphasis on proactive insights.
  • Logging Proficiency: While the role is monitoring-centric, you'll have a solid grasp of logging platforms like Splunk or the ELK Stack, understanding how they integrate with and enhance monitoring efforts for troubleshooting.
  • Cloud Monitoring: You'll be skilled in monitoring cloud environments (AWS, Azure, Google Cloud Platform), capable of integrating cloud-native tools with existing on-premises solutions for a unified view.
  • Automation & Scripting: Strong scripting abilities in languages like Python, Bash, or PowerShell are crucial for automating monitoring tasks, alerts, and data collection.
  • Incident Management: You'll have experience with incident management frameworks (e.g., ITIL) and be adept at setting up automated alerting rules that drive efficient incident response.
  • Problem-Solving & Analytics: You'll possess excellent analytical skills, enabling you to interpret complex monitoring data, identify trends, and troubleshoot performance bottlenecks effectively.
  • Communication & Collaboration: Strong communication skills are essential for documenting processes, training teams, and facilitating effective collaboration during incidents.
  • System Optimization: You'll be skilled in optimizing monitoring systems themselves, ensuring they operate efficiently without consuming excessive resources.




Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact

Pay offered to a successful candidate will be based on several factors, including the candidate's education, work experience, work location, specific job duties, and certifications. In addition, Randstad Digital offers a comprehensive benefits package, including medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401(k) plan (all benefits are based on eligibility).

This posting is open for thirty (30) days.


It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.


