Observability & Monitoring Engineer

Overview

Full Time

Skills

Perl
Best Practices
Remediation
Problem-Solving
DevOps
Continuous Integration/Delivery
Change Management
JavaScript
Operations
Amazon Web Services
GCP
Metrics
Jenkins
Puppet
Kubernetes
Terraform
Forecasting
Capacity Planning
Switch Capacity
SolarWinds
APM
Application Performance
IT Infrastructure
Logging Tools
Incident Management
MTTR

Job Details

Job Title: Observability & Monitoring Engineer with SolarWinds and Dynatrace Experience
Location: Rancho Cucamonga, CA (onsite, 5 days a week)
Duration: Long Term Project

Duties and Responsibilities:
The Monitoring and Observability Engineer will be responsible for designing, configuring, implementing, monitoring, and maintaining our observability solutions, and for troubleshooting IT systems and applications to ensure optimal performance and reliability. You will work closely with cross-functional teams to identify potential issues and provide insights that improve system performance, stability, and availability. The engineer will also automate alerting and remediation processes to reduce mean time to resolution (MTTR) and improve system uptime.

Mandatory Skills:

  • 3+ years of experience working in the observability, operations, or DevOps domains.
  • Proficiency in observability, monitoring, and logging tools such as Dynatrace and SolarWinds.
  • Hands-on experience installing, setting up, and configuring monitoring tools such as Dynatrace and SolarWinds.

The responsibilities of Integrated Operations, Engineer II include the following:

  • Configure and maintain monitoring and observability tools and systems, specifically SolarWinds and Dynatrace.
  • Monitor server, network infrastructure, and application performance metrics, and identify patterns and trends to improve system performance and reliability.
  • Troubleshoot issues and outages, working closely with development and operations teams to identify root causes and develop solutions.
  • Automate alerting and remediation processes to reduce mean time to resolution (MTTR) and improve system uptime.
  • Conduct capacity planning and forecasting to ensure scalability and optimal performance of IT systems and applications.
  • Collaborate with cross-functional teams to support incident management, change management, and problem management processes.

Skills Required:

  • Deep understanding of IT infrastructure monitoring and observability best practices.
  • Strong analytical skills, with the ability to analyze large amounts of data and identify patterns and trends.
  • Strong troubleshooting and problem-solving skills, with the ability to quickly diagnose and resolve complex issues.
  • Programming skills in languages such as Perl, Shell, or JavaScript.
  • Experience with automation tools such as Ansible, Puppet, or Terraform.
  • Experience with container orchestration tools like Kubernetes.
  • Experience with cloud platforms such as AWS, Google Cloud Platform, or Azure.
  • Experience with CI/CD tools like Jenkins.