Site Reliability Engineer

Overview

Remote

Depends on Experience

Full Time

Skills

Site Reliability

Datadog

Dynatrace

Kubernetes

New Relic

Job Details

Role: Site Reliability Engineer (SRE)

Job Description:

Mandatory Skills: SRE, Datadog OR Dynatrace OR SLO (Service Level Objective) OR SLI (Service Level Indicator) OR Golden Thresholds OR Resiliency OR Reliability OR New Relic OR Kubernetes

Embrace SRE Position Overview:

We are seeking an SRE with strong expertise in Datadog and a proven record of managing production incidents. The role ensures the reliability, scalability, and performance of our systems while driving improvements in observability, alerting, and incident response processes.

Key Responsibilities:

Monitoring and Observability:

Design, implement, and maintain robust observability using Datadog.
Develop and optimize dashboards, telemetry, and alerts that provide actionable insights into system performance and health.
Continuously improve observability across on-premises and cloud infrastructure.

Incident Management:

Lead production incident response efforts, ensuring rapid identification, root cause analysis and resolution.
Develop and maintain incident management processes, including runbooks and post-incident reviews.
Conduct root cause analysis and implement corrective actions to prevent recurrence.
Collaborate with cross-functional teams to minimize downtime and improve system reliability.

Reliability Engineering:

Automate operational tasks to reduce manual intervention and improve system efficiency.
Implement best practices for high availability, fault tolerance, and disaster recovery.
Define and manage SLOs / SLIs.

Collaboration and Communication:

Communicate effectively with stakeholders during incidents, providing clear updates and timelines.
Work closely with cross-functional teams to ensure seamless integration of observability and performance solutions.
Advocate for a culture of reliability and continuous improvement across the organization.

Required Skills and Qualifications:

Technical Skills:

Strong hands-on experience with Datadog for monitoring, alerting, and observability.
Proficiency in managing production systems in on-premises and Azure cloud environments.
Solid understanding of Linux/Unix systems.
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
Proven track record of leading production incident response and resolution.
Ability to perform root cause analysis and implement corrective actions.
Familiarity with Site Reliability Engineering best practices.
Proficiency in scripting and programming languages (e.g., Python, Go, Shell scripting).
Experience with infrastructure-as-code tools (Terraform).
Strong knowledge of and experience in managing distributed systems and microservices architectures.

Soft Skills:

Strong analytical and problem-solving skills.
Excellent communication and collaboration abilities.
Ability to work effectively in a fast-paced, dynamic environment.
