SRE+Dynatrace - Guadalajara, MX

Remote • Posted 3 days ago • Updated 3 days ago
Full Time
$60,000 - $80,000/yr

Job Details

Skills

  • SRE
  • Dynatrace

Summary

Site Reliability Engineers are responsible for ensuring the availability, reliability, scalability, and performance of the firm's most critical, customer-facing microservices that power all eCommerce channels. This role applies Google-inspired SRE principles to balance feature velocity and system reliability using Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

The role combines software engineering, cloud engineering, automation, and production operations, with a strong emphasis on building systems that are observable, resilient, and operable by default.

Primary Responsibilities:

Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.

Use error budgets to influence release decisions, prioritize reliability work, and manage operational risk.

Design and maintain observability platforms including metrics, logs, traces, and real-time telemetry.

Track, manage, and reduce operational toil by converting repetitive operational work into Jira stories and epics with clear ownership and measurable outcomes.

Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.

Lead incident response, act as an escalation point for high-severity incidents, and drive blameless postmortems.

Capture incident action items and reliability improvements in Jira, ensuring closure, accountability, and continuous improvement.

Partner with scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.

Perform deep root cause analysis, debugging, and performance tuning across distributed systems.

Promote shift-left reliability by embedding operability, monitoring, and failure testing early in the SDLC.

Drive continuous improvement through automation, self-healing systems, chaos engineering, and capacity planning.

Maintain runbooks, playbooks, and knowledge repositories, linking documentation to Jira tasks to reduce MTTR.

Provide technical leadership and mentoring to junior SREs and engineers.

Collaborate with global, distributed teams, leveraging Jira for transparent planning, dependency tracking, and execution.

Core Competencies & Accomplishments:

4+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.

Hands-on experience with Java/J2EE-based distributed systems. React experience is a plus.

Proven ability to design and operate systems using SLO-driven reliability models.

Experience defining and measuring SLIs (availability, latency, error rates, throughput, saturation).

Good understanding of NoSQL technologies and RDBMS, including the ability to write queries to retrieve data from a database.

Experience deploying and operating services on cloud platforms (AWS, Azure, or Google Cloud).

Expertise with observability, APM, and caching tools (Dynatrace, Splunk, ELK, Akamai, Quantum Metric/Tealeaf, etc.).

Strong experience using Jira for backlog management, incident follow-ups, toil reduction tracking, and cross-team coordination.

Ability to independently own services and drive reliability initiatives end-to-end.

Strong communication skills and ability to influence engineering and product teams.

Experience participating in an on-call rotation and handling critical and high-severity incidents.

Desired Skills:

Experience building and operating microservices architectures using Spring Boot, Groovy, React, or similar.

Strong understanding of CI/CD pipelines, release automation, and progressive delivery.

Experience with eCommerce domains such as Catalog, Customer Data, and Order Management.

Familiarity with search platforms (Endeca, Solr, Lucene, Elasticsearch).

Proficiency in scripting and automation (Python, Bash, Ruby, Perl, PowerShell).

Experience with ITSM tools integrated with Jira workflows.

Exposure to capacity planning, load testing, and chaos engineering.

  • Dice Id: 10462843
  • Position Id: 8915930