IT Developer/Senior Site Reliability Engineer

Remote • Posted 3 hours ago • Updated 3 hours ago

Contract W2

Contract Independent

No Travel Required

Remote

Depends on Experience

Job Details

Skills

Site Reliability Engineer
Agentic Operations
Artificial Intelligence
AI
Machine Learning (ML)
Kubernetes
Microsoft
Microsoft Azure
Azure
Java production systems
Java
CI/CD pipelines
GitHub Actions
GitHub
CI/CD
SRE principles
SLIs
SLOs
error budgets
Docker
AI agents
Python
Bash
Ansible
Observability platforms
Dynatrace
Observability
triage
comms
PIR agents

Summary

Job Title: IT Developer / Senior Site Reliability Engineer (SRE – Agentic Operations)

Location: Dallas, TX OR San Francisco, CA (Remote)

Work Schedule: Monday to Friday, 08:00 AM – 05:00 PM (PST)

Role Overview:

We are looking for a Senior Site Reliability Engineer with strong development expertise to help evolve a traditional production support model into an automation-first, AI-assisted reliability platform.

This role operates at a senior/staff level, focusing on reliability across distributed systems rather than a single application. You will work on modernizing operations following migration to Microsoft Azure, combining core SRE practices with agentic AI-driven automation.

Unlike conventional SRE roles, this position emphasizes building intelligent, multi-agent systems that enhance incident response, system reliability, and operational efficiency, while ensuring that humans remain accountable for critical decisions.

How This Role Differs from Traditional SRE

Begins with mastering manual SRE workflows
Progressively introduces AI-assisted automation
Moves toward semi-autonomous operational systems
Maintains human-in-the-loop control for production-critical actions

Key Responsibilities

1. Production Reliability & Operations

Design, manage, and optimize highly available production systems in Azure
Participate in and lead on-call rotations and critical incident response
Conduct deep root cause analysis and lead blameless post-incident reviews
Define and maintain SLIs, SLOs, and observability frameworks
Monitor systems using tools such as Dynatrace (dashboards, alerts, and tracing)
Troubleshoot issues across:
- Java-based services
- Kubernetes clusters
- Cloud infrastructure
Collaborate with engineering, platform, and security teams to reduce risk and operational overhead
Ensure adherence to security, compliance, and regulatory standards (e.g., HIPAA)

2. AI-Driven (Agentic) Operations

This is the core differentiator of the role.

Incident Intelligence & Signal Processing

Build systems that ingest and correlate:
- Logs, metrics, traces
- Alerts and monitoring signals
- Support tickets and escalation data
Convert raw signals into structured, actionable incident insights

Automated Triage & Investigation

Develop AI agents that:
- Analyze telemetry and system changes
- Identify probable failure points
- Recommend next actions
Implement parallel/multi-agent workflows for faster diagnosis

Remediation & Safe Automation

Design automation for controlled actions such as:
- Service restarts
- Scaling operations
- Rollbacks and feature toggles
Ensure all production-impacting actions follow:
- Predefined guardrails
- Approval workflows (human-in-the-loop)
Gradually evolve from advisory systems to semi-autonomous execution
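
As a rough illustration of the guardrail and approval pattern described above, the following Python sketch gates a proposed remediation behind an allowlist and a human approval callback. It is a minimal sketch under stated assumptions, not the employer's implementation: the action names, the `ProposedAction` type, the `decide` function, and the approver callback are all hypothetical and do not come from any specific framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical allowlist of remediation actions (illustrative names only).
ALLOWED_ACTIONS = {"restart_service", "scale_out", "rollback", "toggle_feature"}


@dataclass(frozen=True)
class ProposedAction:
    """An action suggested by an automated agent (hypothetical structure)."""
    name: str
    target: str
    production_impacting: bool


def decide(action: ProposedAction,
           approver: Callable[[ProposedAction], bool]) -> str:
    """Apply a predefined guardrail, then a human approval step.

    Returns "blocked" if the action is not on the allowlist, "denied" if a
    production-impacting action is not approved, and "approved" otherwise.
    """
    # Guardrail: anything outside the allowlist never reaches a human.
    if action.name not in ALLOWED_ACTIONS:
        return "blocked"
    # Human-in-the-loop: production-impacting actions need explicit sign-off.
    if action.production_impacting and not approver(action):
        return "denied"
    return "approved"
```

In this sketch the approver is just a callable, so it could stand in for a chat prompt, a ticket approval, or any other sign-off step. An advisory-only system would stop at the decision; a semi-autonomous one would execute only actions that come back as "approved".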

< data-start="3334" data-end="3383">Communication & Post-Incident Automation

Build agents that:
- Generate incident updates for stakeholders
- Draft post-incident reports
- Standardize communication across teams
Ensure outputs are auditable, consistent, and production-ready

Technology Environment

Core Stack

Cloud: Microsoft Azure
Containers: Kubernetes, Docker
Backend: Java-based services
CI/CD: GitHub Actions
Observability: Dynatrace

Automation & Scripting

Python, Bash, Ansible

AI & Automation Frameworks

Microsoft Agent Framework
Azure-hosted AI agents
Multi-agent orchestration systems
Human-in-the-loop safety models

Required Qualifications:

7+ years of experience in Site Reliability Engineering / Production Engineering
Strong hands-on expertise in:
- Azure cloud platform
- Kubernetes & containerization
- Java production environments
- CI/CD pipelines (GitHub Actions)
- Observability tools (Dynatrace preferred)
Proven experience in automation of infrastructure and operations
Deep understanding of:
- SLIs, SLOs, and error budgets
- Reliability engineering principles

Preferred Qualifications:

Experience reducing on-call toil through automation
Exposure to AI/ML-driven operational systems or agent-based workflows
Knowledge of multi-agent architectures or distributed automation
Strong judgment in:
- Risk management
- Safety boundaries
- Human-in-the-loop systems
Background in healthcare or regulated environments

Dice Id: 90999382
Position Id: 8963267