IT Developer/Senior Site Reliability Engineer

Remote • Posted 3 hours ago • Updated 3 hours ago
Contract W2
Contract Independent
No Travel Required
Remote
Depends on Experience

Job Details

Skills

  • Site Reliability Engineer
  • Agentic Operations
  • Artificial Intelligence
  • AI
  • Machine Learning (ML)
  • Kubernetes
  • Microsoft
  • Microsoft Azure
  • Azure
  • Java production systems
  • Java
  • CI/CD pipelines
  • GitHub Actions
  • GitHub
  • CI/CD
  • SRE principles
  • SLIs
  • SLOs
  • error budgets
  • Docker
  • AI agents
  • Python
  • Bash
  • Ansible
  • Observability platforms
  • Dynatrace
  • Observability
  • triage
  • comms
  • PIR agents

Summary

Job Title: IT Developer / Senior Site Reliability Engineer (SRE – Agentic Operations)

Location: Dallas, TX OR San Francisco, CA (Remote)

Work Schedule: Monday to Friday, 08:00 AM – 05:00 PM (PST)


Role Overview:

We are looking for a Senior Site Reliability Engineer with strong development expertise to help evolve a traditional production support model into an automation-first, AI-assisted reliability platform.

This role operates at a senior/staff level, focusing on reliability across distributed systems rather than a single application. You will work on modernizing operations following migration to Microsoft Azure, combining core SRE practices with agentic AI-driven automation.

Unlike conventional SRE roles, this position emphasizes building intelligent, multi-agent systems that enhance incident response, system reliability, and operational efficiency, while ensuring humans remain accountable for critical decisions.


How This Role Differs from Traditional SRE

  • Begins with mastering manual SRE workflows
  • Progressively introduces AI-assisted automation
  • Moves toward semi-autonomous operational systems
  • Maintains human-in-the-loop control for production-critical actions

Key Responsibilities

1. Production Reliability & Operations

  • Design, manage, and optimize highly available production systems in Azure
  • Participate in and lead on-call rotations and critical incident response
  • Conduct deep root cause analysis and lead blameless post-incident reviews
  • Define and maintain SLIs, SLOs, and observability frameworks
  • Monitor systems using tools such as Dynatrace (dashboards, alerts, and tracing)
  • Troubleshoot issues across:
    • Java-based services
    • Kubernetes clusters
    • Cloud infrastructure
  • Collaborate with engineering, platform, and security teams to reduce risk and operational overhead
  • Ensure adherence to security, compliance, and regulatory standards (e.g., HIPAA)

2. AI-Driven (Agentic) Operations

This is the core differentiator of the role.

Incident Intelligence & Signal Processing

  • Build systems that ingest and correlate:
    • Logs, metrics, traces
    • Alerts and monitoring signals
    • Support tickets and escalation data
  • Convert raw signals into structured, actionable incident insights
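
As a rough illustration of the signal correlation described above, the sketch below groups signals by service and time proximity into incident candidates. It is a minimal example under stated assumptions, not a description of the team's actual system: the `Signal` and `IncidentCandidate` types, the `correlate` function, and the five-minute window are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Signal:
    source: str          # hypothetical labels, e.g. "alert", "log", "ticket"
    service: str
    timestamp: datetime
    message: str


@dataclass
class IncidentCandidate:
    service: str
    started: datetime
    signals: list = field(default_factory=list)


def correlate(signals, window=timedelta(minutes=5)):
    """Group signals per service. A new candidate starts when the gap since
    the previous signal for that service exceeds `window` (an assumed value)."""
    candidates = []
    latest = {}  # service name -> most recent candidate for that service
    for sig in sorted(signals, key=lambda s: s.timestamp):
        current = latest.get(sig.service)
        if current is None or sig.timestamp - current.signals[-1].timestamp > window:
            current = IncidentCandidate(service=sig.service, started=sig.timestamp)
            candidates.append(current)
            latest[sig.service] = current
        current.signals.append(sig)
    return candidates
```

A real pipeline would likely correlate on more than service name and time, for example deployment events or trace identifiers, but the same structure applies: raw signals in, structured incident candidates out.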

Automated Triage & Investigation

  • Develop AI agents that:
    • Analyze telemetry and system changes
    • Identify probable failure points
    • Recommend next actions
  • Implement parallel/multi-agent workflows for faster diagnosis

Remediation & Safe Automation

  • Design automation for controlled actions such as:
    • Service restarts
    • Scaling operations
    • Rollbacks and feature toggles
  • Ensure all production-impacting actions follow:
    • Predefined guardrails
    • Approval workflows (human-in-the-loop)
  • Gradually evolve from advisory systems to semi-autonomous execution
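
The guardrail and approval requirements above can be sketched as a small gate in front of any automated action. This is an illustrative example only: the allowlist contents, the `ProposedAction` type, and the `approver` and `runner` callables are assumptions made for the sketch, not part of any named framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical allowlist of controlled actions (the predefined guardrail).
ALLOWED_ACTIONS = {"restart_service", "scale_out", "rollback", "toggle_feature"}


@dataclass
class ProposedAction:
    name: str
    target: str
    production_impacting: bool
    rationale: str


def execute_with_guardrails(
    action: ProposedAction,
    approver: Callable[[ProposedAction], bool],
    runner: Callable[[ProposedAction], None],
    audit_log: list,
) -> str:
    """Run `action` only if it is allowlisted and, when it affects production,
    a human approver has accepted it. Every outcome is written to `audit_log`."""
    if action.name not in ALLOWED_ACTIONS:
        audit_log.append(("rejected", action.name, action.target))
        return "rejected"
    if action.production_impacting and not approver(action):
        audit_log.append(("denied", action.name, action.target))
        return "denied"
    runner(action)
    audit_log.append(("executed", action.name, action.target))
    return "executed"
```

In an advisory phase, `runner` could simply record the recommendation. Moving toward semi-autonomous execution would then mean swapping in a real runner for low-risk actions while keeping the approval step for production-impacting ones.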

Communication & Post-Incident Automation

  • Build agents that:
    • Generate incident updates for stakeholders
    • Draft post-incident reports
    • Standardize communication across teams
  • Ensure outputs are auditable, consistent, and production-ready

Technology Environment

Core Stack

  • Cloud: Microsoft Azure
  • Containers: Kubernetes, Docker
  • Backend: Java-based services
  • CI/CD: GitHub Actions
  • Observability: Dynatrace

Automation & Scripting

  • Python, Bash, Ansible

AI & Automation Frameworks

  • Microsoft Agent Framework
  • Azure-hosted AI agents
  • Multi-agent orchestration systems
  • Human-in-the-loop safety models

Required Qualifications:

  • 7+ years of experience in Site Reliability Engineering / Production Engineering
  • Strong hands-on expertise in:
    • Azure cloud platform
    • Kubernetes & containerization
    • Java production environments
    • CI/CD pipelines (GitHub Actions)
    • Observability tools (Dynatrace preferred)
  • Proven experience in automation of infrastructure and operations
  • Deep understanding of:
    • SLIs, SLOs, and error budgets
    • Reliability engineering principles

Preferred Qualifications:

  • Experience reducing on-call toil through automation
  • Exposure to AI/ML-driven operational systems or agent-based workflows
  • Knowledge of multi-agent architectures or distributed automation
  • Strong judgment in:
    • Risk management
    • Safety boundaries
    • Human-in-the-loop systems
  • Background in healthcare or regulated environments

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

  • Dice Id: 90999382
  • Position Id: 8963267