Job Title: IT Developer / Senior Site Reliability Engineer (SRE – Agentic Operations)
Location: Dallas, TX or San Francisco, CA (Remote)
Work Schedule: Monday to Friday, 8:00 AM – 5:00 PM (PST)
Role Overview:
We are looking for a Senior Site Reliability Engineer with strong development expertise to help evolve a traditional production support model into an automation-first, AI-assisted reliability platform.
This role operates at a senior/staff level, focusing on reliability across distributed systems rather than a single application. You will work on modernizing operations following migration to Microsoft Azure, combining core SRE practices with agentic AI-driven automation.
Unlike conventional SRE roles, this position emphasizes building intelligent, multi-agent systems that enhance incident response, system reliability, and operational efficiency, while ensuring humans remain accountable for critical decisions.
How This Role Differs from Traditional SRE
- Begins with mastering manual SRE workflows
- Progressively introduces AI-assisted automation
- Moves toward semi-autonomous operational systems
- Maintains human-in-the-loop control for production-critical actions
Key Responsibilities
1. Production Reliability & Operations
- Design, manage, and optimize highly available production systems in Azure
- Participate in and lead on-call rotations and critical incident response
- Conduct deep root cause analysis and lead blameless post-incident reviews
- Define and maintain SLIs, SLOs, and observability frameworks
- Monitor systems using Dynatrace dashboards, alerts, and tracing
- Troubleshoot issues across:
  - Java-based services
  - Kubernetes clusters
  - Cloud infrastructure
- Collaborate with engineering, platform, and security teams to reduce risk and operational overhead
- Ensure adherence to security, compliance, and regulatory standards (e.g., HIPAA)
2. AI-Driven (Agentic) Operations
This is the core differentiator of the role.
Incident Intelligence & Signal Processing
- Build systems that ingest and correlate:
  - Logs, metrics, traces
  - Alerts and monitoring signals
  - Support tickets and escalation data
- Convert raw signals into structured, actionable incident insights
Automated Triage & Investigation
- Develop AI agents that:
  - Analyze telemetry and system changes
  - Identify probable failure points
  - Recommend next actions
- Implement parallel/multi-agent workflows for faster diagnosis
Remediation & Safe Automation
- Design automation for controlled actions such as:
  - Service restarts
  - Scaling operations
  - Rollbacks and feature toggles
- Ensure all production-impacting actions follow:
  - Predefined guardrails
  - Approval workflows (human-in-the-loop)
- Gradually evolve from advisory systems to semi-autonomous execution
Communication & Post-Incident Automation
- Build agents that:
  - Generate incident updates for stakeholders
  - Draft post-incident reports
  - Standardize communication across teams
- Ensure outputs are auditable, consistent, and production-ready
Technology Environment
Core Stack
- Cloud: Microsoft Azure
- Containers: Kubernetes, Docker
- Backend: Java-based services
- CI/CD: GitHub Actions
- Observability: Dynatrace
Automation & Scripting
AI & Automation Frameworks
- Microsoft Agent Framework
- Azure-hosted AI agents
- Multi-agent orchestration systems
- Human-in-the-loop safety models
Required Qualifications:
- 7+ years of experience in Site Reliability Engineering / Production Engineering
- Strong hands-on expertise in:
  - Azure cloud platform
  - Kubernetes & containerization
  - Java production environments
  - CI/CD pipelines (GitHub Actions)
  - Observability tools (Dynatrace preferred)
- Proven experience automating infrastructure and operations
- Deep understanding of:
  - SLIs, SLOs, and error budgets
  - Reliability engineering principles
Preferred Qualifications:
- Experience reducing on-call toil through automation
- Exposure to AI/ML-driven operational systems or agent-based workflows
- Knowledge of multi-agent architectures or distributed automation
- Strong judgment in:
  - Risk management
  - Safety boundaries
  - Human-in-the-loop systems
- Background in healthcare or regulated environments