Run Engineer (Site Reliability Engineer) - AWS & AI Expertise - ADDSOURCE

Overview

Remote

Hybrid

Accepts corp to corp applications

Contract - Long Term

Skills

Python

Amazon Web Services

Operations

DevOps

Amazon Elastic Compute Cloud

Terraform

Metrics

APM

Scripting

Documentation

Performance Tuning

Kubernetes

AWS CloudFormation

ECS

Site Reliability Engineer

Prometheus

Grafana

HIPAA

Workflow

Governance

Distributed Systems

Change Control

Finance

Refining

Incident Response

Incident Management

AWS CloudWatch

AWS Certified

Clarify

IT Service Management

Operational Support

Reliability Engineering

Job Details

Position: Run Engineer (Site Reliability Engineer) AWS & AI Expertise
Location: United States Preferred (Remote)

Role Overview:
We are seeking an experienced Run Engineer (Site Reliability Engineer) with deep expertise in
AWS operations and familiarity with AI-powered platforms to lead operational readiness for
enterprise-scale solutions.
The role focuses on ensuring platforms are production-ready, resilient, and fully
supportable, covering infrastructure, application, security, and compliance operations. You will
define operational models, strengthen monitoring and observability, and ensure seamless
handoffs from build to run teams in alignment with the Launch Ready Plan (LRP).

Key Responsibilities:

Operational Readiness
Lead validation of production readiness for AWS and AI workloads.
Establish operational frameworks and ensure alignment with business SLAs and
IT standards.

Runbooks & Documentation
Develop and maintain runbooks, workflow diagrams, and operational support
models.
Define RACI matrices to clarify ownership across security, network,
infrastructure, application, and compliance teams.

Monitoring & Observability
Collaborate with architecture and DevOps teams to implement monitoring
solutions (CloudWatch, Prometheus, Grafana).
Ensure full visibility into system health, performance metrics, and AI inference
workloads.

Incident Response & Automation
Design scalable incident management, change control, and performance tuning
processes.
Implement automation scripts and tools for operational efficiency (Python, Shell,
Terraform).

Knowledge Transfer & Training
Conduct training sessions and handovers to operational teams.
Contribute to refining and executing the Launch Ready Plan (LRP).

AI Platform Support
Validate AI/ML platform readiness, including LLM hosting, inference pipelines,
and integration with monitoring/incident frameworks.

Required Qualifications
8+ years in Site Reliability Engineering, DevOps, or Infrastructure Operations roles.
Strong hands-on expertise with AWS services (EC2, Lambda, RDS, S3,
CloudFormation, CloudWatch, etc.).
Familiarity with AI/ML workloads and cloud-native AI service operations.
Proficiency in observability, incident management, and automation scripting (Python,
Shell).
Experience producing operational documentation for complex, distributed systems.
Solid understanding of ITSM practices, compliance frameworks (SOC 2, HIPAA), and
cloud governance.
Strong communication and collaboration skills across cross-functional teams.

Preferred Qualifications
AWS Certified DevOps Engineer or AWS Solutions Architect (Associate/Professional).
Experience operating in regulated industries (healthcare, finance, public sector).
Exposure to AI observability tools, model monitoring, and alerting frameworks.
Knowledge of service mesh, infrastructure-as-code (Terraform, CloudFormation), and
container orchestration (Kubernetes, ECS).
