Run Engineer (Site Reliability Engineer) - AWS & AI Expertise

  • Posted 2 hours ago | Updated 1 hour ago

Overview

Remote
Hybrid
Accepts corp to corp applications
Contract - Long Term

Skills

Python
Amazon Web Services
Operations
DevOps
Amazon Elastic Compute Cloud
Terraform
Metrics
APM
Scripting
Documentation
Performance Tuning
Kubernetes
AWS CloudFormation
ECS
Site Reliability Engineer
Prometheus
Grafana
HIPAA
Workflow
Governance
Distributed Systems
Change Control
Finance
Refining
Incident Response
Incident Management
AWS CloudWatch
AWS Certified
Clarify
IT Service Management
Operational Support
Reliability Engineering

Job Details

Position: Run Engineer (Site Reliability Engineer) - AWS & AI Expertise
Location: United States preferred (Remote)

Role Overview:
We are seeking an experienced Run Engineer (Site Reliability Engineer) with deep expertise in
AWS operations and familiarity with AI-powered platforms to lead operational readiness for
enterprise-scale solutions.
The role focuses on ensuring platforms are production-ready, resilient, and fully
supportable, covering infrastructure, application, security, and compliance operations. You will
define operational models, strengthen monitoring and observability, and ensure seamless
handoffs from build to run teams in alignment with the Launch Ready Plan (LRP).

Key Responsibilities:

Operational Readiness
Lead validation of production readiness for AWS and AI workloads.
Establish operational frameworks and ensure alignment with business SLAs and IT standards.

Runbooks & Documentation
Develop and maintain runbooks, workflow diagrams, and operational support models.
Define RACI matrices to clarify ownership across security, network, infrastructure, application, and compliance teams.

Monitoring & Observability
Collaborate with architecture and DevOps teams to implement monitoring solutions (CloudWatch, Prometheus, Grafana).
Ensure full visibility into system health, performance metrics, and AI inference workloads.

Incident Response & Automation
Design scalable incident management, change control, and performance tuning processes.
Implement automation scripts and tools for operational efficiency (Python, Shell, Terraform).

Knowledge Transfer & Training
Conduct training sessions and handovers to operational teams.
Contribute to refining and executing the Launch Ready Plan (LRP).

AI Platform Support
Validate AI/ML platform readiness, including LLM hosting, inference pipelines, and integration with monitoring/incident frameworks.

Required Qualifications:
8+ years in Site Reliability Engineering, DevOps, or Infrastructure Operations roles.
Strong hands-on expertise with AWS services (EC2, Lambda, RDS, S3, CloudFormation, CloudWatch, etc.).
Familiarity with AI/ML workloads and cloud-native AI service operations.
Proficiency in observability, incident management, and automation scripting (Python, Shell).
Experience producing operational documentation for complex, distributed systems.
Solid understanding of ITSM practices, compliance frameworks (SOC 2, HIPAA), and cloud governance.
Strong communication and collaboration skills across cross-functional teams.

Preferred Qualifications:
AWS Certified DevOps Engineer or AWS Certified Solutions Architect (Associate/Professional) certification.
Experience operating in regulated industries (healthcare, finance, public sector).
Exposure to AI observability tools, model monitoring, and alerting frameworks.
Knowledge of service mesh, infrastructure-as-code (Terraform, CloudFormation), and container orchestration (Kubernetes, ECS).