DevOps & Reliability Engineering Lead || 100% Remote

Overview

Remote

Depends on Experience

Contract - Independent

Contract - W2

Skills

Amazon Web Services

Bash

Error Budgeting

Chaos Engineering

Capacity Management

Cloud Computing

Computer Networking

Computer Science

Continuous Delivery

Continuous Integration

Dashboard

DevOps

Docker

Documentation

Git

GitHub

Productivity

Python

Regulatory Compliance

Reliability Engineering

Scripting

Linux

Management

Mentorship

Capacity Modeling

Operational Excellence

Optimization

Grafana

High Availability

Incident Management

Jenkins

Kubernetes

Leadership

Terraform

Workflow

Service Level Management

Ansible

Job Details

Job Title: DevOps & Reliability Engineering Lead (IT Application Solutions Architect Senior)

100% Remote

Job Overview

Reliability Engineering: Define and maintain service-level objectives (SLOs), implement error budgeting, and lead incident response and postmortem analysis.

Infrastructure Automation: Use Terraform, Ansible, and other IaC tools to create secure, scalable, and repeatable environments.

CI/CD Optimization: Architect secure and efficient pipelines (e.g., GitHub Actions, Jenkins), incorporating automated rollback, canary/blue-green deploys, and artifact validation.

Observability: Build dashboards, alerts, synthetic checks, and telemetry pipelines that ensure visibility into system performance, availability, and cost.

Security & Compliance: Integrate security tooling (SAST, DAST, SBOM, secrets scanning) and enforce policy-as-code in deployment workflows.

Cost & Capacity Planning: Implement tooling and practices to monitor cloud cost trends, right-size infrastructure, and ensure high availability at optimal spend.

Internal Enablement: Develop reusable internal tools, shared playbooks, and self-service platforms that boost developer productivity and ensure consistent delivery.

Mentorship & Leadership: Serve as a technical mentor across platform, security, and engineering teams. Establish best practices in operational readiness, fault tolerance, and secure delivery.

Required Qualifications

Bachelor's degree in Computer Science, Engineering, or a related technical discipline.

At least 5 years of experience in DevOps, SRE, or Platform Engineering roles with leadership experience in automation and infrastructure reliability.

3+ years of hands-on experience in high-availability production environments with cloud-native security and observability tooling.

Deep expertise in AWS (or equivalent cloud platform), especially in compute, networking, IAM, and monitoring.

Proficiency in Terraform, CloudFormation, Kubernetes, Docker, and Linux systems.

Strong knowledge of observability stacks (Prometheus, Grafana, ELK, Datadog, CloudWatch).

Experience implementing and managing CI/CD systems with security tollgates and rollback logic.

Strong scripting skills in Python, Go, or Bash for automation and tooling.

In-depth understanding of SRE practices including incident response, SLO/SLA management, chaos engineering, and capacity modeling.

Familiarity with Git and GitOps patterns.

Proven track record of creating shared tooling and documentation that promotes operational excellence.

About Isoftech Inc
