DevOps & Reliability Engineering Lead || 100% Remote

Overview

Remote
Depends on Experience
Contract - Independent
Contract - W2

Skills

Amazon Web Services
Bash
Error Budgeting
Chaos Engineering
Capacity Management
Cloud Computing
Computer Networking
Computer Science
Continuous Delivery
Continuous Integration
Dashboard
DevOps
Docker
Documentation
Git
GitHub
Productivity
Python
Regulatory Compliance
Reliability Engineering
Scripting
Linux
Management
Mentorship
Modeling
Operational Excellence
Optimization
Grafana
High Availability
Incident Management
Jenkins
Kubernetes
Leadership
Terraform
Observability
Workflow
Service Level Management
Service Level
Ansible

Job Details

Job Title: DevOps & Reliability Engineering Lead (IT Application Solutions Architect Senior)

100% Remote


Job Overview

Reliability Engineering: Define and maintain service-level objectives (SLOs), implement error budgeting, and lead incident response and postmortem analysis.

Infrastructure Automation: Use Terraform, Ansible, and other IaC tools to create secure, scalable, and repeatable environments.

CI/CD Optimization: Architect secure and efficient pipelines (e.g., GitHub Actions, Jenkins), incorporating automated rollback, canary/blue-green deploys, and artifact validation.

Observability: Build dashboards, alerts, synthetic checks, and telemetry pipelines that ensure visibility into system performance, availability, and cost.

Security & Compliance: Integrate security tooling (SAST, DAST, SBOM, secrets scanning) and enforce policy-as-code in deployment workflows.

Cost & Capacity Planning: Implement tooling and practices to monitor cloud cost trends, right-size infrastructure, and ensure high availability at optimal spend.

Internal Enablement: Develop reusable internal tools, shared playbooks, and self-service platforms that boost developer productivity and ensure consistent delivery.

Mentorship & Leadership: Serve as a technical mentor across platform, security, and engineering teams. Establish best practices in operational readiness, fault tolerance, and secure delivery.

Required Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related technical discipline.

At least 5 years of experience in DevOps, SRE, or Platform Engineering roles with leadership experience in automation and infrastructure reliability.

3+ years of hands-on experience in high-availability production environments with cloud-native security and observability tooling.

Deep expertise in AWS (or equivalent cloud platform), especially in compute, networking, IAM, and monitoring.

Proficiency in Terraform, CloudFormation, Kubernetes, Docker, and Linux systems.

Strong knowledge of observability stacks (Prometheus, Grafana, ELK, Datadog, CloudWatch).

Experience implementing and managing CI/CD systems with security tollgates and rollback logic.

Strong scripting skills in Python, Go, or Bash for automation and tooling.

In-depth understanding of SRE practices including incident response, SLO/SLA management, chaos engineering, and capacity modeling.

Familiarity with Git and GitOps patterns.

Proven track record of creating shared tooling and documentation that promotes operational excellence.


About Isoftech Inc