Lead Site Reliability Engineer (Remote, Full Time)

Overview

Remote
Depends on Experience
Full Time

Skills

Linux
Amazon Web Services
Google Cloud Platform
Shell
CircleCI
Docker
DevOps
Continuous Delivery
Continuous Integration
Jenkins
Kubernetes
Python
Scripting
Service Level

Job Details

Role: Lead Site Reliability Engineer

Job Location: Remote (anywhere in the US)

Job Type: Full time

Job Description:
Experience Range: 10+ years
Mandatory Skills: Linux, AWS or Google Cloud Platform, Kubernetes, and Python, Shell, or another scripting language.
1. Proven experience in technical project management, leading and managing DevOps/Agile projects and ensuring they are delivered on time, within scope, and on budget.
2. Maintain strong customer engagement while building processes that let all relevant team members engage with the customer.
3. Collaborate with stakeholders to define, measure, and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
4. Work closely with cross-functional teams to plan, design, and implement reliability improvements and automation initiatives.
5. Facilitate post-incident reviews (PIRs), ensuring action items are identified and followed through.
6. Ensure that development, staging, and production environments are correctly configured.
7. Drive initiatives to automate manual tasks and improve system observability and monitoring. Facilitate knowledge sharing across teams to ensure best practices are followed and operational knowledge is captured.
8. Ensure all dependencies (libraries, services, etc.) are installed and compatible. Compile the code and create build artifacts.
9. Use containerization (e.g., Docker) to package applications with their dependencies. Ensure compliance with organizational security policies.
10. Decide on a deployment strategy (e.g., blue-green deployment, canary releases, rolling updates). Define rollback procedures in case the deployment fails.
11. Ensure changes are reviewed, approved, and documented.
12. Use CI/CD pipelines (e.g., Jenkins, GitLab CI, CircleCI) to automate the build, test, and deployment process.
13. Ensure tests (unit, integration, end-to-end) are passing before deploying to production.
14. Implement automated rollback mechanisms to revert to the previous stable version in case of a failed deployment.
15. Ensure load balancers are properly configured to distribute traffic evenly across instances.
16. Maintain up-to-date deployment playbooks, runbooks, and architecture diagrams. Continuously refine deployment processes based on feedback.
