Remote Opportunity || Site Reliability Engineer/ Lead || Full Time (Permanent Position)

Overview

Remote
Full Time

Skills

Kubernetes
Docker
Site Reliability Engineer

Job Details

Position: Site Reliability Engineer / Lead

Location: Remote - USA

Mode of Hire: Full Time

Job Description:

Mandatory Skills: Docker, Kubernetes, public cloud (AWS or Google Cloud Platform), and Python, Shell, or another scripting language.

  • Proven experience in technical project management, leading and managing DevOps/Agile projects and ensuring they are delivered on time, within scope, and on budget.
  • Maintain strong customer engagement while building processes that let all relevant team members engage with the customer.
  • Collaborate with stakeholders to define, measure, and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
  • Work closely with cross-functional teams to plan, design, and implement reliability improvements and automation initiatives.
  • Facilitate post-incident reviews (PIRs), ensuring action items are identified and followed through.
  • Ensure that development, staging, and production environments are correctly configured.
  • Drive initiatives to automate manual tasks and improve system observability and monitoring.
  • Facilitate knowledge sharing across teams to ensure best practices are followed and operational knowledge is captured.
  • Ensure all dependencies (libraries, services, etc.) are installed and compatible. Compile the code and create build artifacts.
  • Use containerization (e.g., Docker) to package applications with their dependencies.
  • Ensure compliance with organizational security policies.
  • Decide on a deployment strategy (e.g., blue-green deployment, canary releases, rolling updates). Define rollback procedures in case the deployment fails.
  • Ensure changes are reviewed, approved, and documented.
  • Use CI/CD pipelines (e.g., Jenkins, GitLab CI, CircleCI) to automate the build, test, and deployment process.
  • Ensure tests (unit, integration, end-to-end) are passing before deploying to production.
  • Implement automated rollback mechanisms to revert to the previous stable version in case of a failed deployment.
  • Ensure load balancers are properly configured to distribute traffic evenly across instances.
  • Maintain up-to-date deployment playbooks, runbooks, and architecture diagrams. Continuously refine deployment processes based on feedback.

PSRTEK is a reputed technology recruitment and IT staffing brand with a global footprint and an admired client base. As an ideas and innovation powerhouse with a culture of excellence, we bring remarkable expertise and deliver powerfully transformative results.
