Site Reliability Engineering - Senior Engineer

Overview

On Site
Depends on Experience
Full Time

Skills

Amazon Web Services
Ansible
Budget
CHAOS
Capacity Management
Cloud Computing
Collaboration
Computer Networking
Configuration Management
Conflict Resolution
Docker
Documentation
Google Cloud
Google Cloud Platform
Incident Management
Java
Knowledge Sharing
Kubernetes
Microsoft Azure
Optimization
Performance Tuning
Problem Management
Problem Solving
Production Support
Programming Languages
Python
Reliability Engineering
Scripting
Software Architecture
Software Development
Splunk
Terraform

Job Details

Role: Site Reliability Engineering - Senior Engineer

Location: Jersey City, NJ, 07302 / Edison, NJ, 08817

Duration: Contract

Must have skills:

Python or Java

Splunk Cloud, ThousandEyes

Cloud platforms such as AWS, Google Cloud, or Azure

Docker and Kubernetes

Responsibilities:

  • System Reliability: Work with production support teams to implement scalable, maintainable systems, and continuously seek improvements and optimizations in infrastructure and application architecture.
  • Toil Reduction and Automation: Build and maintain tools and scripts that automate repetitive tasks, deployment processes, monitoring, and incident response, reducing manual intervention and minimizing human error.
  • Incident Management: Participate in on-call rotations and major incident response, promptly investigate and resolve service outages and system issues, and conduct post-mortems that feed into problem management to prevent recurrence.
  • Monitoring and Alerting: Establish and maintain monitoring and alerting systems that proactively identify potential issues and notify the relevant teams promptly during critical situations.
  • Capacity Planning and Performance Optimization: Monitor system performance, identify bottlenecks, collaborate with engineering teams on performance optimization, and plan for future growth.
  • Error Budgeting and Chaos Engineering: Diagnose and recommend optimization opportunities, and conduct mock drills to improve stability and resiliency.
  • Documentation: Develop and maintain comprehensive documentation for system configurations, processes, and troubleshooting procedures to enhance knowledge sharing and team efficiency.

Minimum Qualifications:

  • Knowledgeable in cloud platforms such as AWS, Google Cloud, or Azure, and familiar with containerization technologies like Docker and Kubernetes.
  • Proficient in using infrastructure-as-code tools like Terraform and Ansible for automation and configuration management.

Preferred Qualifications:

  • Experienced in software development with proficiency in programming languages like Python or Java.
  • Familiar with monitoring and logging tools such as Splunk Cloud and ThousandEyes.
  • Understands networking principles and protocols.
  • Capable of working collaboratively in a fast-paced, dynamic environment with excellent problem-solving skills.


About Purple Drive Technologies LLC