SRE Engineer

Overview

On Site
Depends on Experience
Accepts corp to corp applications
Contract - W2
Contract - 12 Month(s)

Skills

SRE
IaC
ELK stack
Python
Datadog

Job Details

Role: SRE

Location: Fort Mill, SC (Fully Onsite)

Duration: 12+ Months Contract

Responsibilities:

  • Design and architect systems that are highly available, scalable, and reliable through collaboration with cross-functional teams.
  • Lead incident response efforts during system outages or performance degradations, coordinating with various teams to quickly diagnose issues and implement solutions. Develop and refine incident management processes.
  • Provide mentorship and guidance to help develop technical skills and expertise within the team and stakeholders across the organization. Share best practices, provide constructive feedback, and foster a culture of continuous learning and improvement.
  • Drive automation initiatives to streamline deployment, configuration, monitoring, and maintenance processes. Develop automation tools and frameworks to increase operational efficiency, reduce manual intervention, and improve reliability.
  • Maintain awareness and understanding of industry trends, emerging technologies, and best practices in site reliability engineering. Evaluate new tools, technologies, and methodologies to enhance system reliability, scalability, and security, and implement them as appropriate within the organization.

Skills:

  • Bachelor's degree (or equivalent) in computer science or related discipline
  • Strong understanding of system architecture principles, including designing scalable, fault-tolerant, and highly available systems.
  • Advanced experience with containerization technologies such as Docker and container orchestration tools like Kubernetes to manage and scale containerized applications.
  • Expertise in automation tools and Infrastructure as Code (IaC) to automate deployment, configuration, and management of infrastructure resources using tools like Terraform, Ansible, or Puppet.
  • Expertise in implementing monitoring and alerting platforms using tools like Prometheus, Grafana, or ELK stack.
  • Expertise in facilitating the adoption of observability platforms for logging, metrics, and application performance monitoring (APM).
  • Strong scripting and programming skills in languages such as Python, Go, Ruby, PowerShell, or shell scripting.
  • Knowledge of database technologies and experience monitoring and alerting on database issues is highly desirable.
  • Demonstrated ability to respond promptly to incidents, coordinate with cross-functional teams, and lead incident response efforts to resolve issues quickly and minimize downtime.
  • Strong communication and collaboration skills to work closely with all stakeholders.
  • Demonstrated ability to communicate technical concepts clearly.
  • Proficient in establishing SLOs and in identifying and creating SLIs and error budgets

Experience:

  • Advanced proficiency with logging platforms like ELK, Splunk, Graylog, Loggly, Fluentd
  • Proficient with APM tools like Dynatrace, New Relic, Datadog, AppDynamics
  • Experience building and managing metrics tools like Prometheus, Grafana, or Datadog
  • Proficient in deploying, managing, configuring, and driving adoption of ServiceNow and ServiceNow modules
  • Proficient in building, maintaining, and enhancing delivery pipelines with CI/CD tools like Jenkins, GitLab CI/CD, CircleCI, Travis CI, GitHub Actions.