Job Details
Job Title - SRE Engineer
Location - Onsite at Fort Mill, SC
Duration - 12 Months Contract
Job Overview
We are seeking a hands-on Site Reliability Engineer (SRE) with strong observability experience to join our team onsite in Fort Mill, SC. The ideal candidate will have robust knowledge of SRE principles, advanced experience with observability tools, and a proven track record in production support and automation. You should be capable of building dashboards from scratch, driving process improvements, and collaborating closely with development teams.
Responsibilities
-  Build and maintain dashboards using tools such as Grafana, Dynatrace, and ELK to ensure deep visibility into production environments.
-  Design, implement, and maintain SRE practices, including Error Budgets, SLOs (Service Level Objectives), SLIs (Service Level Indicators), and NFRs (Non-Functional Requirements) to support business reliability objectives. 
-  Lead root cause investigations for incidents and proactively identify and address system anomalies. 
-  Drive the reduction of toil by automating repetitive tasks and streamlining processes across the SDLC and IT operations.
-  Develop, implement, and enhance CI/CD pipelines using Git, GitHub Actions, GitHub Workflows, Jenkins, and similar tools. 
-  Work closely with software engineers to ensure successful releases by improving application design, deployment, and monitoring workflows. 
-  Assess, define, and roll out SRE approaches and solutions for various products while leading the development of SRE dashboards. 
-  Design, develop, and deliver infrastructure automation leveraging Ansible Tower, Terraform, and other Infrastructure-as-Code (IaC) technologies. 
-  Maintain, troubleshoot, and optimize cloud infrastructure, drawing on strong hands-on experience with AWS and container orchestration.
-  Leverage observability and monitoring platforms (Dynatrace, Splunk, Elastic Stack, SolarWinds DPA) for real-time alerting, monitoring, and issue resolution. 
-  Mentor development teams on SRE best practices and methodologies, and drive continuous improvement initiatives focused on reliability and cost optimization. 
Qualifications
-  Demonstrable hands-on experience with observability tools (Grafana, Dynatrace, ELK/Elastic Stack) and with scripting.
-  Deep understanding and application of core SRE principles: CUJs (Critical User Journeys), SLOs, SLIs, Error Budgets, and NFRs.
-  Experience building (not just consuming) dashboards in observability platforms. 
-  Proficiency in .NET, SQL, React, Python, and other scripting/programming languages, along with tools such as Ansible Tower, Terraform, Splunk, and SolarWinds DPA.
-  Cloud platform expertise (AWS strongly preferred). 
-  Proven experience with CI/CD practices using Git, GitHub Actions, GitHub Workflows, and Jenkins. 
-  Strong knowledge of Infrastructure as Code (IaC) and container orchestration (e.g., Kubernetes). 
-  Production support and root cause analysis experience in high-availability environments. 
-  Strong communication skills, proven ability to collaborate across teams, and a problem-solving mindset. 
-  Familiarity with AIOps principles and automation best practices. 
-  Experience with automation design and implementation to reduce manual effort within SDLC and IT operations. 
-  Leadership in building SRE dashboards and developing error budget frameworks for products. 
-  Experience in driving incident management and on-call processes.