Overview
On Site
BASED ON EXPERIENCE
Full Time
Contract - Independent
Contract - W2
Skills
JIRA
Confluence
Communication
Kubernetes
Docker
Cloud Computing
Amazon Web Services
Microsoft Azure
Google Cloud Platform
Google Cloud
Apache Cassandra
Middleware
Apache Kafka
Microservices
Analytics
Splunk
AppDynamics
Grafana
Terraform
Ansible
Scripting
Python
Bash
Shell Scripting
Workflow
Performance Analysis
Forecasting
ServiceNow
Elasticsearch
Kibana
Team Leadership
Mentorship
Systems Design
Scalability
High Availability
Incident Management
Dashboard
Collaboration
DevOps
Capacity Management
SLA
Management
Service Level
Continuous Improvement
Reliability Engineering
Documentation
Knowledge Sharing
Job Details
Site Reliability Engineer Lead
Work Location: Frisco, Texas
Required Skills
• Minimum of 10 years of experience in a relevant area.
• Team Leadership: Strong ability to mentor and manage teams using collaborative platforms like Jira, Teams, and Confluence. Excellent communication and collaboration skills.
• System Design and Architecture: Expertise in designing scalable and reliable systems using tools like Kubernetes, Docker, and cloud services (AWS, Azure, Google Cloud Platform). Experience with Kafka, Cassandra, and other infrastructure tools. Familiarity with middleware technologies, APIs, and microservices architecture.
• Incident Management: Proficiency in managing incidents using tools like PagerDuty and xMatters, and in conducting effective post-mortems.
• Monitoring and Analytics: Experience with monitoring tools such as Splunk, AppDynamics, Grafana, and Prometheus for proactive issue detection.
• Automation: Skilled in using automation tools like Terraform and Ansible, and scripting languages (Python, Bash, shell scripting), to streamline workflows.
• Capacity Planning: Familiarity with performance analysis and forecasting tools to ensure infrastructure scalability.
• SLA/SLO Management: Defining and tracking reliability goals using SRE best practices and tools like ServiceNow.
• Continuous Improvement: Ability to assess system reliability with tools like the ELK Stack (Elasticsearch, Logstash, Kibana) and implement enhancements.
Job description
• Team Leadership: Leading and mentoring the SRE team, ensuring they have the resources and guidance needed to perform their roles effectively.
• System Design and Architecture: Overseeing the design and architecture of reliable systems, ensuring scalability, fault tolerance, and high availability.
• Incident Management: Coordinating response to incidents, conducting post-mortems, and implementing measures to prevent recurrence.
• Monitoring and Performance: Setting up and maintaining monitoring tools and dashboards to track system performance and detect issues proactively.
• Automation: Developing and promoting automation for repetitive tasks to reduce human error and improve efficiency.
• Collaboration: Working closely with development, operations, and other cross-functional teams to ensure smooth integration and deployment of new features.
• Capacity Planning: Analyzing system capacity and planning for future growth to ensure the infrastructure can handle increased demand.
• SLA/SLO Management: Defining and managing Service Level Agreements (SLAs) and Service Level Objectives (SLOs) to meet business requirements.
• Continuous Improvement: Identifying areas for improvement in system reliability and performance and driving initiatives to address them.
• Documentation: Ensuring proper documentation of systems, processes, and incident responses to maintain knowledge sharing and consistency. A good understanding of APIs is also expected.