SRE DevOps Engineer (W2)

Frisco, TX, US • Posted 4 days ago • Updated 4 days ago

Full Time

Travel Required

On-site

Depends on Experience

Job Details

Skills

Database
Computer Networking
Conflict Resolution
Artificial Intelligence
API Management
CPU
Amazon Web Services

Summary

SRE DevOps Engineer

Location: Overland Park, KS / Atlanta, GA / Frisco, TX (Onsite)

Requirements

Qualifications

4–9 years in SRE/DevOps/Systems Engineering as a Senior or Principal Engineer.

Strong hands-on experience with Kubernetes, container orchestration, and API management.

Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).

Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.).

Strong observability/monitoring experience.

Experience with CI/CD pipelines, GitOps, and infrastructure as code.

Solid problem-solving and collaboration skills.

Job responsibilities

Resolve escalated incidents across Kubernetes, API Proxy, WAF, DBs, and infra platforms.

Design and improve runbooks, automating manual steps wherever possible.

Lead and contribute to building self-healing systems and self-service tooling for users.

Analyze incident trends, propose improvements in monitoring, capacity, and reliability.

Collaborate with engineering teams on deployment, upgrades, and performance optimization.

Conduct postmortems, document RCA, and ensure learning is captured.

Mentor and coach L1 engineers.

Skills

Mandatory Skills (Must-Have)

1. Advanced Incident Troubleshooting & Resolution

Expectation: Diagnose and resolve escalated incidents that L1 cannot handle, often across multiple layers (infrastructure, application, network).

Example: For an API outage, identify whether the root cause is in Kubernetes pod networking, an API gateway misconfiguration, or backend DB latency, and apply the fix.

2. Kubernetes & Container Orchestration Expertise

Expectation: Comfortable with deployments, scaling, networking, and debugging cluster-level issues.

Example: Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and cluster autoscaler logs.
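
The taints/tolerations part of that check can be sketched in Python. This is a minimal illustration that follows the documented Kubernetes matching rules, not part of the role's actual tooling; the class and function names (Taint, Toleration, tolerates, blocking_taints) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str       # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key (with operator "Exists") matches any key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def blocking_taints(node_taints, pod_tolerations):
    """List the node taints that would keep the pod from being scheduled.

    Only NoSchedule and NoExecute taints block scheduling; PreferNoSchedule
    is a soft preference and is ignored here.
    """
    hard_effects = {"NoSchedule", "NoExecute"}
    return [
        taint
        for taint in node_taints
        if taint.effect in hard_effects
        and not any(tolerates(t, taint) for t in pod_tolerations)
    ]
```

A non-empty result for every node in the cluster would point to taints, rather than capacity, as the reason a pod stays pending.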

3. Automation & Scripting (Python, Go, Bash, Ansible, Terraform)

Expectation: Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.

Example: Develop a Python script to automatically collect pod and system logs when a service crashes.
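
A script of that kind might start from something like the sketch below. It assumes kubectl is installed and already pointed at the right cluster; the function names (build_commands, collect_diagnostics) and the choice of which commands to run are illustrative assumptions, not a prescribed implementation.

```python
import subprocess


def build_commands(namespace: str, pod: str) -> dict:
    """Map a label to the kubectl argv that gathers that piece of evidence."""
    ns = ["--namespace", namespace]
    return {
        # Logs from the currently running containers.
        "current_logs": ["kubectl", "logs", pod, *ns, "--all-containers", "--timestamps"],
        # Logs from the previous container instance, i.e. the one that crashed.
        "previous_logs": ["kubectl", "logs", pod, *ns, "--all-containers", "--timestamps", "--previous"],
        # Pod status, restart counts, and last termination reason.
        "describe": ["kubectl", "describe", "pod", pod, *ns],
        # Cluster events that reference this pod.
        "events": ["kubectl", "get", "events", *ns, "--field-selector", f"involvedObject.name={pod}"],
    }


def collect_diagnostics(namespace: str, pod: str, run=subprocess.run) -> dict:
    """Run each command and return its output keyed by label.

    `run` is injectable so the function can be exercised without a cluster.
    """
    results = {}
    for label, argv in build_commands(namespace, pod).items():
        proc = run(argv, capture_output=True, text=True, check=False)
        if proc.returncode == 0:
            results[label] = proc.stdout
        else:
            # Keep going: a missing previous container should not hide the rest.
            results[label] = f"[failed] {proc.stderr}"
    return results
```

In practice the returned dictionary could be written to a timestamped directory or attached to the incident ticket, triggered by whatever alert signals the crash.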

4. Observability & Monitoring Tooling

Expectation: Deep understanding of monitoring, alerting, tracing, and logging systems.

Example: Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.

5. CI/CD & Infrastructure as Code (IaC)

Expectation: Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.

Example: Enhance a Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.

6. Database Troubleshooting (SQL & NoSQL)

Expectation: Identify performance bottlenecks, connection issues, and basic tuning opportunities.

Example: Run queries to detect slow-running SQL statements causing latency in an application.

7. Incident Management & RCA

Expectation: Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.

Example: After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.

8. Mentorship & Runbook Improvement

Expectation: Coach L1 engineers, refine runbooks, and introduce new automated workflows.

Example: Update a runbook to add automated Kubernetes log collection instead of manual steps.

Preferred Skills (Nice-to-Have)

1. Cloud Platform Engineering (AWS, Azure, Google Cloud Platform)

Expectation: Hands-on skills in provisioning, scaling, and securing cloud workloads.

Example: Diagnose why an AWS ALB is misrouting traffic after a deployment.

2. Security & WAF Management

Expectation: Understand WAF rules, common attacks (SQL injection, XSS), and how to apply fixes.

Example: Investigate false positives in WAF logs and adjust rule sets with security teams.

3. Capacity & Performance Engineering

Expectation: Anticipate scaling needs, tune resource utilization, and propose optimizations.

Example: Identify that a Kubernetes deployment is CPU-throttled and adjust HPA (Horizontal Pod Autoscaler) configs.

4. Automation Platform Integration (AIOps, ChatOps)

Expectation: Integrate AI/ML-powered tools for anomaly detection and auto-remediation.

Example: Implement a ChatOps bot that runs predefined Kubernetes troubleshooting commands in Slack.

5. Cross-Platform Expertise (Hybrid Infra)

Expectation: Experience supporting both on-prem and cloud environments seamlessly.

Example: Compare latency patterns between on-prem DBs and cloud-hosted APIs to identify bottlenecks.

Thanks & Regards
Radha | SR US IT Recruiter

Highbrow LLC

Dice Id: 91126058
Position Id: 8853353