DevOps/SRE Engineer

Overview

Hybrid
Depends on Experience
Contract - W2
Contract - 12 Month(s)

Skills

DevOps
scripting
production support
PySpark
Azure

Job Details

Job Title: DevOps/SRE Engineer

Location: Atlanta, GA / Seattle, WA / Dallas, TX (Hybrid, 3 days onsite per week)

Duration: 12+ Months

Job Description:

Key Responsibilities:

  • Provide production support for data and application pipelines, ensuring high availability and performance.
  • Apply temporary workarounds and long-term solutions to resolve critical incidents and meet SLA requirements.
  • Maintain and execute Standard Operating Procedures (SOPs) to ensure consistent system reliability and issue handling.
  • Monitor and manage production workloads, pipelines, and infrastructure components using SRE principles.
  • Collaborate with development, QA, and operations teams to automate deployments, improve observability, and enhance stability.
  • Participate in on-call support and be available for rotational shifts, including weekends.
  • Perform root cause analysis (RCA) for production issues and implement corrective actions.
  • Ensure effective incident response, escalation, communication, and documentation practices.

Technical Skills:

Mandatory

  • Experience with Azure DevOps, GitHub Actions, or Jenkins for pipeline automation and production release processes.
  • Proficiency in Unix/Linux systems and advanced Shell scripting.
  • Working experience with Azure Databricks, PySpark, or similar cloud-native data processing platforms.
  • Sound understanding of cloud infrastructure (preferably Azure), resource optimization, and scaling.
  • Hands-on experience with monitoring tools, log aggregation (e.g., Azure Monitor, Log Analytics), and alerting systems.
  • Good SQL and scripting skills for diagnostics and issue remediation.
  • Strong debugging and performance tuning skills across infrastructure and data layers.
  • Open to working in shifts, including night and weekend support rotations.

Good to Have:

  • Knowledge of Infrastructure as Code (IaC) tools such as Terraform or ARM templates.
  • Familiarity with job schedulers and orchestrators (Control-M, Airflow, ADF).
  • Exposure to incident management systems (ServiceNow, PagerDuty, etc.).
  • Familiarity with SRE practices such as error budgets, SLIs/SLOs, chaos testing, and runbook automation.

Behavioural Skills:

  • Strong sense of ownership and accountability.
  • Ability to manage time, prioritize tasks, and handle incidents under pressure.
  • Eagerness to learn new technologies and improve existing processes.
  • Excellent communication skills to collaborate across multiple teams.
  • Problem-solving mindset and ability to drive root cause analysis to resolution.