Job Details
Develop and automate solutions using languages such as Python, Go, or Bash.
Implement and maintain Infrastructure-as-Code solutions using Terraform, Ansible, or similar tools.
Define, measure, and improve SLIs, SLOs, and SLAs to ensure service reliability.
Reduce operational toil through automation and long-term engineering improvements.
Integrate and enhance observability platforms for system insight and proactive monitoring.
Participate in incident response, root-cause analysis, and post-mortem processes.
Collaborate closely with application, engineering, and business teams to deliver reliable solutions.
Required Skills & Experience:
Strong background in Site Reliability Engineering, DevOps, infrastructure, or software engineering.
Hands-on experience with cloud platforms, especially Microsoft Azure.
Proficiency working with Linux (RHEL 7+) and Windows Server (2019+) environments.
Solid understanding of networking fundamentals and storage technologies (NFS, SAN, NAS).
Strong working knowledge of DNS and of authentication and directory services such as LDAP, Kerberos, and Centrify.
Practical experience with automation, scripting, and configuration management.
Experience integrating and maintaining monitoring/observability tools to support system uptime.
Calm, structured approach to managing high-pressure incidents and outages.
Excellent communication and collaboration skills.
Ownership mindset with a drive to continually improve systems and reduce manual effort.