Location: Charlotte, NC
Salary: $69.00 USD Hourly - $74.00 USD Hourly
Description: Software Engineer IV - Observability & Automation (SRE / Production Operations)
Locations: Charlotte, NC | Irving, TX | Chandler, AZ
Work Model: Hybrid (3 days in office required)
Employment Type: Contract (18 months, potential extension or conversion)
On-Call: Yes (24x7 rotation)
About the Role
We are seeking a senior, hands-on Software Engineer to support and evolve large-scale application and middleware platforms with a Site Reliability Engineering (SRE) mindset. This role focuses on production reliability, observability, and automation, shifting operations from reactive support to proactive, engineered reliability.
You will serve as an L2/L3 escalation point for mission-critical systems, owning incident response, problem management, and runbook-driven operations. You'll also build automation, infrastructure-as-code, and observability solutions that reduce toil, improve MTTR, and increase platform stability across VM-based and container-adjacent environments, including OpenShift (OCP).
This role supports a fast-growing platform portfolio (200+ applications, scaling rapidly) and requires strong architectural understanding, technical depth, and the ability to adapt across technologies.
What You'll Do
- Act as a senior escalation point for L2/L3 production incidents, leading troubleshooting, recovery, and stabilization of application and middleware services.
- Apply SRE practices daily: define and improve reliability signals, enhance alert quality, conduct blameless post-incident reviews, and prioritize systemic fixes over manual work.
- Design and operate observability solutions (logs, metrics, traces, dashboards, and actionable alerts) to improve detection, diagnosis, and recovery times.
- Build and maintain automation and infrastructure-as-code to support repeatable, audited, and resilient operations across VM and container-adjacent platforms.
- Develop standardized operational automation (status checks, start/stop/restart patterns) to reduce dependency bottlenecks and enable safe self-service.
- Implement intelligent automation (including AI-assisted operations where appropriate) with strong guardrails for accuracy, security, and compliance.
- Monitor and remediate configuration drift; support automated compliance validation aligned with enterprise risk and change management.
- Integrate infrastructure and operational automation into CI/CD pipelines for safer, consistent rollouts.
- Support shared platform components such as ingress, load balancing integrations, and common middleware services.
- Create and maintain runbooks, operational documentation, and validation procedures to ensure consistent execution and operational readiness.
- Participate in an on-call rotation supporting 24x7 production operations.
Minimum Qualifications
- 5+ years of experience in software engineering, systems engineering, or production operations, or equivalent practical experience.
- Hands-on experience supporting production applications or middleware in complex, highly available environments.
- Strong troubleshooting skills with the ability to understand system architecture, capacity constraints, and failure modes.
- Experience with automation or scripting (e.g., Python, Bash, PowerShell, or similar).
- Experience working in Linux and/or Windows Server production environments.
- Familiarity with Git-based workflows and infrastructure or configuration as code.
- Ability to learn new technologies quickly and adapt across diverse platforms.
Preferred Qualifications
- Experience supporting container-adjacent or Kubernetes-based platforms, including OpenShift (OCP).
- Experience implementing SRE operating practices (reliability metrics, alert engineering, toil reduction).
- Experience with more than one observability platform (e.g., Splunk, Elastic, or similar).
- Experience with automation frameworks (Ansible or equivalent).
- Experience integrating operations with CI/CD pipelines.
- Exposure to responsible AI usage in operations (automation assistance, predictive signals, guarded remediation).
- Strong communication skills and experience working in regulated or enterprise environments.
What Success Looks Like
- Reduced incident frequency and faster recovery times through better observability and automation.
- Measurable reduction in operational toil and manual intervention.
- Reliable, auditable, and repeatable platform operations at scale.
- Clear, maintainable documentation and runbooks that enable consistent execution.
- Strong partnership with application, infrastructure, and security teams.
Additional Information
- Hybrid role with mandatory in-office presence 3 days per week.
- Participation in 24x7 on-call rotation, including nights, weekends, and holidays.
- This role is not eligible for visa sponsorship.
- Relocation assistance is not available.
Contact: This job and many more are available through The Judge Group. Please apply with us today!