Overview
Hybrid
Depends on Experience
Contract - W2
Contract - 78 weeks
Skills
PYTHON
Job Details
About the Role
We’re looking for a Site Reliability Engineer (SRE) to join our engineering team and help build scalable, reliable systems while driving observability and service performance improvements. You’ll collaborate closely with software engineers, data scientists, and DevOps practitioners to enhance service reliability and efficiency across the platform. Your work will play a key role in meeting SLAs, tracking SLOs/SLIs, and improving DORA metrics.
Key Responsibilities
- Design and implement highly available, low-latency, and observable systems and infrastructure components.
- Build tools and dashboards for visualization, tracing, and optimization to enhance system reliability and performance.
- Monitor and drive improvements across DORA metrics (deployment frequency, lead time for changes, mean time to recovery, and change failure rate).
- Establish and maintain SLA, SLO, and SLI definitions and processes in collaboration with service owners.
- Participate in on-call rotations and lead incident response processes with a focus on continuous improvement and postmortems.
- Collaborate cross-functionally to identify system bottlenecks and propose architecture or code-level changes.
- Champion DevOps practices such as CI/CD, automated testing, and infrastructure-as-code.
- Review pull requests and offer guidance to uphold high standards in code quality and reliability.
Basic Qualifications
- 5+ years of software engineering experience, ideally in reliability-focused or DevOps-heavy environments.
- Strong coding skills in Python and at least one statically typed language (e.g., TypeScript, Java).
- Proficiency with AWS core services (e.g., IAM, S3, Lambda, Kinesis, SNS).
- Experience with observability tools (e.g., OpenTelemetry, Datadog, Prometheus, Grafana, Honeycomb).
- Practical knowledge of CI/CD pipelines, Docker, and system automation.
- Familiarity with infrastructure-as-code tools like Terraform, AWS CDK, or CloudFormation.
- Working knowledge of distributed systems and trade-offs across SQL/NoSQL storage solutions.
Preferred Qualifications
- Hands-on experience implementing and tracking SLAs/SLOs/SLIs.
- Familiarity with performance profiling, distributed tracing, and root cause analysis.
- Experience implementing practices that improve DORA metrics.
- Exposure to real-time data infrastructure or event-driven architecture.
- Prior participation in an on-call rotation or incident management lifecycle.
Who You Are
- You’re passionate about building stable, efficient, and observable systems.
- You’re proactive in identifying reliability risks and driving solutions.
- You balance engineering excellence with pragmatic, operational solutions.