Career Opportunity:
Job Title: Principal SRE Engineer
About CodeForce 360
Making a career choice is among the most critical decisions one can make, and it is important to weigh factors such as a company's record of success since its inception. But when you come across a company whose reputation rests on an illustrious run of success since the day it began, you don't need to think of anything else. That's precisely what some of our employees and prospective employees thought when they came across CodeForce 360.
Position Overview
Principal SRE Engineer
Requirements:
The most valuable profile for us is someone who has personally driven the operationalization of SRE frameworks, not just at a strategic level but through hands-on execution. This would include areas such as:
- Defining and implementing SLIs/SLOs and reliability targets that align with the department's Golden Pathways
- Building and operationalizing observability standards (metrics, logs, traces)
- Designing and evolving incident management and RCA practices
- Driving automation and reliability engineering workflows
- Establishing service health dashboards and telemetry pipelines
- Working closely with engineering teams to embed reliability into development and operations
Ideally this would be someone who has stood up or significantly evolved SRE programs in complex enterprise environments and can help accelerate implementation of the practices we are defining.
This role would be very execution-focused – someone comfortable rolling up their sleeves, working with the engineering teams directly, and helping us operationalize the reliability model across our platforms.
- Design and build a central SRE operating view
- Implement golden-pathway telemetry across:
i. App Performance Monitoring (APM) – Service response times, transaction bottlenecks
ii. Logging & Tracing – correlated logs, structured tracing
iii. Event & Alerting – actionable event definitions tied to severity
iv. RCA/Tagging Compliance monitoring – auto-tagging and RCA lifecycle ingestion
v. Build executive-level scorecards and dashboards via Grafana and ServiceNow Performance Analytics:
- Per-app reliability score
- SRE maturity score
- Mean time to detect/respond/restore (MTTx)
- Escalation patterns and failure root-cause trends
- Enable Long-Term SRE Governance
- Establish SRE telemetry ingestion pipelines
- Design alert logic for low-quality signals
- Build RCA tagging enforcement playbooks
- Deliver runbooks and telemetry integration guides per application type
- Centralized SRE Golden Dashboard – Single Pane of Glass
- A central pillar of this initiative is the creation of a Centralized SRE Golden Dashboard, serving as a Single Pane of Glass for executive and operational visibility across all 40+ applications.
i. The dashboard will:
- Aggregate key telemetry: reliability metrics, RCA themes, MTTR, incident volumes, tag compliance, alert noise, performance degradation, and resilience scoring.
- Display per-app SRE health scores based on the maturity framework.
- Include dynamic drilldowns into:
- Incident hygiene (tagging, closure quality, RCA ownership)
- SLAs/OLAs/SLIs/SLOs/error budgets, cleanly architected
- Alerting trends and noise correlation
- Capacity/resiliency warnings
- Serve as the definitive executive reporting source – used for monthly reviews, CIO/VP visibility, and roadmap investment decisions.
How to Apply
Job ID: JPC - 227332
For more information, please contact:
Bhushan Reddy
Qualified individuals will be contacted for an interview.