SRE (Site Reliability Engineer)
Hybrid in Alpharetta, GA, US • Posted 5 hours ago • Updated 5 hours ago

Clarkstech
Job Details
Skills
- Telemetry
Summary
We are seeking Site Reliability Engineers (SREs) with mandatory, hands-on expertise in telemetry, observability, and site monitoring platforms. This is a hybrid contract role based in Alpharetta, GA or Berkeley Heights, NJ.
This role requires proven, production-level experience with enterprise observability stacks.
Key Responsibilities
- Design, implement, and maintain comprehensive telemetry and observability solutions across distributed enterprise systems with complex architectures.
- Build, optimize, and scale real-time monitoring dashboards, metrics pipelines, and intelligent alerting systems using industry-standard tools including Datadog, Splunk, Prometheus, Grafana, ELK Stack, and similar platforms.
- Implement end-to-end observability strategies encompassing metrics, logs, traces, and events to ensure complete system visibility.
- Develop and maintain custom instrumentation for applications and infrastructure to capture critical telemetry data.
- Collaborate with engineering teams to embed reliability practices and ensure systems are resilient, observable, and performant.
- Automate monitoring workflows, alert management, and reliability tasks using Python, Shell, or Go scripting.
- Lead incident response efforts: rapidly identify, troubleshoot, and resolve production issues using observability data and telemetry analysis.
- Design and implement SLOs/SLIs, error budgets, and reliability KPIs with corresponding monitoring and alerting for mission-critical services.
- Develop self-healing and auto-remediation capabilities leveraging observability insights.
- Partner with DevOps, Cloud, and Security teams to integrate observability into CI/CD pipelines and optimize infrastructure reliability.
- Conduct post-incident reviews with detailed telemetry analysis and drive systemic improvements.
Mandatory Skills & Qualifications
Telemetry & Observability (MANDATORY)
Candidates MUST demonstrate hands-on, production experience with the following:
- Observability Platforms (REQUIRED): Deep expertise in at least TWO of the following:
  - Datadog (metrics, APM, logs, traces)
  - Splunk (log aggregation, search, alerting, dashboards)
  - Prometheus (time-series metrics, PromQL, alerting rules)
  - Grafana (visualization, dashboard creation, data source integration)
  - ELK Stack (Elasticsearch, Logstash, Kibana)
- Telemetry & Monitoring Fundamentals (REQUIRED):
  - Building and maintaining metrics collection pipelines
  - Log aggregation, parsing, and analysis at scale
  - Distributed tracing and application performance monitoring (APM)
  - Creating actionable alerts with proper signal-to-noise ratios
  - Dashboard design for real-time system health visualization
  - Metrics instrumentation and custom telemetry implementation
- Observability Best Practices (REQUIRED):
  - Implementing the three pillars of observability: metrics, logs, and traces
  - Correlation of telemetry data across multiple sources
  - Establishing observability for microservices and distributed systems
  - Capacity planning using historical telemetry data
  - Performance baselining and anomaly detection
Core SRE Requirements (MANDATORY)
- 4-8 years of professional experience in Site Reliability Engineering or DevOps roles with significant focus on observability
- Proven track record in incident management and on-call support in enterprise production environments, using observability tools for rapid diagnosis
- Proficiency in Linux system administration, networking, and performance tuning
- Hands-on experience with cloud platforms (AWS, Azure, or Google Cloud Platform) including cloud-native monitoring solutions (CloudWatch, Azure Monitor, Google Cloud Operations)
- Solid programming/scripting skills in Python, Bash, Go, or equivalent for automation and tooling
- Familiarity with container orchestration (Kubernetes, Docker) and monitoring containerized environments
- Experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI) with integrated monitoring and observability
Nice-to-Have Skills
- AIOps and intelligent monitoring: Experience with ML-based anomaly detection, predictive monitoring, and automated incident correlation
- OpenTelemetry: Implementation experience with OpenTelemetry for standardized observability instrumentation
- Infrastructure-as-code: Terraform, Ansible, Pulumi with monitoring-as-code practices
- Security observability: Integration of security monitoring, SIEM tools, and compliance frameworks with observability stacks
- Advanced telemetry tools: Experience with Jaeger, Zipkin, New Relic, AppDynamics, Dynatrace, or other specialized APM/observability platforms
- Custom metrics exporters: Development of Prometheus exporters or custom telemetry agents
- Cost optimization: Experience optimizing telemetry data retention and observability platform costs
Engagement Rules
- Contract position (W2 only): no C2C, no agencies
- Number of positions: 4 (2 seniors with 8 years of experience and 2 juniors with at least 4 years of experience)
- Experience requirement: 4-8 years with mandatory telemetry/observability expertise
- Multi-year contract with annual extensions
- Dice Id: 91165214
- Position Id: 8859859
Company Info
About Clarkstech
At ClarksTech, we are a renowned global IT consulting firm committed to collaborating with business and societal leaders in overcoming their most critical challenges and seizing their greatest opportunities. Our achievements are rooted in fostering deep collaboration and cultivating a global community of diverse individuals who are dedicated.
We have highly skilled engineers with excellent technical knowledge and experience in using the latest software standards. We have built a large pool of knowledge that we apply to deliver solutions that meet clients’ needs, expectations, and budgets.