Overview
On Site
$50 - $70
Contract - W2
Contract - 12 Month(s)
Skills
devops
ai
artificial intelligence
Job Details
About the Role
We're looking for a skilled DevOps Engineer to join our Observability & AIOps team. This role is at the heart of ensuring that Comcast's large-scale distributed systems are reliable, observable, and intelligently automated. You'll design, build, and maintain platforms that provide deep visibility into our services, leverage AI/ML for operational insights, and drive automated incident response.
Key Responsibilities
- Build & Maintain Observability Infrastructure
  - Deploy, configure, and manage tools for metrics, logs, and traces (e.g., Prometheus, Grafana, ELK stack, OpenTelemetry, Jaeger, Datadog, Splunk).
  - Ensure telemetry data is complete, accurate, and accessible across systems and environments.
- Automation & CI/CD Integration
  - Integrate observability tools with CI/CD pipelines (Jenkins, GitLab CI, ArgoCD).
  - Automate deployment and scaling of monitoring agents using infrastructure-as-code (Terraform, Ansible, Helm).
- AIOps & Intelligent Alerting
  - Collaborate with data scientists and platform engineers to feed clean observability data into AI/ML pipelines.
  - Implement anomaly detection, alert deduplication, and predictive maintenance solutions.
- Incident Management & SRE Practices
  - Partner with SRE teams to define and monitor SLIs/SLOs.
  - Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) through automation and intelligent alerting.
  - Contribute to incident response playbooks and post-incident reviews.
- Dashboards & Developer Experience
  - Build and maintain custom dashboards that visualize service health and performance.
  - Provide self-service observability tools that empower development and operations teams.
  - Treat observability as a product, focusing on usability, reliability, and scalability.
Qualifications
- 3+ years of experience as a DevOps Engineer or SRE, or in a similar role, in large-scale, cloud-based environments.
- Solid knowledge of observability concepts (metrics, logs, traces) and tools (Prometheus, Grafana, ELK stack, OpenTelemetry, etc.).
- Hands-on experience with cloud platforms (AWS, Google Cloud Platform, or Azure), Kubernetes, and Docker.
- Proficiency with automation and IaC tools (Terraform, Ansible, Helm).
- Familiarity with incident management tools (PagerDuty, OpsGenie, ServiceNow).
- Strong scripting skills (Python, Bash, or similar).
- Excellent problem-solving and communication skills, with an ability to work across teams.
Preferred Skills
- Experience applying AI/ML techniques to IT operations or monitoring.
- Knowledge of SRE practices (SLIs/SLOs, error budgets).
- Background in high-scale, distributed systems.