SRE Operations Engineer/SRE Engineer/DevOps Engineer


VDart, Inc.
Job Details
Skills
- Kubernetes
- Prometheus
- Grafana
- ELK
- Splunk
- AWS
- API
Summary
Job Title: SRE Operations Engineer/SRE Engineer/DevOps Engineer
Location: Canada/Remote
Duration: 1 Year
Job Description:
The L1 SRE is the first line of defense in monitoring, triaging, and executing standardized operational tasks for all enterprise applications running on standard patterns and platforms like Kubernetes, APIs, WAF, databases, API Proxy (Gloo, APIGEE), Kafka, and Cloud (AWS/Azure/Google Cloud Platform). They will follow runbooks, leverage automation, and escalate appropriately to minimize downtime.
Skills
Mandatory Skills (Must-Have)
1. System & Infrastructure Monitoring
- Expectation: Ability to use monitoring dashboards (e.g., Grafana, Datadog, Splunk, Argos, AIOps) to identify anomalies, follow alert workflows, and escalate when thresholds are breached.
- Example: When a Kubernetes pod crash-loop is flagged in Prometheus, L1 should validate it against runbooks, check pod logs, and escalate if restart attempts fail.
2. Runbook Execution
- Expectation: Strictly follow documented steps to resolve standard incidents, and escalate when the steps do not apply or fail.
- Example: Use a provided runbook to restart a failed API proxy service; if the error persists beyond the documented steps, escalate to L2.
3. Incident Triage & Communication
- Expectation: Perform first-line triage of alerts, gather logs/metrics, categorize severity, and notify stakeholders in clear, concise language.
- Example: For a database connection timeout, collect error logs, verify service reachability, and provide a detailed incident note to L2 before escalation.
4. Kubernetes (Cloud or On-Prem) Operations Knowledge
- Expectation: Ability to check pod status, understand logs, and verify service endpoints using kubectl and monitoring tools.
- Example: Run kubectl get pods -n <namespace> to verify that the pods behind a deployment are healthy.
5. Scripting (Python, Bash, PowerShell)
- Expectation: Able to read and make small edits to scripts to automate repetitive checks.
- Example: Modify a Bash script to include an additional log path in a health check.
6. Networking & Security Awareness
- Expectation: Understand basic network troubleshooting tools (ping, netstat, curl, traceroute) and know when issues may be related to a firewall, WAF, or proxy.
- Example: For an unreachable service, confirm DNS resolution and connectivity before escalating to L2.
7. Documentation & Knowledge Capture
- Expectation: Accurately record steps taken during incidents and suggest runbook updates where gaps exist.
- Example: After handling an alert for disk usage, note missing cleanup steps in the runbook and flag them for update.
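To illustrate the Networking & Security Awareness example above (confirm DNS resolution and connectivity before escalating), here is a minimal Python sketch of such a pre-escalation check. The function name check_endpoint and the shape of the returned dictionary are illustrative assumptions, not part of any runbook or tool named in this posting.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve a host, then attempt a TCP connection, reporting each step separately.

    Hypothetical helper: the name and result format are assumptions for illustration.
    """
    result = {
        "host": host,
        "port": port,
        "dns_ok": False,
        "tcp_ok": False,
        "addresses": [],
        "detail": "",
    }

    try:
        # Step 1: DNS resolution (roughly what nslookup or dig would confirm).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns_ok"] = True
        result["addresses"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["detail"] = f"DNS resolution failed: {exc}"
        return result

    try:
        # Step 2: TCP connectivity to the service port.
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable networks.
        result["detail"] = f"TCP connect failed: {exc}"

    return result
```

A result with dns_ok set but tcp_ok unset suggests the name resolves while something on the path (firewall, WAF, proxy, or the service itself) is not accepting connections. Pasting the dictionary into the incident note gives L2 that detail at handoff.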
Preferred Skills (Nice-to-Have)
1. Cloud Platform Familiarity (AWS, Azure, Google Cloud Platform)
- Expectation: Understand the basics of cloud services (VMs, load balancers, storage) and how to navigate a cloud console.
- Example: Use AWS Console to check EC2 instance health status when a service alert is triggered.
2. Database Basics (SQL/NoSQL)
- Expectation: Run simple queries to validate DB connectivity and health.
- Example: Execute SELECT 1; to verify a database is reachable.
3. Automation & Self-Service Mindset
- Expectation: Identify repetitive manual steps and propose candidates for automation.
- Example: Flag that manual log collection during outages could be replaced with a script.
4. Exposure to Incident Management Tools (xMatters, ServiceNow, Jira, etc.)
- Expectation: Comfortable working within ITSM/incident workflows.
- Example: Log incident details in ServiceNow with accurate categorization and timestamps.
5. AI/Chatbot-Assisted Ops (emerging skill)
- Expectation: Use AI assistants to search runbooks or suggest remediation steps.
- Example: Ask an AI ops assistant to summarize logs before escalation.
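As a rough illustration of the Database Basics example above (running SELECT 1 to verify reachability), the following Python sketch wraps that check in a small helper. The name db_is_reachable is hypothetical, and the standard-library sqlite3 module stands in for whatever database driver the environment actually uses. Any DB-API-style connection factory could be passed in instead.

```python
import sqlite3


def db_is_reachable(connect, query="SELECT 1"):
    """Return (ok, detail) after trying to connect and run a trivial query.

    `connect` is any zero-argument callable returning a DB-API connection.
    Hypothetical helper for illustration only.
    """
    try:
        conn = connect()
    except Exception as exc:
        return False, f"connect failed: {exc}"

    try:
        cur = conn.cursor()
        cur.execute(query)
        row = cur.fetchone()
        ok = row is not None and row[0] == 1
        return ok, ("ok" if ok else f"unexpected result: {row!r}")
    except Exception as exc:
        return False, f"query failed: {exc}"
    finally:
        # Always release the connection, even when the query fails.
        conn.close()


# Usage with an in-memory SQLite database as a stand-in for a real server.
ok, detail = db_is_reachable(lambda: sqlite3.connect(":memory:"))
print(ok, detail)
```

Separating "connect failed" from "query failed" in the detail string helps the incident note distinguish a network or credentials problem from a database that accepts connections but cannot serve queries.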
Qualifications
- 2–5 years in an IT operations, NOC, or SRE/DevOps engineering role.
- Kubernetes 101, Linux 101, Networking 101
- Understanding of cloud-ready applications
- Understanding of observability tools (Prometheus, Grafana, ELK, Splunk, etc.).
- Strong troubleshooting mindset and ability to follow structured workflows, e.g., 5 Whys and Fishbone diagrams.
- Monitor system health, alerts, dashboards, and logs across cloud and on-prem infrastructure.
- Ability to isolate whether a functional issue lies with the application or the platform.
- Execute standardized runbooks for incident resolution, deployments, and routine tasks.
- Perform initial triage of incidents and escalate to L2/L2+ as needed to mitigate the issue or get to a bypass.
- Document new issues, gaps in runbooks, and automation opportunities.
- Provide excellent communication to stakeholders during incidents.
- Support onboarding of new applications into the operations framework.
Keywords: Kubernetes, Prometheus, Grafana, ELK, Splunk, AWS, API
- Dice Id: 10330808
- Position Id: 97483-5195-
- Posted 2 hours ago
Company Info
VDart, headquartered in Atlanta, GA, is a global leader in digital talent solutions and IT staffing, delivering top technology professionals to businesses worldwide. With a strong presence across North America, Europe and Asia, we specialize in helping organizations navigate complex technology landscapes with the right expertise.
Through a strategic, client-focused approach, we have placed over 20,000 professionals across key industries and advanced technology solutions. Whether placing top talent in cutting-edge roles or providing strategic digital workforce solutions, our network of 4,000 specialists across 13 countries is committed to excellence, agility and impact.
Backed by 18 years of industry experience, we go beyond staffing to build long-term partnerships that accelerate digital transformation and drive sustained growth. Whether you need a technology partner to fuel innovation or specialized workforce solutions to maintain a competitive edge, VDart delivers the right people, skills and mindset to create a lasting impact in a digital-first world.