SRE Architect

Remote • Posted 2 hours ago • Updated 2 hours ago
Full Time
No Travel Required
$150 - $180/hr

Job Details

Skills

  • ELK
  • SRE
  • Kibana

Summary

Job Description: Senior SRE Consultant

Position Overview

We are seeking a highly skilled and experienced Senior Site Reliability Engineering (SRE) Consultant to join our team. The ideal candidate will have a strong background in ELK (Elasticsearch, Logstash, Kibana), KQL (Kusto Query Language), Dynatrace DQL (Dynatrace Query Language), and PagerDuty, along with a deep understanding of SRE principles and of monitoring and observability frameworks and their maturity models. The candidate should also have experience in application discovery, log streaming to ELK, and developing metrics to assess the maturity of application monitoring.


Key Responsibilities

  1. SRE Practices and Implementation:
    • Design, implement, and maintain SRE best practices to improve system reliability, scalability, and performance.
    • Develop and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical applications.
    • Drive incident management processes, including root cause analysis (RCA), post-mortems, and continuous improvement initiatives.
  2. Monitoring and Observability:
    • Develop and implement a Monitoring Maturity Framework to assess and improve the monitoring capabilities of applications and infrastructure.
    • Design and implement an Observability Framework to provide end-to-end visibility into system performance, availability, and reliability.
    • Define and implement metrics, logs, and traces to measure the maturity of monitoring and observability in applications.
    • Collaborate with development and operations teams to ensure proper instrumentation of applications and infrastructure.
  3. Log Management and Analysis:
    • Lead the discovery of applications and design strategies for log streaming from applications to the ELK stack.
    • Configure and optimize Logstash pipelines for efficient log ingestion and transformation.
    • Develop KQL queries and Kibana dashboards to analyze logs and provide actionable insights.
    • Ensure log data is structured, enriched, and indexed for efficient querying and visualization.
  4. Performance Monitoring and Optimization:
    • Utilize Dynatrace DQL to analyze application performance and identify bottlenecks.
    • Implement and maintain PagerDuty for incident alerting and on-call management.
    • Develop and maintain runbooks and playbooks for incident response and resolution.
  5. Monitoring and Observability Maturity Assessment:
    • Develop a maturity matrix to evaluate the current state of monitoring and observability across applications and infrastructure.
    • Identify gaps in monitoring and observability and create actionable roadmaps to improve maturity levels.
    • Provide recommendations for tools, processes, and practices to enhance monitoring and observability.
  6. Collaboration and Stakeholder Engagement:
    • Work closely with cross-functional teams, including developers, DevOps engineers, and IT operations, to implement SRE practices and monitoring solutions.
    • Conduct workshops and training sessions to improve the team's understanding of SRE, monitoring, and observability concepts.
    • Act as a subject matter expert (SME) for monitoring, observability, and incident management.

Key Skills and Qualifications

Technical Skills:

  1. ELK Stack (Elasticsearch, Logstash, Kibana):
    • Expertise in designing, configuring, and managing ELK pipelines for log ingestion, transformation, and visualization.
    • Experience in creating advanced Kibana dashboards for monitoring and troubleshooting.
    • Proficiency in querying Elasticsearch and optimizing search performance.
  2. KQL (Kusto Query Language):
    • Strong experience in writing and optimizing KQL queries for analyzing logs and metrics in Azure Log Analytics or Application Insights.
  3. Dynatrace DQL (Dynatrace Query Language):
    • Proficiency in using Dynatrace DQL to analyze application performance, identify bottlenecks, and create custom dashboards.
  4. PagerDuty:
    • Experience in configuring and managing PagerDuty for incident alerting, on-call scheduling, and escalation policies.
  5. Monitoring and Observability Frameworks:
    • Deep understanding of monitoring maturity frameworks and the ability to assess and improve monitoring capabilities.
    • Experience in designing and implementing observability frameworks (logs, metrics, traces) for distributed systems and microservices.
  6. Application Discovery and Log Streaming:
    • Experience in application discovery to identify log sources and dependencies.
    • Expertise in setting up log streaming pipelines from applications to ELK or other log management systems.
  7. SRE Core Concepts:
    • Strong understanding of SRE principles, including:
      • Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
      • Incident management, root cause analysis (RCA), and post-mortems.
      • Error budgets and their role in balancing reliability and innovation.
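
As an illustration of the error budget concept listed above, the following is a minimal sketch rather than a prescribed method. It assumes a request-based availability SLO; the function name, the 99.9% target, and the request counts are hypothetical and not taken from this posting.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an error budget for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical value below).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


# Hypothetical month: one million requests, 250 failures, 99.9% SLO.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Budget consumed: {budget['budget_consumed']:.0%}")
```

A team that has consumed most of its budget would typically slow feature releases in favor of reliability work, which is the balance between reliability and innovation the bullet refers to.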

Soft Skills:

  • Strong analytical and problem-solving skills.
  • Excellent communication and collaboration skills to work with cross-functional teams.
  • Ability to mentor and guide teams on SRE and observability best practices.
  • Strong organizational skills and the ability to manage multiple priorities.

Key Responsibilities Related to Monitoring Maturity and Observability

Monitoring Maturity Framework:

  • Develop and implement a Monitoring Maturity Framework to assess the current state of monitoring across applications.
  • Define key metrics to measure monitoring maturity, such as:
    • Percentage of applications with proper instrumentation.
    • Percentage of logs, metrics, and traces ingested into the observability platform.
    • Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents.
    • Coverage of critical business transactions in monitoring systems.
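
The metrics above could be computed along the following lines. This is a sketch under stated assumptions: the `Incident` record, its field names, and the sample values are hypothetical, and MTTR is measured here from incident start to resolution (some teams measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical incident record; field names are illustrative only.
    started: datetime
    detected: datetime
    resolved: datetime


def instrumentation_coverage(instrumented_apps: int, total_apps: int) -> float:
    """Percentage of applications with proper instrumentation."""
    return 100.0 * instrumented_apps / total_apps if total_apps else 0.0


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean Time to Detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean Time to Resolve: average of (resolved - started), in minutes."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60


incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 1, 10, 0)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15), datetime(2024, 1, 2, 16, 0)),
]
print(instrumentation_coverage(42, 60), mttd_minutes(incidents), mttr_minutes(incidents))
```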

Observability Framework and Maturity:

  • Design and implement an Observability Framework that includes:
    • Logs: Centralized log aggregation and analysis.
    • Metrics: Real-time monitoring of application and infrastructure performance.
    • Traces: Distributed tracing for end-to-end visibility into application workflows.
  • Define an Observability Maturity Model with levels such as:
    • Level 1: Basic monitoring (e.g., infrastructure metrics, basic logs).
    • Level 2: Application-level monitoring (e.g., custom metrics, error tracking).
    • Level 3: Full observability (e.g., logs, metrics, traces, and business KPIs).
  • Create a roadmap to improve observability maturity across the organization.
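
One possible way to score an application against the three levels above is sketched below. The signal names are hypothetical labels for the examples given in each level, and Level 0 (none of the Level 1 signals present) is an assumption added for illustration, not part of the model described here.

```python
# Signals required at each level; each level builds on the one below it.
LEVEL_1 = {"infrastructure_metrics", "basic_logs"}
LEVEL_2 = LEVEL_1 | {"custom_metrics", "error_tracking"}
LEVEL_3 = LEVEL_2 | {"traces", "business_kpis"}


def observability_level(signals: set[str]) -> int:
    """Return the highest maturity level whose required signals are all present."""
    if LEVEL_3 <= signals:
        return 3
    if LEVEL_2 <= signals:
        return 2
    if LEVEL_1 <= signals:
        return 1
    return 0  # Assumed "below Level 1" bucket.


# Hypothetical application that emits custom metrics but has no tracing yet.
app_signals = {"infrastructure_metrics", "basic_logs", "custom_metrics", "error_tracking"}
print(observability_level(app_signals))
```

Scoring each application this way would give the per-application gaps that a maturity roadmap could then prioritize.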

Application Discovery and Log Streaming:

  • Perform application discovery to identify all applications, services, and their dependencies.
  • Design and implement log streaming pipelines to ingest logs from applications into ELK or other observability platforms.
  • Ensure logs are structured, enriched, and tagged with metadata (e.g., application name, environment, region) for easy querying and analysis.
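
A structured, metadata-tagged log line of the kind described above might be produced as follows. This is a minimal sketch using Python's standard `logging` module; the `JsonFormatter` class and the values `checkout-api`, `prod`, and `us-east-1` are hypothetical, and shipping the output to ELK (for example through a log shipper or Logstash) is assumed to happen separately.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object tagged with static metadata."""

    def __init__(self, application: str, environment: str, region: str):
        super().__init__()
        self.metadata = {
            "application": application,
            "environment": environment,
            "region": region,
        }

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            **self.metadata,  # Enrichment fields used for filtering in Kibana.
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("checkout-api", "prod", "us-east-1"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment accepted")
```

Emitting JSON at the source keeps field names consistent across applications, so downstream pipelines need less parsing logic.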

 

 

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10330808
  • Position Id: 96006-5195-

Company Info

About VDart, Inc.

VDart, headquartered in Atlanta, GA, is a global leader in digital talent solutions and IT staffing, delivering top technology professionals to businesses worldwide. With a strong presence across North America, Europe and Asia, we specialize in helping organizations navigate complex technology landscapes with the right expertise.

Through a strategic, client-focused approach, we have placed over 20,000 professionals across key industries and advanced technology solutions. Whether placing top talent in cutting-edge roles or providing strategic digital workforce solutions, our network of 4,000 specialists across 13 countries is committed to excellence, agility and impact.

Backed by 18 years of industry experience, we go beyond staffing to build long-term partnerships that accelerate digital transformation and drive sustained growth. Whether you need a technology partner to fuel innovation or specialized workforce solutions to maintain a competitive edge, VDart delivers the right people, skills and mindset to create a lasting impact in a digital-first world.
