Site Reliability Engineer
Remote • Posted 4 hours ago • Updated 4 hours ago

BlueSky Resource Solutions
Job Details
Skills
- OpenTelemetry
- Kubernetes
- Observability
- Linux
- Windows
- Scripting
Summary
JOB DESCRIPTION
Site Reliability Engineer – Observability
Overview:
We are seeking a skilled Site Reliability Engineer III to join our Platform Engineering team, focusing on building and maintaining a comprehensive observability platform. In this role, you will ensure that our microservices, Kubernetes clusters, and cloud infrastructure are consistently reliable, high-performing, and scalable. You will work closely with cross-functional teams to provide deep insights into system health, performance, and availability through metrics, logs, and traces. This is a key role for those passionate about creating robust, proactive monitoring systems to support troubleshooting and optimization.
Responsibilities:
- Develop and sustain a resilient observability stack using tools such as Prometheus, Grafana, Loki, InfluxDB, Telegraf, OpenTelemetry, and more.
- Collaborate with DevOps, engineering, and product teams to understand monitoring requirements and deliver data-driven insights for better decision-making.
- Design and implement monitoring solutions across diverse environments, including Kubernetes clusters, microservices, AWS, Azure, on-prem vSphere setups, Windows and Linux systems, and networks built on Cisco, Juniper, Arista, and more.
- Aggregate and store logs, metrics, and traces from distributed systems to ensure comprehensive, end-to-end visibility.
- Develop alerting mechanisms based on KPIs and thresholds to support proactive performance monitoring and application uptime.
- Create and maintain dashboards to monitor system health, application performance, and resource utilization.
- Build solutions for monitoring key application metrics, including latency, request rates, error rates, and service dependencies.
- Support incident response efforts, collaborating with DevOps, SRE, and development teams to troubleshoot and resolve performance issues.
- Define and implement automated incident response workflows using observability data.
- Participate in post-incident analyses to identify root causes and continuously improve system reliability.
- Identify areas to improve observability practices, including better instrumentation, alerting, and reporting strategies.
- Document observability setups, best practices, and troubleshooting techniques.
- Stay informed on the latest observability technologies and industry trends to enhance the observability ecosystem.
- Provide regular reports and dashboards on system health and performance metrics to ensure transparency for stakeholders.
Preferred Qualifications:
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field (or equivalent practical experience).
- 3–5 years of experience in observability, monitoring, or related areas such as SRE, DevOps, or Platform Engineering.
- Proven experience in building, scaling, and managing observability solutions for complex infrastructure environments (Kubernetes, AWS, Azure, on-prem vSphere, and Windows/Linux).
- Proficiency with Git version control, including branch management, conflict resolution, and GitHub workflows, along with experience in CI/CD using GitHub Actions.
- Familiarity with VMware vSphere, cloud platforms (AWS, Google Cloud Platform, Azure), and containerized environments (Docker and Kubernetes).
- Relevant certifications (e.g., VMware Certified Professional - VCP, AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, Certified Kubernetes Administrator) are a plus.
Skills and Abilities:
- Deep understanding of observability principles, including metrics, logs, and traces.
- Strong experience with monitoring tools (Prometheus, Grafana, InfluxDB, Telegraf, etc.) and Kubernetes/containerized workloads.
- Knowledge of cloud-native technologies, Infrastructure as Code (IaC), and DevOps practices.
- Experience with Application Performance Management (APM) tools.
- Proficient in scripting and automation with languages like Python, Bash, or Go.
- Skilled in data visualization and reporting, using tools like Grafana and Kibana.
- Ability to troubleshoot complex issues using logs, metrics, and traces for effective incident response.
- Strong collaboration and communication skills for working with SRE, DevOps, and engineering teams.
- Problem-solving mindset with attention to detail in designing observability solutions.
- Adaptable to a fast-paced, evolving technical environment.
- Eagerness to stay up to date with trends in observability, cloud technologies, and distributed systems.
- Dice Id: 91017554
- Position Id: 8891654
Company Info
About BlueSky Resource Solutions
At BlueSky we are passionate about people, technology and helping businesses succeed. Our expert team includes recruiters and customer and talent support specialists with over 100 years of combined telecom and technology experience. We focus on talent acquisition, placement and solutions for businesses from Fortune 500 to small and medium-sized companies.
Our Mission
To be the best talent acquisition partner for our valued clients, offering unmatched service and delivery. We will achieve this through:
- Growing the BlueSky brand throughout the tech community by expanding our local presence into diverse, growing tech markets.
- Hiring the best internal employees who align with our culture.
- Fostering genuine relationships with our clients.
- Continually learning the industry and the local markets, and serving our candidates and clients better.
Our Values
At BlueSky Resource Solutions, our core values are based upon the tenets of trust, integrity and character. We value relationships above profits and believe that business grows best when built upon a foundation of relationship and trust. We will serve our customers and our talent partners with honesty and respect and have a policy of personal and professional accountability in all that we do. We are hard-working and reliable and take a solution-minded approach to every challenge.
Giving Back
We believe that each person has a responsibility to give of their time, talent and resources to one another. We are committed to giving a portion of all our revenue to people and organizations that are focused on serving those with the greatest needs. We give locally and globally and gladly work with customers and partners to multiply the impact we can make together.