Overview:
We are seeking an experienced, results-driven Senior Site Reliability Engineer (SRE) to join a
high-impact aviation technology project. This role requires a strong background in Java development,
cloud infrastructure, and site reliability best practices. The ideal candidate will bring a deep
understanding of system scalability, fault tolerance, and observability, along with hands-on production
support experience in Kubernetes-based environments running on Google Cloud Platform (GCP).
Core Responsibilities:
Design, implement, and maintain Java-based microservices ensuring high availability, scalability, and
performance.
Collaborate with development and infrastructure teams to support and optimize production
systems using SRE principles.
Manage and maintain Kubernetes clusters, including deployments, scaling, networking, and
storage.
Develop and maintain robust CI/CD pipelines using tools like GitLab CI/CD and Jenkins.
Build automation for system health monitoring, alerting, log aggregation, and recovery using tools
such as Prometheus, Datadog, Splunk, and Kiali.
Integrate and operate event-driven systems leveraging Kafka, ksqlDB, Spark Streams, and cluster
federation.
Deploy and manage service mesh technologies such as Istio and Anthos Service Mesh.
Use eBPF for advanced observability and system tracing.
Support containerized applications using Docker, and infrastructure provisioning with Terraform.
Administer storage solutions in Kubernetes environments using Portworx.
Required Qualifications:
10+ years of experience in SRE.
Strong proficiency in Java is mandatory.
Solid experience with scripting and programming languages such as Python, Go, and Bash.
Deep understanding of Linux/Unix operating systems and system-level troubleshooting.
Proven experience with Kubernetes, Docker, and infrastructure as code tools like Terraform.
Strong background in CI/CD, monitoring, alerting, and performance tuning.
Hands-on experience with virtualization platforms including VMware.
Familiarity with tools like Nginx Controller, Seesaw, and service mesh technologies.
Proficient in handling large-scale systems and capable of automating repetitive operational tasks.
Experience with functional or logic programming languages such as Haskell, OCaml, or Prolog is a plus.
Certification in Kubernetes is required.
Hands-on experience working in Google Cloud Platform environments is required.