Job Details
Local candidates preferred.
We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in Apache Flink, Kubernetes, and automation. The ideal candidate will be responsible for designing, deploying, and maintaining scalable, resilient systems, while ensuring high availability and performance in production environments. This role requires a solid background in distributed systems, container orchestration, and DevOps practices.
Key Responsibilities
- Design, implement, and maintain scalable Apache Flink deployments on Kubernetes.
- Develop automation tools and scripts to streamline deployment, monitoring, and maintenance of Flink jobs and infrastructure.
- Ensure high availability, scalability, and reliability of production systems.
- Collaborate with development and infrastructure teams to optimize application performance.
- Build and manage monitoring/alerting systems using Prometheus, Grafana, ELK stack, or similar tools.
- Work with cloud platforms (AWS, Google Cloud Platform, Azure) to design and manage infrastructure.
- Apply best practices for networking, security, and container orchestration.
- Troubleshoot complex production issues and drive root cause analysis.
- Contribute to CI/CD pipelines for deployment automation.
- Participate in on-call rotations to ensure uptime and reliability.
Required Skills & Qualifications
- Strong hands-on experience with Apache Flink in production environments.
- Expertise in Kubernetes (Helm, Operators, CRDs).
- Proficiency in scripting languages (Python, Bash, Go).
- Experience with monitoring & observability tools (Prometheus, Grafana, ELK, etc.).
- Solid understanding of cloud platforms (AWS, Google Cloud Platform, Azure).
- Strong knowledge of networking, security, and container orchestration.
- Familiarity with CI/CD pipelines and DevOps practices.
- Excellent problem-solving, debugging, and communication skills.