Job Details
- Candidates must have 13+ years of experience, including both lead and developer experience.
- We are looking for a highly skilled Automation Engineer with a strong systems engineering background to build scalable, resilient, and intelligent automation solutions. This role demands someone who thrives in aggressive environments, embraces complex challenges, and can operate effectively in uncertain situations. Someone with an automation-first mindset who drives efficiency, reduces manual toil, and enhances operational excellence using modern automation solutions will be a great fit.
- As part of a mission-critical team, you will work on automating infrastructure, integrating tools via APIs, improving observability, and implementing AIOps-driven solutions. If you're passionate about problem-solving, AI/ML in operations, and optimizing large-scale cloud environments, this role is for you.
Key Responsibilities
- Develop Python-based automation solutions to streamline on-prem and cloud infrastructure management on Google Cloud Platform and Kubernetes.
- Continuously identify and implement opportunities to enhance operational excellence.
- Build proactive and innovative solutions that can scale.
- Implement and manage configuration automation using Ansible (desirable).
- Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
- Enhance deployment reliability by implementing automated chaos strategies, failover mechanisms, and self-healing infrastructure.
- Develop proactive monitoring and alerting solutions using tools like Splunk, Google Cloud Platform Operations Suite, Grafana, and Prometheus.
- Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
- Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently under high loads.
- Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
- Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
- Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.
Required Skills & Experience
- Strong background in Systems Engineering with a focus on automation and reliability.
- Proficiency in Python (intermediate to expert level) for developing automation and integrations.
- Hands-on expertise with Kubernetes and cloud platforms (Google Cloud Platform or any major cloud).
- Experience integrating various tools and platforms via APIs and client libraries.
- Deep understanding of monitoring and alerting using Splunk, Google Cloud Platform Operations Suite, Grafana, and Prometheus.
- Ability to work in aggressive, high-stakes environments where reliability and uptime are critical.
- Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
- Experience with Ansible for infrastructure automation.
- Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
- Enthusiasm for AI/ML and AIOps, with a desire to apply them in automation and operations.