Job Details
We are seeking an AI Ops Engineer to design and implement intelligent automation that improves IT operations using LangChain, Elasticsearch, and large language models (LLMs). The ideal candidate will build autonomous agents for log analysis, anomaly detection, and incident response, with the goal of increasing operational efficiency, resolving issues proactively, and improving observability across the infrastructure.
Key Responsibilities
Design, develop, and implement LangChain-based autonomous agents for advanced log analysis, anomaly detection, and automated incident response.
Integrate Elasticsearch with LangChain to enable efficient extraction, summarization, and visualization of operational data, supporting observability and performance monitoring.
Automate end-to-end workflows for incident detection, alert generation, and remediation using Python and LLM-based agents.
Build and maintain integrations with Slack and Microsoft Teams to enable real-time alerting, collaborative incident management, and automated communication workflows.
Continuously monitor, evaluate, and optimize AI-driven operational solutions to improve accuracy, performance, and reliability.
Collaborate closely with DevOps, SRE, and IT Operations teams to understand pain points and design AI-powered solutions that reduce manual effort and improve operational efficiency.
Ensure that all solutions follow security, compliance, and reliability best practices.
Document solutions, integrations, and processes to facilitate maintenance and knowledge sharing.
Stay current with emerging trends in AI Ops, LangChain development, observability platforms, and large language model technologies.