| Develop machine learning and deep learning solutions for observability data to enhance IT operations. Implement time series forecasting, anomaly detection, and event correlation models. Integrate LLMs using prompt engineering, fine-tuning, and RAG for incident summarization. Build MCP client-server architecture for seamless integration with the Grafana ecosystem. The project also focuses on predicting emissions using ML models and enhancing observability through dynamic dashboards.
Project Scope:
- Develop accurate ML models for emissions prediction
- Improve Grafana dashboards to make them dynamic, interactive, and user-friendly
- Potential involvement in ML model development and refinement alongside UI enhancements
Key Deliverables:
- Predictive ML models for emissions forecasting
- Dynamic Grafana dashboards built with React that go beyond Grafana's standard static capabilities
|
Duties/Day to Day Overview | Duties / Day-to-Day Responsibilities:
Machine Learning & Model Development
- Design and develop ML/DL models for:
  - Time series forecasting (system load, CPU/memory usage)
  - Anomaly detection in logs, metrics, or traces
  - Event classification and correlation to reduce alert noise
- Select, train, and tune models using TensorFlow, PyTorch, or scikit-learn
- Evaluate model performance with precision, recall, F1-score, and AUC
ML Pipeline Engineering
- Build scalable training and inference pipelines (batch or streaming)
- Preprocess large observability datasets (Prometheus, Kafka, BigQuery)
- Deploy models using cloud-native services and tooling (Google Cloud Vertex AI, Azure ML, Docker/Kubernetes)
- Maintain retraining pipelines and monitor model drift
LLM Integration for Observability Intelligence
- Implement LLM-based workflows for summarizing incidents or logs
- Develop and refine prompts for GPT, LLaMA, or other LLMs
- Integrate Retrieval-Augmented Generation (RAG) with vector databases (FAISS, Pinecone)
- Control latency, hallucinations, and cost in production LLM pipelines
Grafana & MCP Ecosystem Integration
- Build or extend MCP client/server components for Grafana
- Surface ML outputs (anomaly scores, predictions) in dashboards
- Collaborate with observability engineers to integrate ML insights into monitoring tools
Collaboration & Agile Delivery
- Participate in daily stand-ups, sprint planning, and retrospectives
- Work with data engineers on pipeline performance and data ingestion
- Collaborate with frontend developers for real-time visualizations
- Partner with SRE and DevOps teams for alert tuning and feedback integration
- Translate ML outputs into actionable insights for platform teams
Testing, Documentation & Version Control
- Write unit, integration, and regression tests for ML code and pipelines
- Maintain documentation on models, data sources, assumptions, and APIs
- Use Git, CI/CD pipelines, and model versioning tools (MLflow, DVC)
|
| Top Requirements / Must-Have Skills:
- 6-8 years of experience designing and developing ML algorithms and DL applications for observability data (AIOps)
- Hands-on experience in time series forecasting, anomaly detection, and event classification
- Experience integrating LLMs with prompt engineering, fine-tuning, and RAG
- Working knowledge of MCP client and server development for Grafana or similar
- Programming: Python, R
- ML Frameworks: TensorFlow or PyTorch, scikit-learn
- Cloud Platforms: Google Cloud and/or Azure
- Front-End: React, Angular, Vue.js, or jQuery
- Design Tools: Figma, Adobe XD, or Sketch
- Databases: MySQL, MongoDB, or PostgreSQL
- Server-Side Languages: Python, Node.js, or Java
- Version Control: Git and related systems
- Testing: Familiarity with testing frameworks and methodologies
- Development Methodologies: Agile
- Soft Skills: Strong communication and collaboration
|