AI/ML Engineer (Local Candidates Only)

Overview

On Site
$60 - $70
Contract - W2

Skills

Adobe XD
Agile
Algorithms
AngularJS
Apache Kafka
Artificial Intelligence
CPU
Client/server
Cloud Computing
Collaboration
Communication
Continuous Delivery
Continuous Integration
Dashboard
Deep Learning
DevOps
Docker
Documentation
Forecasting
Git
Good Clinical Practice
Google Cloud
Google Cloud Platform
Grafana
IT Operations
Kubernetes
Machine Learning (ML)
Microsoft Azure
Microsoft Certified Professional
MongoDB
MySQL
Node.js
Project Scoping
Prompt Engineering
PyTorch
Python
React.js
Real-time
Regression Analysis
Sprint
Streaming
System Testing
TensorFlow
Testing
Time Series
Training
UI
UPS
Vector Databases
Vertex
Vue.js
Workflow
scikit-learn

Job Details

Project Description

Develop machine learning and deep learning solutions for observability data to enhance IT operations. Implement time series forecasting, anomaly detection, and event correlation models. Integrate LLMs using prompt engineering, fine-tuning, and RAG for incident summarization. Build MCP client-server architecture for seamless integration with the Grafana ecosystem. The project also focuses on predicting emissions using ML models and enhancing observability through dynamic dashboards.

Project Scope:

  • Develop accurate ML models for emissions prediction
  • Improve Grafana dashboards to make them dynamic, interactive, and user-friendly
  • Potential involvement in ML model development and refinement alongside UI enhancements

Key Deliverables:

  • Predictive ML models for emissions forecasting
  • Dynamic Grafana dashboards built with React that go beyond standard static capabilities

Duties/Day to Day Overview

Machine Learning & Model Development

  • Design and develop ML/DL models for:
    • Time series forecasting (system load, CPU/memory usage)
    • Anomaly detection in logs, metrics, or traces
    • Event classification and correlation to reduce alert noise
  • Select, train, and tune models using TensorFlow, PyTorch, or scikit-learn
  • Evaluate model performance with precision, recall, F1-score, and AUC
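
For candidates who want a concrete picture of this work, the following is a minimal sketch, not the team's actual method. It assumes a simple rolling z-score detector over CPU usage samples and scores it against hand-labeled anomalies with precision, recall, and F1. The function names, the sample data, the window size, and the threshold are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Flag points that sit more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            # Not enough history to estimate a spread yet.
            flags.append(False)
            continue
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            flags.append(x != mu)
        else:
            flags.append(abs(x - mu) / sigma > threshold)
    return flags


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical CPU usage samples (percent) with one labeled spike.
cpu_usage = [40, 42, 41, 43, 42, 95, 41, 42]
labels = [False, False, False, False, False, True, False, False]

predicted = rolling_zscore_anomalies(cpu_usage)
precision, recall, f1 = precision_recall_f1(labels, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice the detector would more likely be a trained model built with TensorFlow, PyTorch, or scikit-learn, and AUC would be computed from continuous anomaly scores rather than hard flags.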

ML Pipeline Engineering

  • Build scalable training and inference pipelines (batch or streaming)
  • Preprocess large observability datasets (Prometheus, Kafka, BigQuery)
  • Deploy models using cloud-native services (Google Cloud Platform Vertex AI, Azure ML, Docker/Kubernetes)
  • Maintain retraining pipelines and monitor model drift
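
As one illustration of what monitoring model drift can involve, here is a small sketch that compares a training-time feature distribution with live data using the Population Stability Index (PSI). PSI is an assumption made for this example; the posting does not name a drift metric. The function name, bin count, sample data, and the 0.2 alert threshold (a common rule of thumb) are likewise hypothetical.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample.

    Bin edges come from the reference sample; live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training data vs. shifted live data.
training = [float(x) for x in range(100)]
live = [x + 50.0 for x in training]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, not from the posting
score = population_stability_index(training, live)
needs_retraining = score > DRIFT_THRESHOLD
print(f"PSI={score:.3f} retrain={needs_retraining}")
```

A check like this could run on a schedule and trigger the retraining pipeline when the score crosses the chosen threshold.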

LLM Integration for Observability Intelligence

  • Implement LLM-based workflows for summarizing incidents or logs
  • Develop and refine prompts for GPT, LLaMA, or other LLMs
  • Integrate Retrieval-Augmented Generation (RAG) with vector databases (FAISS, Pinecone)
  • Control latency, hallucinations, and cost in production LLM pipelines
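
The retrieval step of a RAG workflow for incident summarization could look roughly like the sketch below. It is an illustration under stated assumptions: the embeddings are hard-coded toy vectors, the in-memory list stands in for a vector database such as FAISS or Pinecone, and the function names and prompt wording are hypothetical. No LLM is called; the sketch stops at building the prompt.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k entries whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item["vec"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, docs):
    """Ground the LLM in retrieved context to limit hallucinations."""
    context = "\n".join(f"- {d['text']}" for d in docs)
    return (
        "Summarize the incident using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index: hypothetical log snippets with made-up 3-d embeddings.
INDEX = [
    {"text": "CPU saturation on node-7 at 02:10", "vec": [0.9, 0.1, 0.0]},
    {"text": "Disk latency alert on db-2", "vec": [0.1, 0.9, 0.0]},
    {"text": "Deployment of api v2 at 02:05", "vec": [0.7, 0.0, 0.3]},
]

query_vec = [1.0, 0.0, 0.1]  # stand-in for an embedded user question
top_docs = retrieve(query_vec, INDEX, k=2)
prompt = build_prompt("What caused the CPU spike?", top_docs)
print(prompt)
```

Limiting `k` keeps the prompt short, which is one simple lever on both latency and cost.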

Grafana & MCP Ecosystem Integration

  • Build or extend MCP client/server components for Grafana
  • Surface ML outputs (anomaly scores, predictions) in dashboards
  • Collaborate with observability engineers to integrate ML insights into monitoring tools

Collaboration & Agile Delivery

  • Participate in daily stand-ups, sprint planning, and retrospectives
  • Work with data engineers on pipeline performance and data ingestion
  • Collaborate with frontend developers for real-time visualizations
  • Partner with SRE and DevOps teams for alert tuning and feedback integration
  • Translate ML outputs into actionable insights for platform teams

Testing, Documentation & Version Control

  • Write unit, integration, and regression tests for ML code and pipelines
  • Maintain documentation on models, data sources, assumptions, and APIs
  • Use Git, CI/CD pipelines, and model versioning tools (MLflow, DVC)

Top Requirements

(Must haves)

  • 6-8 years of experience designing and developing ML algorithms and DL applications for observability data (AIOps)
  • Hands-on experience in time series forecasting, anomaly detection, and event classification
  • Experience integrating LLMs with prompt engineering, fine-tuning, and RAG
  • Working knowledge of MCP client and server development for Grafana or similar
  • Programming: Python, R
  • ML Frameworks: TensorFlow or PyTorch, scikit-learn
  • Cloud Platforms: Google Cloud and/or Azure
  • Front-End: React, Angular, Vue.js, or jQuery
  • Design Tools: Figma, Adobe XD, or Sketch
  • Databases: MySQL, MongoDB, or PostgreSQL
  • Server-Side Languages: Python, Node.js, or Java
  • Version Control: Git and related systems
  • Testing: Familiarity with testing frameworks and methodologies
  • Development Methodologies: Agile
  • Soft Skills: Strong communication and collaboration
