Data Scientist

Overview

On Site
$70 - $80
Accepts corp to corp applications
Contract - Independent
Contract - W2
Contract - 6 Month(s)

Skills

Log Analysis
Natural Language Processing
Predictive Analytics
Splunk
Storage
Scalability
Data Science

Job Details

Project Engagement Overview

The primary goal of this engagement is to reduce triage time and make the incident resolution process more efficient. Triage currently takes five days on average; the objective is to bring it down to two to three days using data-driven approaches and AI models.

Key Objectives:

  • Log Analysis & Data Exploration: Assess available logs, identify gaps, and streamline data extraction.
  • AI/ML Implementation: Use predictive analytics and NLP to detect patterns and enhance triage efficiency.
  • Incident Classification & Prioritization: Automate classification, prioritize critical issues, and reduce manual effort.
  • Root Cause Analysis (RCA): Correlate logs, automate RCA, and create a knowledge base for faster troubleshooting.
  • Performance Metrics & Continuous Improvement: Define KPIs, optimize workflows, and ensure sustainable improvements.
  • Scalability & Future Readiness: Develop AI-driven solutions that integrate with long-term IT operations.
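
For a sense of what the incident classification objective can look like in practice, here is a minimal sketch, not the engagement's actual approach. It uses a small bag-of-words naive Bayes classifier written with the Python standard library as a stand-in for heavier NLP models. The training lines, the category labels, and the helper names (`train`, `classify`) are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled log lines; a real project would use historical incident data.
TRAIN = [
    ("connection timed out to gateway", "Network"),
    ("packet loss detected on interface eth0", "Network"),
    ("disk full on volume /data", "Storage"),
    ("volume latency high on disk array", "Storage"),
    ("cpu usage above 95 percent", "CPU Overload"),
    ("load average high cpu throttled", "CPU Overload"),
]


def tokenize(line):
    """Lowercase and split on whitespace; real log parsing would be more careful."""
    return line.lower().split()


def train(samples):
    """Count tokens per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(line, model):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(line):
            if token in vocab:  # ignore tokens never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("disk full on /var", model))
print(classify("cpu throttled on node", model))
```

A production system would likely swap this for a trained NLP model and evaluate it on held-out incidents, but the input and output shape (log line in, category out) stays the same.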

Relevant Experience

  1. AI/ML for Incident Triage Automation
     • Example Projects:
       • Built a log classification model (e.g., using NLP to categorize errors as "Network," "Storage," "CPU Overload").
       • Developed a priority scoring system (e.g., ML model predicting P0/P1 incidents based on historical data).
       • Reduced false positives in alerts using anomaly detection (e.g., Isolation Forest, LSTM for time-series logs).
     • Tools/Frameworks:
       • NLP: BERT, LogBERT, spaCy for log parsing.
       • ML: Scikit-learn, PyTorch for classification/regression.
       • LLMs: Fine-tuned models (e.g., Llama, GPT) for log summarization.
  2. Log Analysis & Correlation for Faster RCA
     • Example Experience:
       • Automated root cause suggestion by correlating logs (e.g., linking a "disk full" error to slow VM performance).
       • Created a knowledge graph of past incidents to accelerate troubleshooting (e.g., Neo4j, GraphML).
       • Used time-series clustering (e.g., K-Means, DBSCAN) to group similar incidents.
     • Tools: ELK Stack, Splunk, Grafana, Prometheus.
  3. Reducing Manual Triage Effort
     • Example Work:
       • Designed a chatbot/Slack integration to auto-respond to common incidents (e.g., "High CPU usage detected. Suggested fix: kill process X").
       • Implemented automated ticket routing (e.g., using ML to assign tickets to the right team).
  4. Key Metrics to Highlight (Impact)
     Candidates should quantify past achievements, such as:
     • "Reduced average triage time by 40% by automating log classification."
     • "Cut RCA time from 8 hours to 1 hour using a correlation engine."
     • "Lowered escalations by 30% with a priority-scoring model."
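
As a rough illustration of the log correlation experience described above (linking a "disk full" error to slow VM performance), here is a minimal sketch under stated assumptions. The events, host names, field names, and the `suggest_root_causes` helper are all hypothetical; in practice the events would come from a log platform such as Splunk or the ELK Stack.

```python
from datetime import datetime, timedelta

# Hypothetical, hand-written events for illustration only.
EVENTS = [
    {"time": "2024-01-01T10:00:00", "host": "vm-1", "message": "disk full on /data"},
    {"time": "2024-01-01T10:03:00", "host": "vm-1", "message": "slow VM performance"},
    {"time": "2024-01-01T10:04:00", "host": "vm-2", "message": "slow VM performance"},
    {"time": "2024-01-01T11:30:00", "host": "vm-1", "message": "slow VM performance"},
]


def suggest_root_causes(events, cause_text, symptom_text, window=timedelta(minutes=10)):
    """Pair each symptom with a cause seen earlier on the same host within `window`."""
    parsed = sorted(
        ({**e, "time": datetime.fromisoformat(e["time"])} for e in events),
        key=lambda e: e["time"],
    )
    causes = [e for e in parsed if cause_text in e["message"]]
    suggestions = []
    for symptom in (e for e in parsed if symptom_text in e["message"]):
        for cause in causes:
            gap = symptom["time"] - cause["time"]
            same_host = cause["host"] == symptom["host"]
            if same_host and timedelta(0) <= gap <= window:
                suggestions.append(
                    (symptom["host"], symptom["time"].isoformat(), cause["message"])
                )
                break  # one suggested cause per symptom is enough for this sketch
    return suggestions


suggestions = suggest_root_causes(EVENTS, "disk full", "slow VM")
for host, when, cause in suggestions:
    print(f"{host} at {when}: possible root cause -> {cause}")
```

Only the first vm-1 symptom is paired with a cause here: the vm-2 event is on a different host, and the later vm-1 event falls outside the ten-minute window. A real correlation engine would need tuned windows and many more signal types.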

Ideal Candidate Profiles (Strong Fit):

  • ML Engineer at an Observability Company
    • Worked at Splunk/Datadog/New Relic on log analytics.
    • Built models to detect anomalies in Kubernetes/cloud logs.
  • Data Scientist in IT Operations (ITOps/SRE)
    • Automated incident response at a cloud provider (AWS/Azure/Google Cloud Platform).
    • Used NLP to parse Jira/ServiceNow tickets for faster resolution.


About MSys Technologies - USA