Role: Lead Data Science Consultant (MLOps & API Performance Analytics)
Location: Denver, CO (Onsite)
Duration: Long-term
Note: Candidates must have 10+ years of experience.
About the Role:
We’re seeking a Lead Data Science Consultant with deep experience in Data Science, DevOps/MLOps, and Data Visualization, and a particular focus on API performance tracking, analytics, troubleshooting, predictive reliability, and pattern identification.
In this senior role, you will:
- Architect and deliver scalable, production‑grade data and ML solutions
- Lead cross‑functional initiatives to improve system and API reliability
- Build predictive models that forecast failures before they occur
- Guide teams through complex troubleshooting and performance optimization
- Influence technical strategy and engineering standards across the organization
You will partner with backend engineering, SRE, data engineering, and product teams to deliver high‑impact, data‑driven improvements to stability and performance.
Key Responsibilities:
Data Science & Analytics:
- Lead the design and execution of complex analytical frameworks to detect patterns, anomalies, and failure precursors.
- Conduct advanced exploratory data analysis (EDA) to uncover multi‑layer correlations across product, operational, and infrastructure datasets.
- Apply predictive modeling and machine learning to identify where system or API issues are most likely to occur.
- Use statistical process control and drift detection techniques to ensure ongoing operational stability (a minimal sketch follows this list).
- Build simulation and forecasting models to evaluate the impact of load changes, upgrades, or new features on system behavior.
- Establish and enforce best practices for reproducible research, model validation, and experimentation.
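To give a concrete flavor of this work, here is a minimal statistical‑process‑control sketch that flags points falling outside rolling 3‑sigma control limits. The error‑rate series, window size, and injected anomaly are all hypothetical, not a prescribed method.

```python
# Illustrative only: flag observations outside rolling 3-sigma control limits.
import numpy as np
import pandas as pd

def control_chart_violations(series: pd.Series, window: int = 30) -> pd.Series:
    """Return a boolean Series marking points outside mean +/- 3*std limits."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    upper, lower = mean + 3 * std, mean - 3 * std
    return (series > upper) | (series < lower)

# Synthetic example: a stable error rate with one injected anomaly.
rng = np.random.default_rng(0)
error_rate = pd.Series(rng.normal(0.01, 0.002, 200))
error_rate.iloc[150] = 0.05  # injected spike

flags = control_chart_violations(error_rate)
print(flags[flags])  # indices of flagged observations
```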
API & System Performance Analytics:
- Architect end‑to‑end observability solutions to track API latency, throughput, error rates, saturation, and SLO adherence.
- Build automated pipelines that ingest, aggregate, and model API telemetry logs and traces (OpenTelemetry, Prometheus, CloudWatch, Application Insights, etc.).
- Detect and explain leading indicators of API instability using anomaly detection, time‑series forecasting, and multivariate correlation.
- Provide engineering with “risk heatmaps” that identify high‑risk services, endpoints, or infrastructure components (a simplified scoring sketch follows this list).
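A simplified sketch of how such a risk table might be assembled from raw request telemetry. The input schema (endpoint, latency_ms, is_error), the 300 ms SLO threshold, and the score weighting are assumptions for illustration only.

```python
# Illustrative only: score per-endpoint risk from raw request telemetry.
import pandas as pd

SLO_LATENCY_MS = 300.0  # assumed latency SLO for the sketch

def endpoint_risk_scores(requests: pd.DataFrame) -> pd.DataFrame:
    """Aggregate telemetry into a simple per-endpoint risk table."""
    by_endpoint = requests.groupby("endpoint").agg(
        p99_latency_ms=("latency_ms", lambda s: s.quantile(0.99)),
        error_rate=("is_error", "mean"),
        traffic=("latency_ms", "size"),
    )
    # Naive composite score: SLO headroom consumed plus weighted error rate.
    by_endpoint["risk"] = (
        by_endpoint["p99_latency_ms"] / SLO_LATENCY_MS
        + 10 * by_endpoint["error_rate"]
    )
    return by_endpoint.sort_values("risk", ascending=False)
```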
Predictive Reliability & Proactive Mitigation:
- Design and implement predictive models that forecast outages, SLA breaches, or performance regressions.
- Develop automated early‑warning systems integrated into observability platforms.
- Architect proactive mitigation workflows, including:
  - Adaptive scaling rules
  - Automated rollback/canary strategies
  - Circuit breakers and fault‑tolerance improvements
  - Predictive alerting thresholds (see the forecasting sketch after this list)
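One hedged sketch of predictive alerting: a Holt‑Winters forecast (via statsmodels) of an hourly error‑count series, raising a flag when the forecast plus a crude uncertainty band crosses a budget. The series, horizon, and budget are hypothetical.

```python
# Illustrative only: forecast hourly error counts and flag projected breaches.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def forecast_breach(errors: pd.Series, horizon: int = 24,
                    budget: float = 100.0) -> bool:
    """Return True if forecast error counts exceed the budget within horizon."""
    model = ExponentialSmoothing(
        errors, trend="add", seasonal="add", seasonal_periods=24
    ).fit()
    forecast = model.forecast(horizon)
    # Crude upper band from in-sample residual spread.
    resid_std = np.std(model.resid)
    return bool((forecast + 2 * resid_std > budget).any())
```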
DevOps / MLOps:
- Architect and optimize CI/CD workflows for model deployment and data pipelines.
- Develop and maintain Docker/Kubernetes‑based services for training, inference, and analytics.
- Implement observability frameworks for ML workloads, ensuring traceability, logging, and performance monitoring.
- Maintain model registries, drift detection systems, and automated retraining strategies (a minimal drift check is sketched after this list).
- Use IaC (Terraform/Bicep/CloudFormation) to maintain secure, reproducible environments.
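A minimal sketch of the kind of drift gate that might sit in an automated retraining pipeline, using a two‑sample Kolmogorov–Smirnov test. The feature dictionaries and the p‑value cutoff are assumptions for the sketch.

```python
# Illustrative only: flag features whose live distribution has drifted
# from the training (reference) distribution.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference: dict[str, np.ndarray],
                     live: dict[str, np.ndarray],
                     alpha: float = 0.01) -> list[str]:
    """Return features whose live distribution differs from training."""
    flagged = []
    for name, ref_values in reference.items():
        result = ks_2samp(ref_values, live[name])
        if result.pvalue < alpha:
            flagged.append(name)
    return flagged  # a non-empty result could trigger automated retraining
```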
Data Engineering:
- Design and optimize scalable ETL/ELT pipelines across batch and streaming architectures.
- Develop transformations, semantic layers, and feature stores supporting both predictive analytics and operational monitoring.
- Integrate API event logs, telemetry, and performance metrics into high‑quality analytics datasets.
- Establish data quality SLAs and automated validation processes (an example check is sketched after this list).
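For example, an automated validation step might enforce checks like the following before a batch is published; the column names and thresholds here are hypothetical.

```python
# Illustrative only: lightweight data-quality checks for a telemetry batch.
import pandas as pd

def validate_telemetry(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations; empty means the batch passes."""
    failures = []
    if df["latency_ms"].isna().mean() > 0.01:
        failures.append("latency_ms null rate above 1%")
    if (df["latency_ms"] < 0).any():
        failures.append("negative latencies present")
    if not df["timestamp"].is_monotonic_increasing:
        failures.append("timestamps out of order")
    return failures
```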
Data Visualization & Decision Support:
- Build executive‑quality dashboards that communicate API health, KPIs, predictive signals, and operational trends.
- Create advanced visualizations: forecast bands, anomaly indicators, latency distributions, saturation patterns, and future‑state projections (a forecast‑band example follows this list).
- Standardize visualization frameworks, semantic metrics, and documentation across teams.
- Influence decision‑making by translating predictive findings into clear, concise recommendations.
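A minimal example of the forecast‑band visual described above, rendered with matplotlib; all series here are synthetic.

```python
# Illustrative only: observed series, dashed forecast, shaded uncertainty band.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(100)
observed = 50 + 0.2 * t + np.random.default_rng(1).normal(0, 2, 100)
horizon = np.arange(100, 124)
forecast = 50 + 0.2 * horizon
band = 2 * 2.0  # roughly +/- 2 standard deviations

fig, ax = plt.subplots()
ax.plot(t, observed, label="observed p95 latency (ms)")
ax.plot(horizon, forecast, linestyle="--", label="forecast")
ax.fill_between(horizon, forecast - band, forecast + band,
                alpha=0.3, label="forecast band")
ax.set_xlabel("hour")
ax.set_ylabel("latency (ms)")
ax.legend()
plt.show()
```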
Collaboration, Leadership & Technical Ownership:
- Serve as a technical leader across engineering, driving standards for reliability, observability, and data‑driven decision‑making.
- Mentor engineers and data scientists, conducting code reviews, design reviews, and knowledge‑sharing sessions.
- Lead post‑incident reviews and guide teams in building lasting solutions, not short‑term patches.
- Partner with product and engineering leadership to define roadmaps, set metrics, and prioritize improvements.
- Communicate complex technical topics to executives with clarity and measurable impact.
- Champion a culture of quality, automation, performance excellence, and continuous improvement.
Qualifications:
Required:
- Senior‑level proficiency in Python, SQL, and software engineering best practices (testing, design patterns, modular architecture).
- Extensive experience with observability data: logs, metrics, traces, service topology, and distributed systems behavior.
- Hands‑on experience with API performance tools (Grafana, Prometheus, Datadog, New Relic, Splunk, Azure Monitor, CloudWatch, etc.).
- Strong understanding of SLOs, SLIs, latency percentiles, error budgets, traffic analysis, and capacity planning (a worked error‑budget example follows this list).
- Deep experience with CI/CD pipelines, Git‑based workflows, and automated deployments.
- Strong skills in Docker/Kubernetes and cloud-native microservice environments.
- Expertise in data visualization tools (Power BI, Tableau, Looker) and Python visualization libraries.
- Experience with time-series modeling, anomaly detection, and forecasting (ARIMA, Prophet, Holt‑Winters, LSTM, etc.).
- Proven ability to troubleshoot complex, distributed system issues and drive long‑term resolutions.
- Demonstrated ability to own systems end‑to‑end through design, implementation, deployment, and maintenance.
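As a worked example of the error‑budget arithmetic referenced above (all figures hypothetical): a 99.9% availability SLO over a 30‑day window allows 43.2 minutes of downtime.

```python
# Illustrative only: remaining error budget under a 99.9% availability SLO.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60                 # 43,200 minutes in the window
budget_minutes = (1 - SLO) * WINDOW_MINUTES   # 43.2 minutes of allowed downtime

observed_downtime = 12.0                      # minutes of downtime so far (example)
remaining = budget_minutes - observed_downtime
burn_rate = observed_downtime / budget_minutes
print(f"Remaining budget: {remaining:.1f} min, burn: {burn_rate:.0%}")
```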
Preferred:
- Experience in system or API predictive modeling (e.g., Monte Carlo, reliability models).
- Experience building risk scoring systems for performance, stability, or reliability.
- Familiarity with distributed tracing tools (OpenTelemetry, Jaeger, Zipkin).
- Experience with SRE practices and incident‑response engineering.
- Experience with dbt, Airflow, Dagster, or Prefect for orchestration.
- Experience with MLflow, Databricks, SageMaker, Azure ML, or similar MLOps platforms.
- Ability to design automated mitigation strategies (predictive alerts, auto-scaling, failure‑prevention policies).
- Experience influencing cross‑team architecture decisions in large, complex systems.
Success Metrics:
- Fewer API incidents and faster detection and resolution (lower MTTD/MTTR).
- Improved system reliability: lower error rates, improved latency, higher throughput.
- Performance degradations predicted and prevented before they impact users.
- High adoption of dashboards, predictive models, and observability tools.
- Significant improvements in deployment velocity, ML reliability, and platform stability.
- Positive cross-team feedback on leadership, mentorship, and collaboration.