Job Title: Data Science Architect (MLOps & API Performance Analytics)
Location: Denver, CO
Duration: Long-term
About the Role:
We’re seeking a Senior Software Engineer/Architect with deep experience in Data Science, DevOps/MLOps, and Data Visualization, and a particular focus on API performance tracking, analytics, troubleshooting, predictive reliability, and pattern identification.
In this senior role, you will:
- Architect and deliver scalable, production‑grade data and ML solutions
- Lead cross‑functional initiatives to improve system and API reliability
- Build predictive models that forecast failures before they occur
- Guide teams through complex troubleshooting and performance optimization
- Influence technical strategy and engineering standards across the organization
You will partner with backend engineering, SRE, data engineering, and product teams to deliver high‑impact, data‑driven improvements to stability and performance.
Key Responsibilities:
Data Science & Analytics:
- Lead the design and execution of complex analytical frameworks to detect patterns, anomalies, and failure precursors.
- Conduct advanced EDA to uncover multi‑layer correlations across product, operational, and infrastructure datasets.
- Apply predictive modeling and machine learning to identify where system or API issues are most likely to occur.
- Use statistical process control and drift detection techniques to ensure ongoing operational stability (a minimal sketch follows this list).
- Build simulation and forecasting models to evaluate the impact of load changes, upgrades, or new features on system behavior.
- Establish and enforce best practices for reproducible research, model validation, and experimentation.
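For illustration, here is a minimal sketch of the statistical process control approach mentioned above, flagging latency samples that fall outside rolling three-sigma control limits. The metric, window size, and threshold are illustrative assumptions, not prescribed tooling:

```python
import numpy as np
import pandas as pd

def control_limit_breaches(latency_ms: pd.Series, window: int = 60, k: float = 3.0) -> pd.Series:
    """Flag points outside rolling k-sigma control limits (basic SPC)."""
    mu = latency_ms.rolling(window, min_periods=window).mean()
    sigma = latency_ms.rolling(window, min_periods=window).std()
    return (latency_ms > mu + k * sigma) | (latency_ms < mu - k * sigma)

# Illustrative data: steady latency with an injected regime shift.
rng = np.random.default_rng(0)
series = pd.Series(np.concatenate([
    rng.normal(120, 8, 500),   # baseline around 120 ms
    rng.normal(180, 8, 100),   # drifted regime around 180 ms
]))
breaches = control_limit_breaches(series)
print(f"{int(breaches.sum())} samples breached the control limits")
```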
API & System Performance Analytics:
- Architect end‑to‑end observability solutions to track API latency, throughput, error rates, saturation, and SLO adherence.
- Build automated pipelines that ingest, aggregate, and model API telemetry: logs, metrics, and traces (OpenTelemetry, Prometheus, CloudWatch, Application Insights, etc.).
- Detect and explain leading indicators of API instability using anomaly detection, time‑series forecasting, and multivariate correlation.
- Provide engineering with “risk heatmaps” to identify high‑risk services, endpoints, or infrastructure components.
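As one lightweight interpretation of the "risk heatmap" idea above, the sketch below blends normalized latency and error-rate signals into a per-endpoint risk score that a heatmap could render. Field names, weights, and values are hypothetical:

```python
import pandas as pd

# Hypothetical per-endpoint telemetry rollup; names and numbers are assumptions.
telemetry = pd.DataFrame({
    "endpoint":   ["/orders", "/search", "/login", "/checkout"],
    "p95_ms":     [850, 220, 140, 1900],
    "error_rate": [0.004, 0.001, 0.012, 0.031],
})

def risk_score(df: pd.DataFrame) -> pd.Series:
    """Blend min-max-normalized latency and error rate into a 0-1 risk score."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min())
    return 0.5 * norm(df["p95_ms"]) + 0.5 * norm(df["error_rate"])

telemetry["risk"] = risk_score(telemetry).round(2)
print(telemetry.sort_values("risk", ascending=False).to_string(index=False))
```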
Predictive Reliability & Proactive Mitigation:
- Design and implement predictive models that forecast outages, SLA breaches, or performance regressions.
- Develop automated early‑warning systems integrated into observability platforms.
- Architect proactive mitigation workflows:
  - Adaptive scaling rules
  - Automated rollback/canary strategies
  - Circuit breakers and fault‑tolerance improvements
  - Predictive alerting thresholds
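For example, a predictive alerting threshold of the kind listed above might be derived from a time-series forecast. The sketch below uses a Holt-Winters baseline via statsmodels and a three-sigma residual margin; the data and margin are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical hourly latency history (ms) with a mild upward trend.
rng = np.random.default_rng(1)
history = pd.Series(120 + 0.05 * np.arange(500) + rng.normal(0, 5, 500))

# Fit a Holt-Winters model; alert when the next observation exceeds
# the forecast plus three residual standard deviations.
model = ExponentialSmoothing(history, trend="add").fit()
resid_std = (history - model.fittedvalues).std()
threshold = model.forecast(1).iloc[0] + 3 * resid_std

observed = 190.0  # hypothetical incoming data point
if observed > threshold:
    print(f"ALERT: {observed:.1f} ms exceeds predicted ceiling {threshold:.1f} ms")
else:
    print(f"OK: {observed:.1f} ms within predicted ceiling {threshold:.1f} ms")
```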
DevOps / MLOps:
- Architect and optimize CI/CD workflows for model deployment and data pipelines.
- Develop and maintain Docker/Kubernetes‑based services for training, inference, and analytics.
- Implement observability frameworks for ML workloads, ensuring traceability, logging, and performance monitoring (see the tracking sketch after this list).
- Maintain model registries, drift detection systems, and automated retraining strategies.
- Use IaC (Terraform/Bicep/CloudFormation) to maintain secure, reproducible environments.
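The posting does not prescribe a specific MLOps stack; as one possibility, the sketch below records a candidate model's parameters and evaluation metrics with MLflow (one of the platforms named under Preferred qualifications). The experiment name, parameters, and metric values are placeholders:

```python
import mlflow

# Hypothetical experiment; assumes an MLflow tracking server or local store.
mlflow.set_experiment("api-latency-forecaster")

with mlflow.start_run():
    # Parameters and evaluation metrics for a candidate model version.
    mlflow.log_param("model_type", "holt_winters")
    mlflow.log_param("training_window_days", 30)
    mlflow.log_metric("mae_ms", 6.4)
    mlflow.log_metric("coverage_p95", 0.93)
```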
Data Engineering:
- Design and optimize scalable ETL/ELT pipelines across batch and streaming architectures.
- Develop transformations, semantic layers, and feature stores supporting both predictive analytics and operational monitoring.
- Integrate API event logs, telemetry, and performance metrics into high‑quality analytics datasets.
- Establish data quality SLAs and automated validation processes.
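As a concrete example of the automated validation mentioned above, a minimal data-quality check over a telemetry batch might look like the following; column names and thresholds are assumptions:

```python
import pandas as pd

# Hypothetical telemetry batch with two seeded quality problems.
batch = pd.DataFrame({
    "endpoint":   ["/orders", "/search", None, "/login"],
    "latency_ms": [231.0, 87.5, 120.2, -4.0],
})

def validate(batch: pd.DataFrame) -> list[str]:
    """Return human-readable violations of simple data-quality rules."""
    failures = []
    null_rate = batch["endpoint"].isna().mean()
    if null_rate > 0.01:
        failures.append(f"endpoint null rate {null_rate:.1%} exceeds 1% SLA")
    if (batch["latency_ms"] < 0).any():
        failures.append("negative latency values found")
    return failures

for problem in validate(batch):
    print("DATA QUALITY FAILURE:", problem)
```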
Data Visualization & Decision Support:
- Build executive‑quality dashboards that communicate API health, KPIs, predictive signals, and operational trends.
- Create advanced visualizations: forecast bands (sketched after this list), anomaly indicators, latency distributions, saturation patterns, and future‑state projections.
- Standardize visualization frameworks, semantic metrics, and documentation across teams.
- Influence decision‑making by translating predictive findings into clear, concise recommendations.
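To illustrate the forecast-band visualization mentioned above, here is a minimal matplotlib sketch that plots an observed latency series, a forecast, and an uncertainty band; all data is synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic observed series plus a forecast with an uncertainty band.
t_obs = np.arange(100)
observed = 120 + 0.2 * t_obs + np.random.default_rng(2).normal(0, 4, 100)
t_fc = np.arange(100, 130)
forecast = 120 + 0.2 * t_fc
band = 3 * 4.0  # hypothetical +/- 3-sigma band width

fig, ax = plt.subplots()
ax.plot(t_obs, observed, label="observed p95 latency (ms)")
ax.plot(t_fc, forecast, linestyle="--", label="forecast")
ax.fill_between(t_fc, forecast - band, forecast + band, alpha=0.3, label="forecast band")
ax.set_xlabel("hour")
ax.set_ylabel("latency (ms)")
ax.legend()
fig.savefig("latency_forecast.png")
```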
Collaboration, Leadership & Technical Ownership:
- Serve as a technical leader across engineering, driving standards for reliability, observability, and data‑driven decision‑making.
- Mentor engineers and data scientists, conducting code reviews, design reviews, and knowledge‑sharing sessions.
- Lead post‑incident reviews and guide teams in building lasting solutions, not short‑term patches.
- Partner with product and engineering leadership to define roadmaps, set metrics, and prioritize improvements.
- Communicate complex technical topics to executives with clarity and measurable impact.
- Champion a culture of quality, automation, performance excellence, and continuous improvement.
Qualifications:
Required:
- Senior‑level proficiency in Python, SQL, and software engineering best practices (testing, design patterns, modular architecture).
- Extensive experience with observability data: logs, metrics, traces, service topology, and distributed systems behavior.
- Hands‑on experience with API performance tools (Grafana, Prometheus, Datadog, New Relic, Splunk, Azure Monitor, CloudWatch, etc.).
- Strong understanding of SLOs, SLIs, latency percentiles, error budgets (a worked example follows this list), traffic analysis, and capacity planning.
- Deep experience with CI/CD pipelines, Git‑based workflows, and automated deployments.
- Strong skills in Docker/Kubernetes and cloud-native microservice environments.
- Expertise in data visualization tools (Power BI, Tableau, Looker) and Python visualization libraries.
- Experience with time-series modeling, anomaly detection, and forecasting (ARIMA, Prophet, Holt‑Winters, LSTM, etc.).
- Proven ability to troubleshoot complex, distributed system issues and drive long‑term resolutions.
- Demonstrated ability to own systems end‑to‑end through design, implementation, deployment, and maintenance.
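For context on the error-budget fluency required above, here is a minimal worked example of the arithmetic for a hypothetical 99.9% availability SLO over a 30-day window; all numbers are illustrative:

```python
# Error-budget arithmetic for a hypothetical 99.9% availability SLO
# over a 30-day window; all numbers are illustrative.
slo_target = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in the window
budget_minutes = (1 - slo_target) * window_minutes

observed_bad_minutes = 12.5            # hypothetical minutes of SLO-violating traffic
remaining = budget_minutes - observed_bad_minutes

print(f"Total budget: {budget_minutes:.1f} min")   # 43.2 min
print(f"Remaining:    {remaining:.1f} min "
      f"({remaining / budget_minutes:.0%} of budget left)")
```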
Preferred:
- Experience in system or API predictive modeling (e.g., Monte Carlo, reliability models).
- Experience building risk scoring systems for performance, stability, or reliability.
- Familiarity with distributed tracing tools (OpenTelemetry, Jaeger, Zipkin).
- Experience with SRE practices and incident‑response engineering.
- Experience with dbt, Airflow, Dagster, or Prefect for orchestration.
- Experience with MLflow, Databricks, SageMaker, Azure ML, or similar MLOps platforms.
- Ability to design automated mitigation strategies (predictive alerts, auto-scaling, failure‑prevention policies).
- Experience influencing cross‑team architecture decisions in large, complex systems.
Success Metrics:
- Reduction in API incidents and lower MTTD/MTTR.
- Improved system reliability: lower error rates, improved latency, higher throughput.
- Successful prediction and prevention of performance degradations before they occur.
- High adoption of dashboards, predictive models, and observability tools.
- Significant improvements in deployment velocity, ML reliability, and platform stability.
- Positive cross-team feedback on leadership, mentorship, and collaboration.