Sr ML Engineer

Overview

On Site

$110,000 - $120,000

Full Time

Unable to Provide Sponsorship

Skills

Artificial Intelligence

Machine Learning (ML)

Machine Learning Operations (ML Ops)

Pandas

Incident Management

Deep Learning

PyTorch

RESTful

Public Sector

Python

NumPy

Large Language Models (LLMs)

Language Models

Grafana

Modeling

ROOT

FedRAMP

Data Validation

Dashboard

Kubernetes

Fluency

Mathematics

Job Details

Role title
Senior Machine Learning Engineer

Location
Dallas Fort Worth metro area, Texas
This is a full time on site role. Candidates must already be based in the Dallas area or be willing to relocate and work on site from our Dallas office on a regular schedule.

Role overview

We are hiring a Senior Machine Learning Engineer who will own the design and delivery of production grade machine learning capabilities for an observability and operations intelligence platform that serves large enterprise and public sector customers. The platform supports workloads that follow strict security and compliance expectations including support for United States Federal government environments.

This is an on site and hands on role based in the Dallas area. The engineer will work on alert noise reduction, anomaly detection, semantic search, and incident root cause assistance and will take solutions from concept to production. The role involves close partnership with on site DevOps, data engineering, and product teams and requires comfort working with operational data such as alerts, logs, metrics, and incident tickets.

Core responsibilities

Machine learning solution design

Own the machine learning design for operations and reliability use cases including alert noise reduction, alert grouping and clustering, anomaly detection, incident root cause assistance, and cost or usage insights
Translate product requirements and reliability targets into clear machine learning problems with well defined metrics such as false positive rate, false negative rate, alert reduction goals, and impact on incident handling time
Select appropriate model families for each use case including supervised and unsupervised classical models, deep learning models, and language model based approaches where appropriate

Data and feature engineering for operational data

Work with data engineering to define and refine pipelines that ingest monitoring alerts, events, logs, metrics, and incident or ticket data from operations tools
Design features that capture temporal patterns, service and infrastructure relationships, and business criticality of systems and alerts
Implement data validation rules and data quality checks and collaborate on detection and handling of data drift and schema evolution

MLOps, deployment, and lifecycle management

Establish and maintain a modern machine learning operations workflow including experiment tracking, model registry, automated training, and automated deployment
Build production ready inference services such as synchronous application programming interfaces, batch scoring jobs, and streaming based scoring that integrate with backend services and user interfaces
Collaborate with on site DevOps on deployment patterns in secure environments including staging, canary releases, controlled rollouts, and rollback strategies
Define retraining strategies and schedules for models whose performance depends on changing alert distributions and operational patterns

Evaluation, monitoring, and safety

Design offline and online evaluation suites using historical alert and incident data including realistic scenarios for alert suppression and recommendation quality
Build dashboards that make model behaviour and impact transparent to product owners, operations teams, and technical leadership
Monitor model performance and drift in production and drive corrective actions when degradation occurs
Incorporate feedback from operators and subject matter experts into continual improvement cycles and where suitable into active learning workflows

Security, compliance, and public sector readiness

Work within the constraints of secure and regulated deployments including strict access control, logging, and change management practices
Ensure that experimentation and training environments that use sensitive or regulated data follow required security and compliance guidelines including expectations associated with United States Federal government workloads and FedRAMP style environments
Document model inputs, outputs, assumptions, and controls so that the design can be reviewed by security, compliance, and audit teams

Cross team collaboration

Coordinate shared machine learning components across multiple products such as embedding services, semantic search services, and evaluation frameworks
Participate in architecture and design discussions to promote reuse of patterns and components across the AI and data platform
Provide mentoring and technical guidance to junior engineers and data scientists where needed

On site collaboration in Dallas

Work primarily from the Dallas area office in close coordination with local engineering, product, and leadership teams
Participate in in person design sessions, whiteboard reviews, and incident reviews that require physical presence and real time collaboration
Help build a strong on site engineering culture through knowledge sharing, pair design, and support for local team members

Required qualifications

Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience
At least five years of hands on machine learning engineering experience with a strong record of shipping models into production systems
Strong programming skills in Python with fluency in libraries such as NumPy, pandas, and scikit-learn, and in at least one deep learning framework such as PyTorch or TensorFlow
Proven experience building and operating production machine learning systems including application programming interfaces, batch jobs, or streaming jobs and partnering with DevOps teams
Solid understanding of the full machine learning lifecycle including data preparation, feature engineering, model training, evaluation, deployment, and ongoing monitoring
Experience with at least one major cloud provider; Amazon Web Services is preferred, including familiarity with services such as managed container platforms, serverless functions, object storage, and managed machine learning platforms
Experience with machine learning operations practices and tools such as experiment tracking, model registry, automated training pipelines, and automated deployment pipelines
Strong skills in experiment design and interpretation including backtesting, A/B testing, and detailed error analysis
Excellent communication skills with the ability to explain model behaviour and trade offs to engineers, product managers, and operations stakeholders
Ability and willingness to work full time on site in the Dallas Fort Worth metro area

Preferred qualifications

Experience with observability and operations domains such as monitoring alerts, logs, metrics, traces, and incident ticket systems
Experience in environments that support United States Federal government or other highly regulated workloads with an understanding of security and compliance constraints
Background in large language models and retrieval augmented search or summarization for operational or knowledge management use cases
Familiarity with vector databases and semantic search platforms and experience building embedding based retrieval systems
Experience delivering anomaly detection, clustering, and time series modelling solutions at meaningful scale
Prior experience in a product engineering setting where the engineer owns design, implementation, and operational aspects of machine learning services

Technology stack

Candidates do not need experience with every item but should be comfortable with most of the following.

Languages and libraries
- Python, NumPy, pandas, scikit-learn
- PyTorch or TensorFlow or comparable deep learning tools

Cloud and infrastructure
- Major cloud provider such as Amazon Web Services or similar
- Experience with services such as managed Kubernetes or container platforms, serverless compute, object storage, and managed machine learning services

Machine learning operations and orchestration
- Tools for experiment tracking and model registry such as MLflow or managed registry services
- Workflow tools such as Apache Airflow or managed workflow orchestration services

Observability and integration
- Familiarity with logging and metrics solutions such as CloudWatch, Prometheus, Grafana, or similar
- Experience exposing machine learning services through RESTful application programming interfaces or similar integration methods

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
