Sr ML Engineer

Overview

On Site
$110,000 - $120,000
Full Time
Unable to Provide Sponsorship

Skills

Artificial Intelligence
Machine Learning (ML)
Machine Learning Operations (ML Ops)
Pandas
Incident Management
Deep Learning
PyTorch
RESTful
Public Sector
Python
NumPy
Large Language Models (LLMs)
Language Models
Grafana
Modeling
ROOT
FedRAMP
Data Validation
Dashboard
Kubernetes
Fluency
Mathematics

Job Details

Role title
Senior Machine Learning Engineer

Location
Dallas-Fort Worth metro area, Texas
This is a full-time, on-site role. Candidates must already be based in the Dallas area or be willing to relocate and work on-site from our Dallas office on a regular schedule.

Role overview

We are hiring a Senior Machine Learning Engineer who will own the design and delivery of production-grade machine learning capabilities for an observability and operations intelligence platform that serves large enterprise and public sector customers. The platform supports workloads subject to strict security and compliance expectations, including United States federal government environments.

This is an on-site, hands-on role based in the Dallas area. The engineer will work on alert noise reduction, anomaly detection, semantic search, and incident root cause assistance, taking solutions from concept to production. The role involves close partnership with on-site DevOps, data engineering, and product teams and requires comfort working with operational data such as alerts, logs, metrics, and incident tickets.


Core responsibilities

Machine learning solution design

  • Own the machine learning design for operations and reliability use cases including alert noise reduction, alert grouping and clustering, anomaly detection, incident root cause assistance, and cost or usage insights
  • Translate product requirements and reliability targets into clear machine learning problems with well-defined metrics such as false positive rate, false negative rate, alert reduction goals, and impact on incident handling time
  • Select suitable model families for each use case, including supervised and unsupervised classical models, deep learning models, and, where appropriate, language-model-based approaches

Data and feature engineering for operational data

  • Work with data engineering to define and refine pipelines that ingest monitoring alerts, events, logs, metrics, and incident or ticket data from operations tools
  • Design features that capture temporal patterns, service and infrastructure relationships, and business criticality of systems and alerts
  • Implement data validation rules and data quality checks and collaborate on detection and handling of data drift and schema evolution

MLOps, deployment, and lifecycle management

  • Establish and maintain a modern machine learning operations workflow including experiment tracking, model registry, automated training, and automated deployment
  • Build production-ready inference services such as synchronous application programming interfaces, batch scoring jobs, and streaming-based scoring that integrate with backend services and user interfaces
  • Collaborate with on-site DevOps on deployment patterns in secure environments including staging, canary releases, controlled rollouts, and rollback strategies
  • Define retraining strategies and schedules for models whose performance depends on changing alert distributions and operational patterns

Evaluation, monitoring, and safety

  • Design offline and online evaluation suites using historical alert and incident data including realistic scenarios for alert suppression and recommendation quality
  • Build dashboards that make model behaviour and impact transparent to product owners, operations teams, and technical leadership
  • Monitor model performance and drift in production and drive corrective actions when degradation occurs
  • Incorporate feedback from operators and subject matter experts into continual improvement cycles and, where suitable, into active learning workflows

Security, compliance, and public sector readiness

  • Work within the constraints of secure and regulated deployments including strict access control, logging, and change management practices
  • Ensure that experimentation and training environments that use sensitive or regulated data follow required security and compliance guidelines, including expectations associated with United States federal government workloads and FedRAMP-style environments
  • Document model inputs, outputs, assumptions, and controls so that the design can be reviewed by security, compliance, and audit teams

Cross-team collaboration

  • Coordinate shared machine learning components across multiple products such as embedding services, semantic search services, and evaluation frameworks
  • Participate in architecture and design discussions to promote reuse of patterns and components across the AI and data platform
  • Provide mentoring and technical guidance to junior engineers and data scientists where needed

On-site collaboration in Dallas

  • Work primarily from the Dallas area office in close coordination with local engineering, product, and leadership teams
  • Participate in in-person design sessions, whiteboard reviews, and incident reviews that require physical presence and real-time collaboration
  • Help build a strong on-site engineering culture through knowledge sharing, pair design, and support for local team members

Required qualifications

  • Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience
  • At least five years of hands-on machine learning engineering experience with a strong record of shipping models into production systems
  • Strong programming skills in Python with fluency in libraries such as NumPy, pandas, and scikit-learn, and at least one deep learning framework such as PyTorch or TensorFlow
  • Proven experience building and operating production machine learning systems including application programming interfaces, batch jobs, or streaming jobs and partnering with DevOps teams
  • Solid understanding of the full machine learning lifecycle including data preparation, feature engineering, model training, evaluation, deployment, and ongoing monitoring
  • Experience with at least one major cloud provider. Amazon Web Services is preferred, including familiarity with services such as managed container platforms, serverless functions, object storage, and managed machine learning platforms
  • Experience with machine learning operations practices and tools such as experiment tracking, model registry, automated training pipelines, and automated deployment pipelines
  • Strong skills in experiment design and interpretation including backtesting, A/B testing, and detailed error analysis
  • Excellent communication skills with the ability to explain model behaviour and trade-offs to engineers, product managers, and operations stakeholders
  • Ability and willingness to work full-time on-site in the Dallas-Fort Worth metro area

Preferred qualifications

  • Experience with observability and operations domains such as monitoring alerts, logs, metrics, traces, and incident ticket systems
  • Experience in environments that support United States Federal government or other highly regulated workloads with an understanding of security and compliance constraints
  • Background in large language models and retrieval-augmented search or summarization for operational or knowledge management use cases
  • Familiarity with vector databases and semantic search platforms and experience building embedding-based retrieval systems
  • Experience delivering anomaly detection, clustering, and time series modelling solutions at meaningful scale
  • Prior experience in a product engineering setting where the engineer owns design, implementation, and operational aspects of machine learning services

Technology stack

Candidates do not need experience with every item but should be comfortable with most of the following.

  • Languages and libraries
    • Python, NumPy, pandas, scikit-learn
    • PyTorch or TensorFlow or comparable deep learning tools
  • Cloud and infrastructure
    • Major cloud provider such as Amazon Web Services or similar
    • Experience with services such as managed Kubernetes or container platforms, serverless compute, object storage, and managed machine learning services
  • Machine learning operations and orchestration
    • Tools for experiment tracking and model registry such as MLflow or managed registry services
    • Workflow tools such as Apache Airflow or managed workflow orchestration services
  • Observability and integration
    • Familiarity with logging and metrics solutions such as CloudWatch, Prometheus, Grafana, or similar
    • Experience exposing machine learning services through RESTful application programming interfaces or similar integration methods
