AI-Factory Observability Principal

Remote in Remote, WA, US • Posted 3 hours ago • Updated 3 hours ago
Full Time
On-site
USD $173,600.00 - $180,000.00 per year

Job Details

Skills

  • Managed Services
  • IT Infrastructure
  • Product Engineering
  • ServiceNow
  • Service Delivery
  • Knowledge Base
  • IT Transformation
  • FOCUS
  • Productivity
  • Partnership
  • Data Centers
  • Instrumentation
  • Data Retention
  • Data Modeling
  • UPS
  • Real-time
  • Capacity Management
  • Recovery
  • SAP ERP
  • CheckPoint
  • Ethernet
  • Communication
  • Root Cause Analysis
  • Electrical Engineering
  • ROOT
  • Natural Language
  • Leadership
  • MEAN Stack
  • Mentorship
  • Technical Direction
  • Roadmaps
  • Dynatrace
  • Grafana
  • Mechanical Engineering
  • Electrical Systems
  • Distribution
  • System Monitoring
  • Linux
  • Computer Hardware
  • Management
  • IPMI
  • Network Monitoring
  • WMI
  • Training
  • Artificial Intelligence
  • Forecasting
  • Python
  • Splunk
  • Modbus
  • OPC
  • MQTT
  • GPU
  • InfiniBand
  • Remote Direct Memory Access
  • Network
  • Streaming
  • SNMP
  • NetFlow
  • Time Series
  • Data Lake
  • Warehouse
  • Machine Learning Operations (ML Ops)
  • Machine Learning (ML)
  • Optimization
  • Sustainability
  • Energy
  • Kubernetes
  • Cloud Computing
  • Computer Networking
  • SAP BASIS
  • Law
  • Innovation
  • Recruiting

Summary

Company Overview

Milestone Technologies is a global IT managed services firm that partners with organizations to scale their technology, infrastructure, and services to drive specific business outcomes such as digital transformation, innovation, and operational agility. Milestone is focused on building an employee-first, performance-based culture, and for over 25 years we have supported category-defining enterprise clients that are growing ahead of the market. The company specializes in providing solutions across Application Services and Consulting, Digital Product Engineering, Digital Workplace Services, Private Cloud Services, AI/Automation, and ServiceNow. Milestone's culture is built to provide a collaborative, inclusive environment that supports employees and empowers them to reach their full potential.

Our seasoned professionals deliver services based on Milestone's best practices and service delivery framework. By leveraging our vast knowledge base to execute initiatives, we deliver both short-term and long-term value to our clients and apply continuous service improvement to deliver transformational benefits to IT. With Intelligent Automation, Milestone helps businesses further accelerate their IT transformation. The result is a sharper focus on business objectives and a dramatic improvement in employee productivity. Through our key technology partnerships and our people-first approach, Milestone continues to deliver industry-leading innovation to our clients. With more than 3,000 employees serving over 200 companies worldwide, we are following our mission of revolutionizing the way IT is deployed.

Job Overview

AI-Factory Observability Principal

We are operating large-scale AI training and inference data centers, and we need an expert who can see the entire stack at once - from the chiller plant and switchgear to the GPU fabric and the Kubernetes scheduler. This role spans facilities/OT telemetry (cooling, power) and IT/AI infrastructure observability (compute, network, accelerators), unified by a single goal: complete, real-time, predictive visibility into how AI infrastructure consumes power, generates heat, moves data, and delivers compute.
You will design the observability platform that ingests signals from building and electrical systems, server and network fabrics, Kubernetes, and GPU/accelerator clusters - then apply AI/ML models on top of that telemetry to optimize utilization, predict failures, reduce energy cost, and surface insights operators can act on. You are equally comfortable reading a BACnet point list and a GPU NVLink topology, and you can explain to both facilities and platform teams why their data belongs in the same system.

Observability architecture & strategy
  • Define and own the end-to-end observability architecture covering metrics, logs, traces, and events across facilities and IT domains.
  • Establish standards for instrumentation, telemetry pipelines, data retention, cardinality management, and a unified data model that lets power, thermal, network, and compute signals be correlated in one place.
  • Design for scale: hundreds of thousands of time series per site, high-frequency power and thermal sampling, and GPU-cluster-level granularity.

Facilities & OT integration (BMS / EPMS)
  • Integrate Building Management System (BMS) telemetry - CRAC/CRAH units, chillers, cooling loops, airflow, temperature/humidity, leak detection - into the central observability platform (BACnet, Modbus, MQTT, OPC-UA).
  • Integrate Electrical Power Monitoring System (EPMS) data - switchgear, UPS, PDUs, busways, branch-circuit metering, generators - for real-time power draw, capacity, and quality monitoring (Modbus, DNP3, IEC 61850).
  • Build correlated views of power and thermal behavior against compute workload so operators understand cause and effect (e.g., a training job's effect on rack power and inlet temperatures).
  • Partner with facilities engineering on PUE, capacity planning, stranded-power recovery, and thermal optimization.

AI cluster & Kubernetes observability
  • Architect observability for AI/GPU clusters - accelerator utilization, memory pressure, thermals, ECC/Xid errors, power capping, and job-level efficiency (e.g., via NVIDIA DCGM, accelerator telemetry exporters).
  • Instrument Kubernetes environments running AI/ML workloads: cluster, node, pod, and workload metrics, scheduler behavior, GPU/accelerator allocation, and operator health.
  • Provide visibility into training and inference pipelines - throughput, queue depth, checkpoint behavior, straggler detection, and cost-per-token / cost-per-training-step metrics.
  • Surface noisy-neighbor, fragmentation, and underutilization patterns across multi-tenant clusters.

Network observability
  • Design monitoring for high-performance data center fabrics, including the AI back-end network (RDMA, InfiniBand and/or RoCE Ethernet) and front-end/management networks.
  • Capture fabric health, congestion, link errors, latency, and bandwidth utilization using streaming telemetry, SNMP, gNMI/gRPC, NetFlow/sFlow, and fabric managers (e.g., InfiniBand UFM).
  • Correlate network behavior with distributed training performance to diagnose collective-communication bottlenecks.

AI/ML-driven optimization & insight (AIOps)
  • Apply ML and AI models to the telemetry estate for anomaly detection, predictive maintenance, capacity forecasting, and automated root-cause analysis.
  • Build models and pipelines that recommend (or automate) actions: dynamic cooling and power optimization, workload placement, power capping under thermal/electrical constraints, and failure pre-emption.
  • Leverage LLMs and modern AI techniques to summarize incidents, accelerate root-cause investigation, query telemetry in natural language, and generate operator-facing insights from large volumes of logs and metrics.
  • Establish the feedback loop where observability data trains the models that, in turn, optimize the infrastructure being observed.

Cross-functional leadership
  • Act as the technical authority connecting facilities, network, platform, SRE, and AI/ML teams around a shared observability practice.
  • Define SLOs, alerting strategy, and on-call signal quality; drive down alert noise and mean-time-to-resolution.
  • Mentor engineers and set the technical direction for the observability roadmap.

Required Qualifications
  • 8+ years in infrastructure, SRE, observability, or data center engineering, with 3+ years in an architect or principal-level role.
  • Demonstrated experience designing and operating observability platforms at scale (metrics, logs, traces).
  • Expertise in Datadog, Dynatrace, Prometheus, and Grafana.
  • Hands-on experience integrating BMS and EPMS data, and a working understanding of data center mechanical and electrical systems (cooling topologies, power distribution, redundancy, capacity).
  • Strong systems monitoring background - Linux/server fleets, hardware health, baseboard management (IPMI/Redfish).
  • Strong network monitoring background, including high-performance / low-latency fabrics relevant to AI workloads. Expertise in SNMP and WMI.
  • Production experience with Kubernetes and observability of containerized workloads.
  • Experience operating or monitoring GPU / AI-accelerator clusters and understanding of distributed training/inference behavior.
  • Practical experience applying AI/ML models to operational data (anomaly detection, forecasting, or AIOps), and comfort using LLMs to derive insights and automate analysis.
  • Proficiency in at least one language for data/automation work (Python preferred), and infrastructure-as-code practices.

Preferred Qualifications
  • Experience with tooling such as OpenTelemetry, VictoriaMetrics/Thanos, Loki, Tempo, Elastic, Splunk.
  • Familiarity with OT/industrial protocols: BACnet, Modbus, OPC-UA, DNP3, IEC 61850, MQTT.
  • Familiarity with GPU/accelerator telemetry (NVIDIA DCGM and exporters) and InfiniBand/RDMA monitoring (e.g., UFM).
  • Experience with network telemetry: gNMI/OpenConfig streaming, SNMP, NetFlow/sFlow.
  • Experience with time-series data at high cardinality, stream processing, and data lake/warehouse patterns for telemetry.
  • Background in MLOps, model deployment, or building data/feature pipelines for operational ML.
  • Exposure to power and cooling optimization, PUE improvement, or sustainability/energy-efficiency initiatives.
  • Relevant certifications (e.g., data center facilities, Kubernetes/CKA, cloud or networking) are a plus.

Compensation

Estimated Pay Range: $173,600.00 - $180,000.00 USD/yr. We also offer comprehensive benefits options which vary depending on role, location, and employment type. The Talent Acquisition Partner can share more details about compensation or benefits for the role during the interview process.

Exact compensation and offers of employment are dependent on circumstances of each case and will be determined based on job-related knowledge, skills, experience, licenses or certifications, and location.

Our Commitment to Diversity & Inclusion

At Milestone we strive to create a workplace that reflects the communities we serve and work with, where we all feel empowered to bring our full, authentic selves to work. We know creating a diverse and inclusive culture that champions equity and belonging is not only the right thing to do for our employees but is also critical to our continued success.

Milestone Technologies provides equal employment opportunity for all applicants and employees. All qualified applicants will receive consideration for employment and will not be discriminated against on the basis of race, color, religion, gender, gender identity, marital status, age, disability, veteran status, sexual orientation, national origin, or any other category protected by applicable federal and state law, or local ordinance. Milestone also makes reasonable accommodations for disabled applicants and employees.

We welcome the unique background, culture, experiences, knowledge, innovation, self-expression and perspectives you can bring to our global community. Our recruitment team is looking forward to meeting you.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 90666340
  • Position Id: 13244
