At eBay, we're more than a global ecommerce leader - we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts.
Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work - every day. We're in this together, sustaining the future of our customers, our company, and our planet.
Join a team of passionate thinkers, innovators, and dreamers - and help us connect people and build communities to create economic opportunity for all.
About the team and role:
The Observability Platform team, part of eBay's core Site Reliability Engineering (SRE) organization, is dedicated to enhancing the reliability, performance, and efficiency of eBay's global platform. Our mission is to build intelligent, scalable tools and solutions that empower our SRE and domain engineering teams to maintain operational excellence.
We develop and maintain a suite of advanced, AI-driven systems by leveraging a wealth of operational data. Our real-time anomaly detection platform analyzes high-volume time-series metrics to predict and flag service degradations. We automate troubleshooting with a sophisticated root cause analysis engine that correlates metrics, events, logs, and traces to pinpoint failure origins. Furthermore, we are pioneering the use of GenAI to build an LLM-based agentic system to automate complex operational tasks, and a novel suite of AI-powered explainability tools to clarify the behavior of distributed systems.
What You Will Accomplish:
- Advance our anomaly detection capabilities, developing and productionalizing time-series models (both statistical and NN-based) on real-time metric streams.
- Enhance our automated root cause analysis engine by applying advanced correlation techniques and machine learning models to pinpoint the source of system failures from metrics, events, logs, and traces.
- Develop innovative GenAI/LLM-powered tools and drive the evolution of our existing solutions, such as an LLM-based agent for automating operations and a suite of AI-powered explainers for diagnosing complex system behaviors.
- Design and develop scalable data pipelines to process massive volumes of observability data that fuel all our ML/AI systems.
- Collaborate closely with SREs, platform architects, and domain engineering teams to understand their operational challenges and deliver solutions that improve system reliability and reduce mean time to resolution (MTTR).
- Own the entire software and model lifecycle, from initial design and prototyping to development, testing, deployment, and operational maintenance.
What You Will Bring:
- MS in Computer Science or a related field with 4+ years of relevant experience (or BS/BA with 6+ years) in Software Engineering or Machine Learning.
- Strong hands-on experience applying machine learning to operational data, including time-series analysis, anomaly detection, or NLP on system logs and traces.
- Proven experience with AI/GenAI, including hands-on work with Large Language Models (LLMs), prompt engineering, and building agentic systems or RAG (Retrieval-Augmented Generation) applications.
- Strong programming skills in languages like Python or Go.
- Hands-on experience with the operational side of machine learning, including model deployment, monitoring, and lifecycle management using tools like Kubernetes and Docker.
- Experience with ML frameworks like PyTorch, TensorFlow, or scikit-learn.
- Strong understanding of SQL and NoSQL databases.
- Experience with time-series or analytical databases (e.g., Prometheus, ClickHouse) is a significant plus.
- Experience with core components of modern observability stacks (e.g., metrics collection/storage like Prometheus; logging like Loki; tracing like Jaeger/Tempo; visualization like Grafana) and container orchestration platforms like Kubernetes is a significant plus.
- Excellent analytical, problem-solving, and communication skills.
The base pay range for this position is expected in the range below:
$147,200 - $196,500
Base pay offered may vary depending on multiple individualized factors, including location, skills, and experience. The total compensation package for this position may also include other elements, including a target bonus and restricted stock units (as applicable) in addition to a full range of medical, financial, and/or other benefits (including 401(k) eligibility and various paid time off benefits, such as PTO and parental leave). Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
If hired, employees will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
Additional Details
eBay is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, and disability, or other legally protected status. If you have a need that requires accommodation, please contact us at We will make every effort to respond to your request for accommodation as soon as possible. View our accessibility statement to learn more about eBay's commitment to ensuring digital accessibility for people with disabilities. It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.
We use cookies to enhance your experience and may use AI tools for administrative tasks in the hiring process. To learn how we handle your personal data and use AI responsibly, please visit our Talent Privacy Notice, Privacy Center, and AI Hiring Guidelines.