Overview
On Site
USD 100,600.00 - 199,000.00 per year
Full Time
Skills
Pivotal
Startups
Data Centers
Reliability Engineering
Problem Management
Customer Experience
Customer Facing
Process Engineering
Recovery
Accountability
Real-time
Decision-making
Engineering Support
Performance Management
Preventive Maintenance
Project Management
Production Support
Migration
ROOT
Continuous Improvement
Documentation
FOCUS
Customer Support
Partnership
Scalability
Collaboration
Management
C
C++
C#
Java
JavaScript
Python
Computer Science
Information Technology
Mechanical Engineering
Electrical Engineering
Aerospace
Data Science
Cyber Security
Software Engineering
Network Engineering
Systems Engineering
Crisis Management
Communication
Clarity
Strategic Thinking
Analytical Skill
Team Leadership
Microsoft Windows
Linux
Cloud Architecture
Disaster Recovery
Business Continuity Planning
Performance Tuning
Grafana
Splunk
New Relic
CHAOS
High Availability
IaaS
Incident Management
Artificial Intelligence
Machine Learning (ML)
Amazon Web Services
DevOps
Microsoft Azure
Google Cloud Platform
Google Cloud
ITIL
Integrated Circuit
Internal Communications
IC
SAP BASIS
PASS
Cloud Computing
Legal
Recruiting
Microsoft
Job Details
Are you passionate about cloud computing, obsessed with customer experience, and driven to resolve complex issues under pressure? Do you thrive in high-stakes, live environments and want to play a pivotal role in ensuring the reliability of Microsoft's cloud platform? If so, the Azure Customer Experience (CXP) team has the opportunity for you.
Microsoft Azure is one of the most exciting and strategic products at Microsoft, powering mission-critical workloads for enterprises, governments, and startups around the world. Azure delivers on-demand, hyper-scale infrastructure and platforms via Microsoft's global data centers, enabling customers to build, host, and scale their applications with confidence.
The Customer Reliability Engineering (CRE) team within Azure CXP is a top-level pillar of Azure Engineering. It is responsible for world-class live-site management, customer reliability engagements, and modern customer-first experiences for scale, and it drives deep customer insights and empathy into the broader Azure Engineering organization. Our "no dead-ends" philosophy ensures that every customer, regardless of size or scale, can realize their full potential through the Microsoft Cloud.
We are seeking a Service Engineer II to handle live-site issues and problem management and to drive the customer reliability space. This role is accountable for enhancing the customer experience across Azure, including First Party Services. The ideal candidate will demonstrate strong breadth in managing complex, highly available services, paired with deep technical expertise in Azure Core Services and their interdependencies. You will work closely with Customers, First Parties, Customer Support, Livesite, and Engineering teams to deliver critical, customer-facing features. Success in this role requires the ability to influence and collaborate across many Azure servicing teams to ensure customer needs are met.
In addition, this role includes on-call responsibilities for managing and resolving complex multi-service outages. It requires the ability to remain effective under pressure, apply broad technical and analytical skills, and coordinate seamlessly with internal service teams and stakeholders. Strong communication skills, both written and verbal, are essential. You will also lead the evolution of Azure's Incident Management practice through Post-Incident Reviews, process development, and system automation. By leveraging telemetry and metrics, you will identify and drive platform-wide improvements with global impact. You'll be the single point of command and control during high-severity incidents, orchestrating cross-functional engineering, operations, and communications to minimize impact, restore services quickly, and protect the trust of our global customer base.
This role offers a unique opportunity to make an immediate impact and improve systems at scale.
Responsibilities:
To be successful in this role, you must have a great track record of customer compassion, an engineering mindset, an innate aptitude for agility, and technical excellence in software engineering. You will collaborate closely with Engineering/PM to ensure the availability and performance of Live Site and the satisfaction of our customers.
- Lead and manage high-severity incidents across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
- Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
- Contribute to the design of vNext architecture for cloud infrastructure services, based on customer and first-party engagements.
- Engage in major production triage efforts and work with different teams to identify the root cause of highly impactful or complex issues as required; identify product gaps and work with Product teams to bridge them.
- Partner closely with software developers, product managers, architects, and infrastructure teams to drive delivery of sustainable and reusable design solution patterns, ensuring non-functional production support requirements are adopted early in the migration/deployment.
- Promote a customer-first culture by prioritizing availability, reliability, and platform trust in every response.
- Participate in the on-call rotation.
- Analyze customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
- Drive continuous improvement of the Azure platform by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability.
- Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies.
- Identify and advocate for customer self-service capabilities, improved documentation, and scalable solutions that empower customers to resolve common issues independently.
- Design and drive adoption of incident response playbooks, mitigation levers, and operational frameworks aligned to real-world support scenarios and strategic customer needs.
- Contribute to the design of next-generation architecture for cloud infrastructure services with a focus on reliability and strategic customer support outcomes.
- Build and maintain cross-functional partnerships, ensuring alignment across engineering, business, and support organizations.
- Be data-driven and results-focused, using metrics to evaluate incident response effectiveness and platform health.
- Bring an engineering mindset to operational challenges, balancing agility, scalability, and technical excellence.
- Exhibit strong cross-team collaboration, an engineering mindset, and results-oriented execution under pressure.
Qualifications:
Required Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Data Science, Cybersecurity, or a related field AND 2+ years of technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls;
- OR equivalent hands-on experience.
- Proven experience in cloud operations, incident & crisis management, or large-scale systems engineering ideally within platforms such as Azure, AWS, or Google Cloud Platform.
- Demonstrated experience in 24/7/365 enterprise environments, managing mission-critical services.
- Demonstrated experience implementing AI-driven solutions and automation, with proficiency in one or more programming/automation languages (e.g., C, C++, C#, Java, JavaScript, Python) or equivalent expertise.
- ITIL, SRE, or other industry-recognized technical and operational certification.
Preferred Qualifications:
- Master's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls
- OR Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 5+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls
- OR equivalent experience.
- 1+ year(s) technical experience working with large-scale cloud or distributed systems.
- 3+ years of demonstrated experience in incident management or crisis management for critical, high-severity incidents in high-availability, distributed environments.
- Experience with Service Engineering principles and practices, with exceptional command-and-control communication skills: able to drive clarity and direction with customers, internal Microsoft stakeholders, and third-party vendors during ambiguity and chaos.
- Demonstrated ability to make decisions quickly and think strategically in high-pressure situations, applying analytical skills, team leadership, and collaboration with peer teams and internal engineering partners.
- Strong knowledge of Windows or Linux platforms and developer tools (desired), and the ability to diagnose cloud computing platform issues, identify patterns, and implement AI-driven approaches for overall platform stability and reliability.
- Deep understanding of cloud architecture patterns, high availability, disaster recovery, business continuity, and performance tuning for platform services.
- Familiarity with monitoring and observability tools (e.g., Azure Monitor, Watch Dog, Grafana, Prometheus, Datadog, Splunk, New Relic).
- Exposure to chaos engineering, fault injection, or high availability architecture.
- AI/ML experience (beginner to intermediate):
  - Familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes.
  - Experience using AI-powered tools for incident analysis, log correlation, or predictive alerting.
  - An understanding of the challenges and risks associated with AI/ML systems in a production environment.
- Certifications:
  - Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, Google Cloud Platform Professional Cloud Architect).
  - Certifications in ITIL, SRE, or other relevant frameworks.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
Microsoft will accept applications and process offers for these roles on an ongoing basis.
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Every day, our customers stake their business and reputation on our cloud. You can help #AzCXP provide our customers with the world-class cloud services they need to succeed. #azcre
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.