We are looking for an experienced Enterprise Observability & AIOps Architect to design, modernize, and lead enterprise-scale observability ecosystems spanning applications, infrastructure, cloud platforms, databases, and operational workflows.
The ideal candidate will combine strategic architectural leadership with strong hands-on expertise in modern observability and AIOps platforms, driving operational excellence and AI-driven transformation across large enterprise environments.
Key Responsibilities
Enterprise Observability Architecture
- Lead enterprise-wide observability assessments across applications, infrastructure, cloud, and databases
- Define current-state and target-state architectures
- Drive monitoring rationalization and tool consolidation strategies
- Establish standards for telemetry, tagging, service identity, alerting, and dashboards
- Define scalable operating models aligned with SRE, ITSM, and platform engineering
Application Observability
- Architect solutions for APM, distributed tracing, logs and metrics, RUM, and synthetic monitoring
- Define SLI/SLO-driven monitoring strategies
- Improve service visibility, dependency mapping, and telemetry quality
- Build observability for microservices, APIs, Kubernetes, and both Azure-native and legacy systems
Infrastructure & Platform Observability
- Design observability across cloud, middleware, databases, and batch systems
- Analyze alert duplication, routing inefficiencies, and monitoring overlaps
- Define event correlation, severity models, enrichment, and ownership frameworks
AIOps & Intelligent Operations
- Design and implement:
  - Event correlation & noise reduction
  - Intelligent alert prioritization
  - Anomaly detection & predictive insights
  - Root cause analysis & contextualization
- Enable AI-driven workflows for:
  - Incident reduction
  - MTTR optimization
  - Automated remediation
ITSM & Operational Integration
- Integrate observability tools with ServiceNow, CMDB, and incident workflows
- Define monitoring-to-incident processes and governance frameworks
- Establish KPI-driven operational maturity models
Governance & Blueprinting
- Develop enterprise standards, onboarding blueprints, and playbooks
- Define reusable observability patterns and reference architectures
- Establish Day-1 observability models for new services
Required Experience
- 15+ years in observability, SRE, platform engineering, AIOps, or production operations
- Proven experience in enterprise observability transformation and monitoring rationalization
- Strong background in hybrid cloud and distributed systems
- Experience working with executives, enterprise architects, and platform teams
- Deep understanding of incident management and reliability engineering
Technical Expertise
Observability Tools (Must-Have)
- Dynatrace
- Azure Monitor
- Azure Application Insights
- Azure Log Analytics
- LogicMonitor
- ManageEngine
Preferred Tools
- Splunk, ELK / OpenSearch
- Prometheus / Grafana
- Datadog, New Relic
- BigPanda, PagerDuty
Core Skills
- Event correlation & alert engineering
- Distributed tracing & topology mapping
- AIOps & intelligent operations
- Cloud telemetry & monitoring
- Kubernetes & microservices observability
- ITSM (ServiceNow) integration
- SRE principles & operational governance
Cloud & Platform
- Azure, AWS
- Kubernetes & container platforms
- APIs & integrations
- Middleware & distributed systems
Mandatory Skills
- Enterprise Observability Architecture
- OpenTelemetry framework design
- APM & cloud monitoring expertise
- ITSM integration & event correlation
- AIOps & anomaly detection
- Kubernetes & microservices monitoring
- Alert optimization & noise reduction
- SLI/SLO framework design
- Integration architecture & governance