Role: Observability Architect
Location: Atlanta, GA (onsite, 5 days per week)
Project: Azure to Splunk Migration
Job description:
We are seeking an experienced Observability Architect to design, implement, and mature enterprise-wide observability capabilities across hybrid on-premises and cloud environments.
The ideal candidate has deep expertise with log aggregation, metrics, tracing, and application performance monitoring technologies, and can drive automation, standardization, and best-practice adoption at scale.
This role will be a key influencer in shaping the organization's observability strategy, ensuring end-to-end system visibility, performance, and reliability.
Key Responsibilities
Observability Architecture & Strategy
- Develop and maintain the enterprise observability reference architecture, covering logs, metrics, traces, events, dashboards, and alerts.
- Lead the design and implementation of observability solutions that support hybrid, multi-cloud, and on-premises environments.
- Establish standards, governance, and reusable frameworks for telemetry generation, ingestion, correlation, storage, and visualization.
- Drive continuous improvement of monitoring maturity, integrating data-driven insights and AI-based analytics where applicable.
Log Aggregation & Monitoring Solutions
- Architect and administer large-scale log aggregation platforms such as Splunk, supporting both on-prem and cloud deployments.
- Define and automate ingestion pipelines, parsing logic, index strategies, role-based access, and performance tuning.
- Implement configuration management and infrastructure-as-code (IaC) practices for repeatable deployment and scaling of observability tools.
Application & Network Performance Monitoring
- Deploy, configure, and optimize APM solutions such as AppDynamics, Dynatrace, or equivalent platforms.
- Integrate application tracing, synthetic monitoring, real-user monitoring (RUM), and business transaction analytics.
- Support and enhance Network Performance Monitoring (NPM) capabilities to ensure end-to-end visibility across distributed systems.
Cloud-Native & Modern Monitoring
- Leverage cloud-native monitoring tools across AWS, Azure, or Google Cloud Platform (e.g., CloudWatch, Azure Monitor, Google Cloud Operations Suite).
- Guide teams in instrumenting microservices, serverless functions, containers, and Kubernetes clusters using OpenTelemetry and modern telemetry standards.
- Partner with infrastructure, application, and SRE teams to ensure high availability, resilience, and performance.
Automation & AI-Driven Engineering
- Build automated workflows for alert tuning, anomaly detection, dashboards, and telemetry enrichment.
- Explore and integrate AI/ML-based observability features such as predictive analytics, signal correlation, and automated root-cause analysis.
- Advocate for automation-first practices and reduction of operational toil.
Required Qualifications
- 5+ years of hands-on experience with enterprise-scale log aggregation platforms, including architecture, deployment, and administration of tools like Splunk across on-prem and cloud environments.
- 5+ years of experience using automated configuration management and IaC tools (e.g., Ansible, Terraform, GitOps frameworks).
- 2+ years of experience with APM tools such as AppDynamics or Dynatrace, including end-to-end application visibility and performance diagnostics.
- Experience with Network Performance Monitoring tools and methodologies.
- Strong understanding of cloud infrastructure and cloud-native monitoring technologies (AWS, Azure, Google Cloud Platform).
- Familiarity with OpenTelemetry, distributed tracing, and service mesh observability.
- Expertise in designing dashboards, KPIs, and alerting strategies that align with business SLIs/SLOs.
- Experience collaborating with DevOps, SRE, cloud engineering, and application teams in large enterprises.
Preferred Qualifications
- Experience implementing AI/ML-driven observability capabilities (e.g., anomaly detection, auto-baselining, correlation engines).
- Knowledge of container ecosystems and orchestration platforms (Kubernetes, AKS/EKS/GKE).
- Experience working with event-driven architectures and microservices environments.
- Strong scripting or programming skills (Python, PowerShell, Bash, etc.).
- Relevant certifications (e.g., Splunk Architect, Dynatrace Professional, cloud certifications).