Job Overview:
We are looking for an experienced Enterprise IT Monitoring and Observability Architect to join our direct client at their Manassas, VA location. This strategic role reports directly to the Head of Service & Data Architecture and is responsible for the technical design and evolution of real-time monitoring, alerting, and reporting solutions across enterprise infrastructure and applications.
The successful candidate will work closely with infrastructure and application engineers, as well as operations teams, to continuously enhance the scope, quality, and effectiveness of monitoring solutions. This role plays a vital part in enabling efficient event management and delivering an optimal user experience.
Key Responsibilities:
Lead the detailed technical design and implementation of monitoring and observability solutions for enterprise IT systems, including networks, servers, storage, databases, and applications
Define and maintain standards, architectural patterns, and best practices for monitoring technologies across the organization
Develop and manage technical roadmaps and ensure alignment with organizational goals
Drive the integration of monitoring solutions with incident management, analytics, and reporting platforms
Partner with multi-disciplinary teams to deliver innovative services, tools, and applications that enhance operational efficiency
Improve the user experience and streamline event and incident management through enhanced observability
Produce high-quality technical documentation and ensure knowledge sharing across IT teams
Leverage modern monitoring tools, including OpenTelemetry, and support SRE principles
Apply design thinking and systems thinking methodologies to drive innovation and sustainability in monitoring strategies
Required Skills & Experience:
4+ years of experience in Enterprise Architecture, with a focus on monitoring and observability
Proficiency in OpenTelemetry, including implementation and integration in distributed systems
Strong background in Site Reliability Engineering (SRE) practices and principles
Hands-on experience designing and deploying enterprise-wide monitoring platforms
Experience with Red Hat Enterprise Linux (RHEL), shell scripting, and automation using scripting languages (e.g., Python, Perl)
Familiarity with monitoring tools such as Prometheus, Grafana, Splunk, or similar
Deep understanding of IT infrastructure domains (network, servers, storage, databases)
Proven ability to develop and maintain architectural roadmaps and technical designs
Experience using TOGAF or similar enterprise architecture frameworks
Skilled in technical documentation and the communication of complex solutions to diverse audiences
Demonstrated expertise in applying Design Thinking and Systems Thinking methodologies
Nice to Have:
Prior experience in financial services or high-availability enterprise environments
Experience with CI/CD pipelines and familiarity with DevOps culture
Experience monitoring cloud or hybrid infrastructure