Job Description:
The Senior Staff Engineer for NPE Observability is the preeminent technical strategist for Client's global telemetry fabric. In this senior contract role, you will bridge the gap between high-scale distributed software and global network hardware, driving the architectural standards for our most complex data-intensive initiatives. You will own the technical integrity of our streaming pipelines, ensuring telemetry from the global fleet is ingested, normalized, and processed with sub-second latency. As a master of our tech stack (Java, Kafka, Postgres, Grafana), you will define the "Gold Standard" for technical excellence within the Network Platform Engineering (NPE) group.
Responsibilities
Architectural Strategy & Technical Vision
• Core Stack Evolution: Architect and optimize our primary ingestion and storage engines utilizing Java and PostgreSQL, ensuring high availability and performance at scale.
• Real-Time Data Orchestration: Lead the design of high-throughput messaging systems using Apache Kafka to handle trillions of telemetry points with sub-second latency.
• Unified Visibility: Define the global standard for observability visualization in Grafana, building complex, high-performance dashboards that aggregate data from diverse telemetry sources.
High-Scale Engineering & Innovation
• Stream Processing Mastery: Architect massively parallel processing pipelines and stateful stream processing frameworks (utilizing tools like Apache Flink) to enable real-time anomaly detection.
• Advanced R&D: Evaluate and prototype emerging technologies such as Model-Driven Telemetry (MDT), as well as ClickHouse and Thanos for long-term metric storage and high-cardinality data analysis.
• Technical Roadmap Ownership: Drive the engineering team toward key milestones, ensuring the code we ship aligns with the 3–5 year long-term NPE vision.
Reliability & Systemic Leadership
• Service Standards: Define and monitor critical SLI/SLO metrics (e.g., P95 response times) to ensure the platform maintains world-class performance and global ITIL compliance.
• Incident Authority: Serve as the senior point of contact for complex root-cause analysis, identifying architectural weaknesses in the Java/Kafka/Postgres stack to prevent future outages.
• Stakeholder Synthesis: Translate complex product requirements into deep technical specifications, managing relationships with both internal software teams and external network vendors.
Required Qualifications & Experience
• Tenure: 10+ years of professional experience in software engineering and distributed systems.
• Domain Expertise: 5+ years of experience specifically in large-scale network engineering, telemetry, or observability platforms.
• Java Expertise: Mastery of Java for building high-performance, scalable backend services.
• Data & Messaging: Deep expertise in PostgreSQL (schema design and tuning) and Apache Kafka (cluster architecture and stream management).
• Visualization: Expert-level proficiency in Grafana for creating enterprise-level observability dashboards.
• Large-Scale Systems: Proven experience with Prometheus, Thanos, or ClickHouse, and experience working within a structured Agile/Scrum environment.
• Education: Bachelor's or Master's degree in Computer Science or a related technical field.