Distributed Performance Engineer

Overview

On Site
$100,000 - $110,000
Full Time

Skills

Root Cause Analysis
Switches
OnPrem
Grafana
performance
DataDog
Distributed Performance Engineer
Tier 4
Splunk
Routers
AWS
Riverbed

Job Details

Distributed Performance Engineer (DPE)

As a Tier 4 LoB-Facing Internal Consulting Engineer specializing in performance, you will conduct in-depth forensic network and application studies of production issues that have already been investigated by numerous cross-technical Tier 1, 2 and 3 Teams yet continue to negatively impact client revenue, profit and/or reputation.

Independently diagnose the root cause of the production performance issue, relying principally on network packet analysis of business transactions as they cross distributed systems spanning global OnPrem and Public Cloud Data Center Tiers, to identify the component (software and/or infrastructure) responsible for the failure.

Author, publish & present detailed formal Findings, Analysis, and Recommendation Reports to the Product Owner and Senior Leadership responsible for the failed component.

For Infrastructure-based failures, lead the OnPrem and/or Public Cloud (AWS, Azure, Google) Infrastructure Team (compute, network and/or storage) for remediation of the failed component (e.g., firewall, circuit, disk).

For Software-based failures, lead the Application Software Development Team, whether internal or a vendor, for remediation of the failed software module (e.g., application code, SQL, Messaging).

1. Interview Customers & Review Prior Incident Reports

Conduct detailed interviews with the customers to gather information about the poor end user experience and/or slow business transactions.

Review incident reports previously written by the infrastructure and/or application teams to understand the initial findings and reported issues.

2. Gather Forensic Evidence Collected by Other Teams

IP Addresses for each endpoint

Architecture diagrams of the systems

Application and infrastructure logs

Data Center network diagrams

Performance reports detailing the incident.

3. Create Network Topology

Identify Data Center hosting each processing tier.

Identify the in-between networks (primary & redundant)

Create a new topology map of the flows for the application.

4. Network Packet Collection Points

Research network taps & ERSPAN sessions that can collect the traffic of interest.

Configure packet brokers to collect traffic.

Collect TCP/IP network packets from strategic network locations where the relevant network traffic traverses the client Backbone Network and/or connects to Public Cloud Platforms (AWS, Azure, Google Cloud Platform).

5. Findings and Analysis Report

Publish a detailed Findings & Analysis Report for review by the Product Owner of the Processing Tier responsible for the slowdown. The report should include:

An overview of the profiling results.

Identified bottlenecks and their locations.

Any anomalies or irregularities observed during packet analysis.

6. Collaboration for Root Cause Analysis

Collaborate closely with the Product Team to perform a deeper analysis of the specific circumstances leading to the performance issue.

Identify the root cause of the problem and recommend remediation steps.

7. Remediation and Resolution

Work with the responsible Product Owner to implement the recommended remediation steps.

Ensure that the issue is resolved and the Service Level Agreement (SLA) thresholds are met once again.

Continue monitoring and adjusting as necessary to maintain performance standards.
