Overview
On Site
Full Time
Skills
PB
Backbone.js
Business Operations
IaaS
Conflict Resolution
Problem Solving
Network Security
Enterprise Storage
Design Architecture
Optimization
Evaluation
SDS
Storage Architecture
Lifecycle Management
Firmware
System Integration
Linux Kernel
MDS
Network
Ethernet
Root Cause Analysis
Shell
Scripting
Real-time
Log Analysis
Incident Management
Scalability
Innovation
High Availability
Virtual Machines
Research
Amazon S3
Swift
Benchmarking
Encryption
Access Control
Auditing
Data Management
Regulatory Compliance
PCI DSS
System On A Chip
HIPAA
Mentorship
Collaboration
Cloud Security
Open Source
Data Storage
Production Support
Management
Linux
Thread
Computer Networking
TCP/IP
VLAN
Border Gateway Protocol
OSPF
Load Balancing
Remote Direct Memory Access
LVM
OSD
Caching
Tcpdump
Wireshark
Debugging
Python
Shell Scripting
Configuration Management
Ansible
Puppet
Terraform
Docker
Kubernetes
LXC
Stacks Blockchain
Grafana
Cloud Computing
Microsoft Azure
Google Cloud Platform
Google Cloud
OpenStack
Amazon Web Services
Cloud Storage
Ceph
Replication
Backup
Disaster Recovery
Artificial Intelligence
Machine Learning (ML)
Storage
Performance Tuning
GPU
Computer Hardware
Computer Science
Computer Engineering
Information Systems
Software Engineering
Web Content
WCAG
Assistive Technology
Accessibility
Job Details
Position Summary...
We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role is focused on hands-on architecture, operations, performance tuning, and troubleshooting of multi-petabyte scale storage clusters in mission-critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, with the ability to diagnose complex issues spanning hardware, kernel, and storage layers.
This role requires a technical leader and subject matter expert (SME) who can architect resilient storage platforms, resolve production incidents under pressure, and drive innovation in private cloud storage at scale.
What you'll do...
THIS ROLE DOES NOT PROVIDE SPONSORSHIP
Our Private Cloud Storage Engineering team is responsible for building and operating some of the largest-scale Ceph storage clusters in the industry, supporting mission-critical applications across Walmart's global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high-performance storage for business operations, customer platforms, and innovation workloads.
The team works at the intersection of distributed storage systems, Linux internals, networking, and cloud infrastructure, solving some of the toughest technical challenges in scalability, performance, and resilience. We embrace a culture of deep technical expertise, hands-on problem solving, and continuous learning, while driving adoption of automation, observability, and next-generation storage technologies.
As part of this team, you will collaborate with world-class engineers across compute, networking, security, and cloud to design end-to-end solutions, shape the future of enterprise storage platforms, and contribute to the broader open-source storage community.
Key responsibilities:
Scale-Out Distributed Storage Architecture
- Extensive experience in the design, architecture, and management of scale-out distributed storage systems in large production environments.
- Demonstrated expertise in system performance tuning, data durability optimization (replication and/or erasure coding), and lifecycle management for petabyte-scale data deployments.
- Proven ability to drive the evaluation, selection, and deployment of best-of-breed software-defined storage (SDS) solutions that meet demanding SLAs for latency, throughput, and availability.
Ceph Storage Architecture & Operations
- Architect, deploy, and manage large-scale clusters across multiple production sites.
- Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
- Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
- Own end-to-end lifecycle management of storage clusters, including OS/Kernel tuning, firmware upgrades, and hardware integration.
Large Scale OpenStack Platform Experience
- Deep, hands-on architectural experience with the design, deployment, and management of large-scale OpenStack platforms in production environments.
- Expert-level knowledge of core OpenStack storage services, specifically Cinder (Block Storage), Swift (Object Storage), and/or the integration of Ceph or similar distributed storage solutions.
- Experience must include data center networking design, high-availability design and multi-region/multi-site OpenStack deployments.
Performance, Debugging & Troubleshooting
- Identify, diagnose, and resolve performance bottlenecks across Ceph/scale-out storage solutions, the Linux kernel, networking, and hardware layers.
- Utilize tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging.
- Perform deep analysis of OSD, MON, MDS, and RGW performance and optimize cluster parameters.
- Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage.
- Drive root cause analysis (RCA) for critical production issues and provide long-term remediation.
Automation & Observability
- Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
- Develop observability views for real-time monitoring of IOPS, throughput, latency, and cluster health.
- Automate alerting, log analysis, and anomaly detection for proactive incident response.
Scalability & Innovation
- Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
- Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads.
- Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning.
- Evaluate next-gen hardware (NVMe SSDs, RDMA NICs, high-density HDDs) and their impact on storage performance.
- Evaluate next-gen server SKUs, perform benchmarking, and compare options to select the most appropriate storage hardware.
Security & Compliance
- Implement encryption (at-rest and in-transit), access controls, and audit mechanisms for secure data management.
- Ensure compliance with enterprise and regulatory standards (e.g., PCI-DSS, SOC, HIPAA).
Collaboration & Mentorship
- Act as the technical SME for storage within the organization, mentoring junior engineers.
- Collaborate with cross-functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
- Partner with hardware and software stakeholders and the Ceph community to drive adoption of best practices and contribute to open-source improvements.
Qualifications
- 15-18 years of experience in scale-out distributed storage systems, infrastructure engineering, and Linux systems.
- 10+ years of hands-on experience with Ceph, including architecture, operations, and large-scale production support.
- Proven experience managing clusters at petabyte scale with high performance and resiliency requirements.
- Strong expertise in:
  - Linux Systems: Kernel tuning, cgroups, systemd, process/thread debugging.
  - Networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, jumbo frames.
  - Storage Internals: LVM, OSD design, BlueStore, RocksDB tuning, journaling, caching layers.
  - Performance Tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF.
  - Debugging: Core dump analysis, kernel crash dump (kdump), system call tracing.
- Proficiency in Python and Shell scripting for automation and tooling.
- Hands-on experience with configuration management (Ansible, Salt, Puppet) and IaC tools like Terraform.
- Knowledge of containerization (Docker, Kubernetes, LXC) and their storage backends (CSI, RBD).
- Experience with monitoring and logging stacks (Prometheus, Grafana, ELK, OpenObserve).
- Familiarity with cloud platforms (Azure, Google Cloud Platform, OpenStack, AWS) and hybrid cloud storage.
Preferred Skills
- Contributions to the Ceph community or other distributed storage projects.
- Experience with large-scale data replication, backup, and disaster recovery strategies.
- Exposure to AI/ML workloads on scale-out storage and performance optimization for GPU clusters.
- Familiarity with storage and network acceleration technologies (NVMe-oF, SPDK, DPDK).
Minimum Qualifications...
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 5 years' experience in software engineering or related area.
Option 2: 7 years' experience in software engineering or related area.
Preferred Qualifications...
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years' experience in software engineering or related area.
We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart's accessibility standards and guidelines for supporting an inclusive culture.
Primary Location...
1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.