Job Role: Senior Linux System Administrator
Location: Irving, TX
Job Description:
Role Summary
Seeking a hands-on L3 Linux Administrator to own stability, availability, and performance across large-scale Linux environments. The role demands deep troubleshooting skills, strong exposure to Veritas Clustering (VCS), SAN/NAS storage, and close coordination with data center teams for hardware incidents. The ideal candidate will work independently, lead incident resolution, and improve BAU operations through automation and best practices.
________________________________________
Key Responsibilities
Linux Administration (L3)
- Administer and troubleshoot RHEL, Oracle Linux, CentOS, SUSE in production.
- Diagnose complex OS issues: kernel panics, boot/GRUB failures, filesystem corruption, resource contention (CPU/RAM/I/O/Network), SELinux/AppArmor denials.
- Patch and upgrade OS at scale; manage package repositories and kernel updates with rollback strategies.
- Implement and audit security hardening (firewalld/iptables, CIS benchmarks, PAM, sudo, SSH, auditd).
- Manage system services (systemd), cron/timers, users/groups, sudoers, and system-wide configuration.
Veritas Cluster Server (VCS/InfoScale)
- Install, configure, and administer VCS for HA/DR across multi-node clusters.
- Create/maintain service groups, resources, dependency trees; configure LLT/GAB, I/O fencing, and quorum.
- Integrate VxVM/VxFS (disk groups, volumes, file systems) with application failover.
- Conduct DR drills, failover testing, and root cause analysis for cluster events.
Storage: SAN & NAS
- Liaise with storage teams for LUN provisioning, zoning, masking; validate multipathing (DM Multipath/PowerPath).
- Build and maintain filesystems (ext4/xfs/VxFS), mount policies, fstab and autofs.
- Manage NFS/CIFS/SMB exports/mounts, permissions, quotas, and locking issues.
- Troubleshoot pathing, latency, and I/O bottlenecks using OS, HBA, and array-side telemetry.
Data Center & Hardware Coordination
- Coordinate with DC teams for racking/stacking, cabling, console access, and physical triage.
- Diagnose hardware faults (CPU, memory, NIC/HBA, disks/RAID/SSD, backplane, PSU, fans) and firmware/BIOS alignment.
- Raise and track OEM tickets (Dell/HP/IBM/Cisco), manage RMA, and oversee replacements and post-fix validation.
BAU Operations & Incident Management
- Act as L3 escalation for P1/P2 incidents; drive bridge calls and lead technical recovery.
- Perform deep-dive log analysis (journald, syslog, dmesg, audit logs, application logs).
- Create/run SOPs/runbooks, maintain KB articles, and implement problem management (RCA, corrective actions).
- Support on-call rotation and scheduled maintenance windows (change management, CAB, MOPs).
Networking (Host-Level)
- Troubleshoot TCP/IP, routing, VLANs/bonding/teaming, MTU, host firewalls, DNS/DHCP, NTP/Chrony.
- Collaborate with network teams on L2/L3 connectivity, load balancers, and firewall rules.
________________________________________
Required Experience & Skills
- 8-12+ years in enterprise Linux system administration with proven L3 ownership.
- Strong hands-on experience with VCS (Veritas Cluster Server), VxVM, VxFS, and HA/DR patterns.
- Solid SAN/NAS experience: LUNs, zoning, multipath, NFS/SMB.
- Demonstrated success working independently and leading during critical incidents.
- Advanced troubleshooting: kernel, performance, storage, and cluster-level failures.
- Scripting proficiency (Bash; Python preferred). Familiar with Ansible.
- Familiarity with VMware/KVM and basic cloud concepts (AWS/Azure, Linux in the cloud).
- Strong documentation discipline (SOPs, MOPs, RCAs) and ITIL-aligned processes.