Linux Admin

Overview

Remote

50 - 60

Contract - W2

Contract - 6 Month(s)

No Travel Required

Unable to Provide Sponsorship

Skills

Linux

Linux Administration

ProVision

Network Security

Microsoft Azure

Computer Hardware

Computer Networking

Firmware

Firewall

Continuous Integration and Development

Continuous Integration

DevOps

Encryption

Continuous Delivery

Amazon Web Services

Cloud Computing

Google Cloud Platform

Good Clinical Practice

Servers

Remote Direct Memory Access

Reliability Engineering

Regulatory Compliance

Grafana

Virtual Private Cloud

VLAN

Virtual Private Network

Software Troubleshooting

Communication

Ethernet

GPU

CUDA

HPC

Adapter

High Availability

ISO/IEC 27001:2005

Job Details

Title: Linux Admin
Location: Remote

Key Responsibilities
● Infrastructure Management: Provision, deploy, and maintain scalable, secure, and
highly available cloud infrastructure to support AI workloads.
● System Management: Administer and maintain Linux-based servers and clusters
optimized for GPU compute workloads, ensuring high availability and performance.
● GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA
GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in
AI/ML and HPC applications.
● Troubleshooting: Diagnose and resolve hardware and software issues related to GPU
compute nodes and performance issues in GPU clusters.
● High-Speed Interconnects: Implement and manage high-speed networking
technologies like RDMA over Converged Ethernet (RoCE) to support low-latency,
high-bandwidth communication for GPU workloads.
● CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD)
pipelines for testing GPU-based servers and managing deployments using tools like
GitHub Actions.
● Monitoring & Performance: Set up and maintain monitoring, logging, and alerting
systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance,
GPU utilization, resource bottlenecks, and uptime of GPU resources.
● Security and Compliance: Implement network security measures, including firewalls,
VLANs, VPNs, and intrusion detection systems, to protect the GPU compute
environment and comply with standards like SOC 2 or ISO 27001.

Required Qualifications
● Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or
cloud infrastructure management, with at least 1 year working on GPU-based compute
environments in the cloud.
● Linux Administration: Strong knowledge of Linux system administration for managing
network services and tools in a GPU compute environment.
● High-Speed Interconnects: Experience with high-performance networking technologies
like RoCE or 100GbE in compute-intensive environments.
● GPU-Specific Networking: Proficiency with NVIDIA GPU networking technologies,
such as Mellanox ConnectX adapters, and configuring Netplan to support their drivers
and firmware.
● Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS,
Azure, Google Cloud Platform).
● Networking & Security: Knowledge of networking concepts (VPC, subnets) and
security best practices (IAM, encryption, firewall configurations).

