Title: Linux Admin
Location: Remote

Key Responsibilities
● Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability infrastructure on cloud platforms to support AI workloads.
● System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
● GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.
● Troubleshooting: Diagnose and resolve hardware, software, and performance issues on GPU compute nodes and clusters.
● High-Speed Interconnects: Implement and manage high-speed networking technologies such as RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
● CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
● Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
● Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.

Required Qualifications
● Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.
● Linux Administration: Strong knowledge of Linux system administration for managing network services and tools in a GPU compute environment.
● High-Speed Interconnects: Experience with high-performance networking technologies such as RoCE or 100GbE in compute-intensive environments.
● GPU-Specific Networking: Proficiency with NVIDIA GPU networking technologies, such as Mellanox ConnectX adapters, and configuring Netplan to support their drivers and firmware.
● Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, Google Cloud Platform).
● Networking & Security: Knowledge of networking concepts (VPC, subnets) and security best practices (IAM, encryption, firewall configurations).