Job Title: Lead Infrastructure Admin – AI Cloud Services (Azure & AWS)
Location: California, USA (Remote – candidates must be located in California)
Duration: 6+ Month Contract
Start Date: 05/15/2026
Job Summary
We are seeking an experienced Infrastructure Administrator / Cloud Engineer with strong expertise in supporting AI/ML cloud environments across both Azure and AWS platforms.
The ideal candidate should have hands-on experience managing scalable cloud infrastructure, AI model hosting environments, Kubernetes clusters, DevOps automation, and Infrastructure as Code for enterprise AI services.
This role will focus heavily on LLM infrastructure setup, GPU workloads, cloud security, CI/CD enablement, and AI platform administration.
Healthcare industry exposure is highly preferred.
Mandatory Experience
- 5+ years of experience in Infrastructure Administration / Cloud Infrastructure Engineering
- Strong recent experience supporting AI services infrastructure on Azure & AWS
- Experience working in centralized support / triaging teams
- Experience supporting production-grade AI/ML platforms, model training, and inference workloads
Must Have Technical Skills
- Microsoft Azure
- AWS Cloud Computing
- DevOps / CI-CD
- Artificial Intelligence Infrastructure Support
- Azure Platform Services
- Cloud Services Administration
- Kubernetes Clusters
- Infrastructure as Code
- LLM Infrastructure Setup
Key Responsibilities
1. Cloud Infrastructure Management
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
- Manage services such as:
  - EC2
  - Azure Virtual Machines
  - GPU Instances
  - EKS / AKS
  - ECS
  - VPC
  - S3
  - Lambda
  - Route 53
  - Kubernetes Clusters
- Configure networking, storage, compute, and security services for AI environments.
- Ensure high availability, reliability, scalability, and fault tolerance.
2. AI / ML Platform Support
- Deploy and maintain enterprise AI/ML services including:
  - Amazon SageMaker
  - Azure Machine Learning
  - Azure AI Foundry
- Build and maintain AI model training and inference environments.
- Support Data Scientists, ML Engineers, and AI teams with optimized GPU/cloud infrastructure.
- Assist with LLM deployment environments and GenAI service administration.
3. Automation / Infrastructure as Code
- Implement Infrastructure as Code using:
  - Terraform
  - Terragrunt
  - CloudFormation
  - ARM Templates / Bicep
  - Dockerfiles
- Automate provisioning, patching, configuration management, and environment scaling.
4. Containerization & Orchestration
- Deploy and manage containerized AI workloads using:
  - Docker
  - Kubernetes
  - Amazon EKS
  - Azure Kubernetes Service (AKS)
  - Amazon ECS
- Manage cluster administration, scaling, pod monitoring, and deployment troubleshooting.
5. Monitoring & Performance Optimization
- Monitor AI infrastructure using:
  - CloudWatch
  - Azure Monitor
  - Datadog
  - Prometheus
- Optimize:
  - GPU utilization
  - Cloud cost
  - AI workload performance
  - Resource consumption
6. Security & Compliance
- Implement:
  - IAM / RBAC
  - Network Security Groups
  - Encryption
  - Secrets Management
- Ensure enterprise compliance and secure cloud governance.
7. DevOps & CI/CD Integration
- Integrate AI workloads with CI/CD pipelines.
- Support automated deployment of ML models and AI services.
- Work with:
  - GitHub Actions
  - Terraform pipelines
  - Container deployment automation
Required Qualifications
- Bachelor’s Degree in Computer Science / Information Systems / Related Field
- 5+ years in Cloud Infrastructure / Infrastructure Administration
- Strong Linux Administration
- Scripting experience in:
- Strong experience with Terraform / Terragrunt
- Experience with Docker & Kubernetes
- Experience with GitHub Actions
- Hands-on experience with LLM/GenAI Infrastructure setup
- Experience in incident triaging / centralized cloud operations