AI/ML Cloud Engineer
Location: Bloomfield, CT
Type: Hybrid - 3 days onsite
Duration: Long-term contract - W2
Key Responsibilities:
Cloud Infrastructure Management
* Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
* Manage compute and supporting services such as EC2, Azure Virtual Machines, GPU instances, EKS, ECS, Lambda, Kubernetes clusters, VPC, S3, and Route 53.
* Provision and configure storage, networking, and security services for AI platforms.
* Ensure high availability, scalability, and reliability of AI environments.
AI Platform Support
* Deploy and maintain AI/ML services such as:
* Amazon SageMaker
* Microsoft Foundry (on Azure)
* Azure Machine Learning
* AI model training and inference environments
* Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment.
Automation & Infrastructure as Code
* Implement Infrastructure as Code (IaC) using tools such as:
* Terraform
* CloudFormation
* ARM templates/Bicep
* Dockerfiles
* Automate environment setup, provisioning, patching, and scaling.
Containerization & Orchestration
* Deploy and manage containerized AI workloads using:
* Docker
* Kubernetes
* Amazon EKS
* Azure Kubernetes Service (AKS)
* ECS
Monitoring & Performance Optimization
* Monitor system health, performance, and resource utilization using tools like:
* CloudWatch
* Azure Monitor
* Datadog / Prometheus
* Optimize infrastructure for cost, performance, and GPU utilization.
Security & Compliance
* Implement cloud security best practices including:
* IAM / RBAC management
* Network security groups
* Encryption and secrets management
* Ensure compliance with organizational and regulatory standards.
CI/CD & DevOps Integration
* Integrate AI infrastructure with CI/CD pipelines.
* Support automated deployment of models and AI services.
Required Qualifications
* Bachelor's degree in Computer Science, Information Systems, or a related field.
* 5+ years of experience in infrastructure administration or cloud engineering.
* Strong hands-on experience with:
* AWS cloud services
* Microsoft Azure cloud services
* Experience supporting AI/ML infrastructure or data platforms.
* Proficiency with Linux administration and scripting (Python, Bash, PowerShell), plus Terraform and Terragrunt.
* Experience with Docker and Kubernetes.
* Experience with GitHub Actions.
* Experience setting up LLM infrastructure.
* Experience working in a centralized team with triage responsibilities.