Job Title: Senior Cloud DevOps & Infrastructure Engineer (Google Cloud Platform/AI Focus)
Location: Dallas, TX (Onsite)
Job Details: FTE
Job Description
We are looking for a hands-on Cloud Infrastructure & DevOps Engineer to design, deploy, and maintain secure, scalable infrastructure, primarily on Google Cloud Platform within a multi-cloud environment. The ideal candidate is an expert in Kubernetes, Terraform, and GitLab CI/CD, with specific experience supporting AI/ML workloads. You will bridge the gap between development and operations, implementing GitOps best practices and troubleshooting complex production deployments in a hardened security environment.
Key Responsibilities
- Infrastructure as Code (IaC): Architect and provision production-grade infrastructure using Terraform. Manage state files and modules, and ensure infrastructure immutability.
- AI/ML: Work with LLMs in a multi-cloud environment.
- Kubernetes & Containerization: Design and manage clusters. Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images). Manage complex deployments using Helm charts.
- CI/CD & GitOps: Build end-to-end CI/CD pipelines using GitLab CI. Implement GitOps workflows to synchronize infrastructure and application state.
- MLOps Infrastructure: Design, configure, and manage scalable and secure cloud infrastructure for MLOps.
- AI Infrastructure Support: Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage).
- Production Support & Troubleshooting: Act as the primary escalation point for deployment failures and for network and infrastructure issues. Perform Root Cause Analysis (RCA).
- Security & Compliance: Implement 'Secure by Design' principles.
- Network & Identity Security: Apply good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS).
- Multi-Cloud Strategy: While Google Cloud Platform is primary, maintain and support secondary environments in AWS (and potentially Azure) to ensure business continuity.
Technical Skills (Must-Have)
- Cloud Infrastructure: Cloud infrastructure design and implementation is the primary skill, with experience in Azure and AWS.
- Cloud Platforms: Deep expertise in Google Cloud Platform (Compute Engine, GKE, Cloud Storage, IAM). Strong working knowledge of AWS (EC2, EKS, S3, IAM).
- Programming Languages: Python (required); knowledge of Java, C#, or JavaScript is a plus.
- Container Orchestration: Advanced proficiency in Kubernetes. Ability to write and manage custom Helm charts. Experience with Ingress Controllers (NGINX), Service Mesh, and Autoscaling (HPA/VPA/Cluster Autoscaler).
- DevOps & CI/CD: Expert-level knowledge of GitLab CI/CD (writing .gitlab-ci.yml, runners, artifacts, caching). Understanding of GitOps principles.
- Infrastructure Provisioning: Strong hands-on experience with Terraform for provisioning cloud resources across multiple environments (Dev/Stage/Prod).
- Programming Skills: Proficiency in Bash/Shell scripting and Python. Strong Linux administration skills.
- Observability: Experience setting up monitoring and alerting using cloud-native tools, Prometheus, and Grafana.
Good-to-Have Skills (Preferred)
- Experience with Azure Cloud infrastructure.
- Knowledge of Identity Providers (Keycloak, Azure AD/Entra ID) and OIDC integration.
- Experience with Service Mesh.
- Understanding of ITIL processes (Incident/Change Management) and tools like ServiceNow, JIRA.
- Basic understanding of Python/Flask/FastAPI applications to assist developers in troubleshooting.
Behavioral & Soft Skills
- Problem Solver: Ability to debug complex networking (Proxy/DNS) and application issues in a distributed environment.
- Collaboration: Ability to work closely with Data Scientists and Backend Developers to understand AI workload requirements.
- Agility: Highly agile with the ability to learn and adapt to new technologies quickly.
- Communication: Strong written and verbal communication skills for documentation and cross-team coordination.
Certifications (Highly Preferred)
- Google Cloud Professional Cloud Architect or Professional Cloud DevOps Engineer.
- Certified Kubernetes Administrator (CKA).
- HashiCorp Certified: Terraform Associate.
- AWS Certified Solutions Architect (Associate/Professional).
We are an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, citizenship status, age, disability, genetic information, protected veteran status, or any other characteristic protected by applicable law.