Senior Cloud DevOps & Infrastructure Engineer (Google Cloud Platform/AI Focus)

Dallas, TX, US • Posted 3 hours ago • Updated 3 hours ago
Full Time
On-site
Depends on Experience

Job Details

Skills

  • API
  • Amazon S3
  • Amazon Web Services
  • Google Cloud Platform
  • Machine Learning (ML)
  • Machine Learning Operations (ML Ops)

Summary

Job Title: Senior Cloud DevOps & Infrastructure Engineer (Google Cloud Platform/AI Focus)

Location: Dallas, TX (Onsite)

Job Details: FTE

 

Job Description

 

We are looking for a hands-on Cloud Infrastructure & DevOps Engineer to design, deploy, and maintain secure, scalable infrastructure in a multi-cloud environment, primarily on Google Cloud Platform. The ideal candidate is an expert in Kubernetes, Terraform, and GitLab CI/CD, with specific experience supporting AI/ML workloads. You will bridge the gap between development and operations, implementing GitOps best practices and troubleshooting complex production deployments in a hardened security environment.

 

Key Responsibilities

  • Infrastructure as Code (IaC): Architect and provision production-grade infrastructure using Terraform. Manage state files and modules, and ensure infrastructure immutability.
  • AI/ML: Work with LLMs in a multi-cloud environment.
  • Kubernetes & Containerization: Design and manage clusters. Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images). Manage complex deployments using Helm charts.
  • CI/CD & GitOps: Build end-to-end CI/CD pipelines using GitLab CI. Implement GitOps workflows to synchronize infrastructure and application state.
  • Design, configure, and manage scalable and secure cloud infrastructure for MLOps.
  • AI Infrastructure Support: Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage).
  • Production Support & Troubleshooting: Act as the primary escalation point for deployment failures and for network and infrastructure issues. Perform Root Cause Analysis (RCA).
  • Security & Compliance: Implement 'Secure by Design' principles.
  • Network & Identity Security: Good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS).
  • Multi-Cloud Strategy: While Google Cloud Platform is primary, maintain and support secondary environments in AWS (and potentially Azure) to ensure business continuity.

 

Technical Skills (Must-Have)

  • Cloud infrastructure design and implementation is the primary skill, with experience in Azure and AWS.
  • Cloud Platforms: Deep expertise in Google Cloud Platform (Compute Engine, GKE, Cloud Storage, IAM). Strong working knowledge of AWS (EC2, EKS, S3, IAM).
  • Programming Languages: Python (required); knowledge of Java, C#, or JavaScript is a plus.
  • Container Orchestration: Advanced proficiency in Kubernetes. Ability to write and manage custom Helm charts. Experience with Ingress Controllers (Nginx), Service Mesh, and Autoscaling (HPA/VPA/Cluster Autoscaler).
  • DevOps & CI/CD: Expert-level knowledge of GitLab CI/CD (writing .gitlab-ci.yml, runners, artifacts, caching). Understanding of GitOps principles.
  • Infrastructure Provisioning: Strong hands-on experience with Terraform for provisioning cloud resources across multiple environments (Dev/Stage/Prod).
  • Programming Skills: Proficiency in Bash/Shell scripting and Python. Strong Linux administration skills.
  • Observability: Experience setting up monitoring and alerting using cloud-native tools, Prometheus, and Grafana.

 

Good-to-Have Skills (Preferred)

  • Experience with Azure Cloud infrastructure.
  • Knowledge of Identity Providers (Keycloak, Azure AD/Entra ID) and OIDC integration.
  • Experience with service mesh.
  • Understanding of ITIL processes (Incident/Change Management) and tools like ServiceNow, JIRA.
  • Basic understanding of Python/Flask/FastAPI applications to assist developers in troubleshooting.

 

Behavioral & Soft Skills

  • Problem Solver: Ability to debug complex networking (Proxy/DNS) and application issues in a distributed environment.
  • Collaboration: Ability to work closely with Data Scientists and Backend Developers to understand AI workload requirements.
  • Agility: Highly agile with the ability to learn and adapt to new technologies quickly.
  • Communication: Strong written and verbal communication skills for documentation and cross-team coordination.

 

Certifications (Highly Preferred)

  • Google Professional Cloud Architect or Cloud DevOps Engineer.
  • Certified Kubernetes Administrator (CKA).
  • HashiCorp Certified: Terraform Associate.
  • AWS Certified Solutions Architect (Associate/Professional).

 

We are an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, citizenship status, age, disability, genetic information, protected veteran status, or any other characteristic protected by applicable law. 

  • Dice Id: 91081485
  • Position Id: 9002083
Contact the job poster
Sugan Selvaraj

Recruiter @ Galent