Job Title: Sr Cloud Platform Engineer
Duration: 12+ Months
Location: Irvine, CA (Hybrid)
Key Responsibilities
· Design and operate scalable, highly available, multi-account cloud infrastructure (AWS)
· Build and maintain Infrastructure-as-Code modules and standards using Terraform
· Develop reusable platform patterns, landing zones, and golden paths for engineering teams
· Optimize and operate CI/CD pipelines (Jenkins, GitHub Actions, Harness)
· Enable developer self-service and reduce manual intervention through automation
· Manage Kubernetes platforms (EKS): networking, scaling, upgrades, and workload onboarding
· Operate and support Kafka-based event streaming platforms: topics, schemas, connectors, and cluster reliability
· Build and integrate REST APIs and self-service tooling to streamline platform workflows
· Implement cloud security and governance (IAM, OAuth/OIDC, Okta, SSL/TLS, secrets management)
· Drive cloud cost optimization, capacity planning, and FinOps practices
· Implement observability: metrics, logging, tracing, alerting, and SLOs
· Lead incident response, troubleshooting, and root cause analysis across platform and runtime systems
· Partner with application teams to troubleshoot infrastructure, deployment, and runtime issues
· Drive continuous improvement using operational insights and user feedback
· Enhance documentation, runbooks, and platform usability
Technical Skills
· Cloud: AWS (EKS, EC2, VPC/Networking, IAM, S3, RDS, Lambda)
· IaC: Terraform (modules, state management, policy-as-code)
· CI/CD: Jenkins, GitHub Actions, Harness
· APIs & Integration: REST APIs (design, development, integration), Async APIs
· Containers & Orchestration: Docker, Kubernetes (EKS)
· Event Streaming: Kafka, Confluent (topics, schemas, Kafka Connect, cluster linking)
· Monitoring/Observability: Datadog, CloudWatch
· Security: OAuth, OIDC, Okta, SSL/TLS, IAM, secrets management
· Programming/Scripting: Java/Python
Key Competencies
· Strong troubleshooting and problem-solving across distributed systems
· Ability to translate operational issues into durable platform improvements
· Systems-thinking approach to reliability, security, and cost
· Effective collaboration and technical mentorship across engineering teams