Job Overview
We are seeking a highly skilled and proactive Lead Operations Engineer to join our team. This role requires deep technical expertise in AWS (VPC, S3, IAM, EC2, ECS, SageMaker), Databricks, and Terraform, along with strong communication skills to collaborate closely with business end users. The ideal candidate will ensure infrastructure is scalable, secure, and optimized for both technical and business needs.
Required Skills
5+ years of experience in DevOps, Cloud Operations, or Infrastructure Engineering roles.
Strong AWS expertise, including VPC, S3, IAM, EC2, ECS, and SageMaker.
Hands-on experience with Databricks and Apache Spark in production environments.
Proficiency in writing Terraform modules and managing infrastructure as code (IaC).
Strong experience with CI/CD pipelines and tools such as GitHub Actions.
Proven ability to write, manage, and enforce security and IAM policies.
Excellent problem-solving skills and a proactive approach to issue resolution.
Strong communication and stakeholder management skills; ability to work directly with non-technical users.
Good-to-Have Skills
AWS Certifications (e.g., DevOps Engineer, Solutions Architect).
Exposure to containerization and orchestration tools (Docker, Kubernetes).
Familiarity with data science or machine learning workflows.
Proficiency in scripting (Python, Bash) and Unix command-line tools.
Key Responsibilities
Collaborate directly with business end users to understand operational issues, troubleshoot problems, and implement long-term solutions.
Design, implement, and maintain scalable cloud infrastructure using AWS services and Terraform.
Manage and optimize Databricks and SageMaker environments for performance, reliability, and cost efficiency.
Develop and enforce security policies (IAM, resource access policies) to ensure compliance and secure operations.
Implement and maintain Infrastructure as Code (IaC) using Terraform.
Build and maintain CI/CD pipelines using GitHub Actions for automated deployments.
Create automation scripts for deployment, monitoring, and reporting processes.
Monitor system performance and reliability; handle incident response and root cause analysis.
We are an equal opportunity employer. All aspects of employment, including the decision to hire, promote, discipline, or discharge, will be based on merit, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, citizenship/immigration status, veteran status, or any other status protected under federal, state, or local law.