Develop, customize, and deploy MLOps services such as Vertex AI, SageMaker, and Kubeflow
Prototype and develop cloud-native architecture solutions for application needs, particularly on AWS
Provide infrastructure as code using Terraform and AWS CloudFormation
Provide on-call support for the platform
Perform automation, testing, performance tuning, and tool development
Provision and maintain cloud infrastructure that supports training machine learning models
Develop and deploy customized Kubernetes clusters for MLOps services like Kubeflow
Configure and integrate MLOps application components such as model lifecycle management, model serving, hyperparameter tuning, object storage, load balancers, and authentication (e.g. MLflow, Knative, Katib, MinIO, Istio, Dex, OIDC AuthService)
Understand the ML workflow and how ML pipelines automate it (data preprocessing, model training, model evaluation, hyperparameter tuning, model serving, model registries, etc.)
Build and test ML pipelines
Develop custom container images optimized for ML experimentation
Develop and deploy SageMaker domains with custom lifecycle configurations (e.g. idle kernel auto-shutdown) and custom images
Extensive experience with Kubernetes and Docker is a must
Industry experience with Amazon Web Services, including IAM, VPC, API Gateway, NLB, ALB, EC2, ECS, EKS, Lambda, S3, RDS, DynamoDB, SQS, etc.
Candidates must demonstrate strong knowledge of Linux systems
Proficiency in Python and Bash scripting is a must
Experience implementing CI/CD/CT pipelines and automating deployments with CI/CD tools and infrastructure as code (IaC)
Good understanding of networking and related protocols (HTTP, DNS, TLS, TCP)
Candidates must have demonstrated experience troubleshooting problems and working with a team to resolve production issues
Understanding of cloud provisioning tools, e.g. CloudFormation and Terraform
Good understanding of database technologies
Intimate familiarity with the DevOps toolkit (Terraform, Ansible, Chef, and other tools)
Exposure to pub/sub messaging systems (e.g. AWS SNS, SQS, RedisQ)
Exposure to data science IDEs such as RStudio or Jupyter Notebook is a huge plus