ML/AI Operations Architect

  • Gaithersburg, MD
  • Posted 8 hours ago | Updated 8 hours ago

Overview

On Site
Depends on Experience
Full Time
Accepts corp to corp applications

Skills

Artificial Intelligence
Machine Learning

Job Details

The Machine Learning and Artificial Intelligence Operations team (ML/AI Ops) is a newly formed team that will spearhead the design, creation, and operational excellence of our entire ML/AI data and computational AWS ecosystem to catalyze and accelerate science-led innovations.

This team is responsible and accountable for the design, implementation, deployment, health, and performance of all algorithms, models, ML/AI operations (MLOps, AIOps, and LLMOps), and the Data Science Platform. We manage ML/AI and broader cloud resources, automate operations through infrastructure-as-code and CI/CD pipelines, and ensure best-in-class operations, striving to push beyond mere compliance with industry standards such as Good Clinical Practice (GCP) and Good Machine Learning Practice (GMLP).

As the ML/AI Platform Architect on our team, you will architect and oversee the global cloud ML/AI infrastructure that underpins our entire ML/AI value proposition. You will design, implement, and manage scalable cloud solutions using AWS services while establishing ML/AI governance frameworks, automating infrastructure with tools like AWS CDK and Projen, and conducting cost-benefit analyses of foundation models to drive strategic decisions across the organization.

This position requires a deep understanding of cloud-native ML/AI Ops methodologies and technologies, AWS infrastructure, state-of-the-art (SOTA) foundation models and AWS GenAI services, and the unique demands of regulated industries, making it a cornerstone of our success in delivering impactful solutions to the pharmaceutical industry.

Accountabilities:

Operational Excellence

  • Lead by example in creating high-performance, mission-focused and interdisciplinary teams/culture founded on trust, mutual respect, growth mentalities, and an obsession for building extraordinary products with extraordinary people.
  • Drive the creation of proactive capability and process enhancements that ensure enduring value creation and analytic compounding interest.
  • Design and implement resilient cloud ML/AI operational capabilities to improve our system Abilities (Learnability, Flexibility, Extendibility, Interoperability, Scalability).
  • Drive precision and systemic cost efficiency, optimized system performance, and risk mitigation with a data-driven strategy, comprehensive analytics, and predictive capabilities at the tree-and-forest level of our ML/AI systems, workloads and processes.

ML/AI Cloud Operations and Engineering

  • Architect and implement scalable AWS ML/AI cloud infrastructure in a multi-tenant SaaS environment.
  • Establish governance frameworks for ML/AI infrastructure management and ensure compliance with industry standard processes.
  • Ensure principled and methodical validation pathways and a Well-Architected Framework for Embryonic Research (WAFER), similar to and building on AWS's Well-Architected Framework (WAF), for all early-stage product and operational GenAI PoCs across the organization.
  • Oversee ML/AI related Kubernetes (k8s) cluster management and provide guidance on alternative ML/AI workflow orchestration options such as Argo vs Kubeflow, and ML/AI data pipeline creation, management and governance with tools like Airflow.
  • Employ AWS CDK (TypeScript), Projen, and Argo CD to automate infrastructure deployment and management.
  • Help set the strategy and manage the tactical balance between experimentation with and democratization of frameworks and platforms on one hand, and standardization, centralized management, and governance on the other.
  • Conduct cost-benefit analyses and formal processes for selection and utilization of foundation models, evaluating their architectures, performance, and costs.
  • Work with multiple teams to ensure that the platform meets organizational needs and scales effectively.

Personal Attributes:

  • Customer-obsessed and passionate about building products that solve real-world problems.
  • Highly organized and diligent, with the ability to manage multiple initiatives and deadlines.
  • Collaborative and inclusive, fostering a positive team culture where creativity and innovation thrive.

Essential Skills/Experience:

  • HS Diploma and 5 years of experience in Engineering/IT solutions, OR a BA/BS.
  • Minimum of 5 years in cloud infrastructure design and management roles.
  • Deep understanding of the Data Science Lifecycle (DSLC) and the ability to shepherd data science projects from inception to production within the platform architecture.
  • Expert in TypeScript, AWS CDK, Projen, Argo CD, and other cloud infrastructure CI/CD tools.
  • Extensive experience in managing Kubernetes clusters for ML workflows.
  • Solid understanding of foundation models and their applications in ML/AI solutions.
  • Strong background in AWS DevOps practices and cloud architecture.
  • Deep knowledge of AWS services (Bedrock, SageMaker, EC2, S3, RDS, Lambda, etc.) and hands-on design and implementation of cloud systems, including microservices architecture, API design, and database management (SQL/NoSQL).
  • Experience with monitoring and optimizing cloud infrastructure for scalability and cost-efficiency.
  • Ability to collaborate effectively with engineering, design, product, science and security teams.
  • Strong written and verbal communication skills for reporting and documentation.
  • Demonstrated ability to manage large-scale, complex projects across an organization.
  • Proven experience in conducting performance and cost analyses of AWS infrastructure and ML/AI models.