Key Responsibilities
• Multi-Cloud Pipeline Execution: Build and maintain automated CI/CD and CT (Continuous Training) pipelines across AWS (SageMaker/Bedrock) and Azure (AI Studio).
• LLMOps Framework Implementation: Design and execute the infrastructure for Retrieval-Augmented Generation (RAG), including vector database management (OpenSearch, Pinecone, or Azure AI Search) and semantic index optimization.
• Legacy Data Connectivity: Build the engineering "pipes" to securely ingest and move data from legacy systems (mainframes, SQL Server, on-prem DBs) into cloud-native MLOps workflows.
• Automated Model Evaluation: Implement systematic frameworks for LLM evaluation (LLM-as-a-judge, ROUGE, METEOR) and traditional ML validation to verify performance before deployment.
• Observability & Monitoring: Deploy real-time monitoring for model drift, hallucination detection, latency, and token consumption to manage both quality and cost.
• Infrastructure as Code (IaC): Manage all AI resources using Terraform or CloudFormation, ensuring the cloud posture is reproducible, secure, and compliant with a "Privacy by Design" mandate.
• Advanced Analytics Integration: Partner with teams using platforms like Palantir, Databricks, or Snowflake to ensure high-fidelity data flow between analytical ontologies and production models.
• IT & Security Diplomacy: Work directly with central IT and Security to navigate IAM roles, VPC peering, and firewall configurations, clearing the path for rapid transformation.
• Scalable Inference Engineering: Optimize model serving endpoints for high throughput and low latency, utilizing containerization (Docker/Kubernetes) and serverless architectures where appropriate.
• Prompt & Model Versioning: Establish rigorous version control for prompts (PromptOps), model weights, and data snapshots to ensure full auditability and rollback capability.
• Data Science Engineering: Support the data science lifecycle by automating feature stores, feature engineering pipelines, and the transition of experimental notebooks into hardened production microservices.
• Security & Compliance Hardening: Implement automated scanning and guardrails (e.g., Amazon Bedrock Guardrails or Azure AI Content Safety) to prevent prompt injection and data leakage.
Qualifications
• Education: Bachelor’s degree in Computer Science or a related field required; Master’s degree in a quantitative discipline highly desirable.
• Proven Execution: 6+ years of engineering experience, with a minimum of 3 years strictly focused on MLOps or LLMOps in a production environment.
• AWS & Azure Mastery: Deep, hands-on proficiency in both ecosystems. You must be able to configure Bedrock and Azure OpenAI services, including private networking and endpoint security, on day one.
• Technical Stack: Expert-level Python, SQL, and PySpark. Extensive experience with containerization (Docker, Kubernetes) and orchestration tools (Airflow, Kubeflow, or Step Functions).
• LLM Tooling: Professional experience with evaluation and observability frameworks like LangSmith, Arize Phoenix, or WhyLabs.
• Data Science Flavor: A strong understanding of statistical validation, model evaluation metrics, and the ability to partner with Data Scientists to optimize model performance.
• Transformation Mindset: The ability to move at the speed of a startup while maintaining the collaborative relationships required to function within a large-scale enterprise IT landscape.