Overview
On Site
Compensation information provided in the description
Contract - Independent
Skills
FOCUS
Workflow
Product Management
IaaS
Provisioning
Terraform
Continuous Integration
Continuous Delivery
Testing
GitHub
Computer Networking
Virtual Private Cloud
Amazon Web Services
Amazon SageMaker
Streaming
LangChain
Time Series
Step Functions
PostgreSQL
Amazon DynamoDB
Evaluation
Regression Analysis
Dashboard
Budget
Routing
Regulatory Compliance
Access Control
Auditing
Extract, Transform, Load (ETL)
Multimedia
Amazon S3
Storage
IoT
Release Management
Management
Messaging
Adobe AIR
Analytics
Application Support
Cloud Computing
Machine Learning (ML)
Performance Tuning
Caching
Redis
GPU
CPU
Privacy
Marketing
Job Details
Location: San Carlos, CA
Salary: $100.00 USD Hourly - $120.00 USD Hourly
Description:
Title: Cloud/ML Infrastructure Engineer
Location: Hybrid - San Carlos, CA
This is a hybrid role and requires working from our San Carlos, CA office at least three days a week, with the option to work remotely on the remaining days.
Duration: 1-year contract
Job Description:
We are seeking skilled Cloud and ML Infrastructure Engineers to lead the buildout of our AWS foundation and our LLM platform. You will design, implement, and operate services that are scalable, reliable, and secure.
The role is broad, so focused experience in LLM/ML infrastructure is a strong plus, as is experience with AWS IoT services. For ML infrastructure, you will build the stack that powers retrieval-augmented generation (RAG) and application workflows built with frameworks such as LangChain.
You will work closely with other engineers and product management. The ideal candidate is hands-on, comfortable with ambiguity, and excited to build from first principles.
Key Responsibilities
Cloud Infrastructure Setup and Maintenance
Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform.
Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline.
Operate secure networking with VPCs, PrivateLink, and VPC endpoints. Manage IAM, KMS, Secrets Manager, and audit logging.
LLM Platform and Runtime
Stand up and operate model endpoints using Amazon Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs.
Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies.
Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling.
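The backoff strategy mentioned above can be sketched as a small retry wrapper around any provider client. This is a minimal illustration, not the team's actual implementation; `call_with_backoff` and its parameters are hypothetical names, and the wrapped callable stands in for a real Bedrock or SageMaker invocation.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0,
                      retryable=(TimeoutError,), sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff.

    Uses "full jitter": each wait is a random amount up to the capped
    exponential delay, which avoids synchronized retry storms when many
    clients fail at once.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injected so unit tests can skip real waiting; in production you would also restrict `retryable` to the provider's throttling and timeout exception types.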
RAG Data Systems and Vector Search
Design chunking and embedding pipelines for documents, time series, and multimedia. Orchestrate with Step Functions or Airflow.
Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone. Tune recall, latency, and cost.
Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources.
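A chunking pipeline like the one described above often starts from a fixed-size splitter with overlap before moving to structure-aware splitting. A minimal sketch, with illustrative sizes that are assumptions rather than anything specified in this role:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping fixed-size chunks for embedding.

    Overlap preserves context across chunk boundaries, which tends to
    improve retrieval recall at the cost of a larger index.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk would then be embedded and upserted into the vector store; Step Functions or Airflow would fan this out over the document corpus.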
Evaluation, Observability, and Cost Governance
Create offline and online eval harnesses for prompts, retrievers, and chains. Track quality, latency, and regression risk.
Instrument model and app telemetry with CloudWatch and OpenTelemetry. Build token usage and cost dashboards with budgets and alerts.
Add guardrails, rate limits, fallbacks, and provider routing for resilience.
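An offline eval harness for a retriever can start as simply as recall@k against a labeled query set. The sketch below is a hypothetical shape for such a harness; the data format (query paired with relevant document ids) is an assumption, not a spec from this posting.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids appearing in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def evaluate_retriever(retrieve, labeled_queries, k=5):
    """Average recall@k over (query, relevant_ids) pairs.

    `retrieve` is any callable mapping a query to a ranked list of doc
    ids, so the same harness can run against OpenSearch, pgvector, or a
    stub in unit tests, and regressions show up as a drop in the score.
    """
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in labeled_queries]
    return sum(scores) / len(scores)
```

Tracking this score per release is one concrete way to catch the "regression risk" the bullet above calls out.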
Safety, Privacy, and Compliance
Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review where needed.
Use Bedrock Guardrails or policy services to enforce safety standards. Maintain audit trails for regulated environments.
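A first pass at the PII redaction described above is often regex-based, layered under an ML detector such as Amazon Comprehend. The patterns below are illustrative and deliberately incomplete; production coverage is much broader.

```python
import re

# Illustrative patterns only; real systems need locale-specific formats,
# names, addresses, and an ML-based detector on top of these.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact_pii(text):
    """Replace each PII match with a typed placeholder like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanket removal) keep redacted text useful for downstream LLM prompts and for audit review.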
Data Pipeline Construction
Build ingestion and processing pipelines for structured, unstructured, and multimedia data. Ensure integrity, lineage, and cataloging with Glue and Lake Formation.
Optimize bulk data movement and storage in S3, Glacier, and tiered storage. Use Athena for ad-hoc analysis.
IoT Deployment Management
Manage infrastructure that deploys to and communicates with edge devices. Support secure messaging, identity, and over-the-air updates.
Analytics and Application Support
Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features.
Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance.
Performance Optimization
Tune retrieval quality, context window use, and caching with Redis or Bedrock Knowledge Bases.
Optimize inference with model selection, quantization where applicable, GPU/CPU instance choices, and autoscaling strategies.
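Response caching, whether backed by Redis or an in-process store, usually keys on a hash of the model plus a normalized prompt. The in-memory sketch below is a stand-in: a real deployment would swap the dict for a Redis client with TTLs and eviction, and might also match semantically similar prompts via embeddings.

```python
import hashlib


class LLMResponseCache:
    """Exact-match response cache keyed on a hash of (model, prompt)."""

    def __init__(self):
        self._store = {}  # Stand-in for Redis; no TTL or eviction here.
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Normalize whitespace so trivially different prompts share a key.
        raw = f"{model}\x00{' '.join(prompt.split())}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, model, prompt, call):
        """Return a cached response, or invoke `call` and cache the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(prompt)
        self._store[key] = result
        return result
```

The hit/miss counters feed directly into the cost dashboards mentioned earlier, since every cache hit is an inference call (and its tokens) not spent.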
By providing your phone number, you consent to: (1) receive automated text messages and calls from the Judge Group, Inc. and its affiliates (collectively "Judge") to such phone number regarding job opportunities, your job application, and for other related purposes. Message & data rates apply and message frequency may vary. Consistent with Judge's Privacy Policy, information obtained from your consent will not be shared with third parties for marketing/promotional purposes. Reply STOP to opt out of receiving telephone calls and text messages from Judge and HELP for help.
Contact:
This job and many more are available through The Judge Group. Please apply with us today!
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.