Staff Machine Learning Engineer - SMLENGG 25-33679

Overview

On Site
Depends on Experience
Accepts corp to corp applications
Contract - Independent
Contract - W2
No Travel Required

Skills

Virtual Private Cloud
Workflow
Technical Communication
Testing
Timing Closure
Training
Shipping
Specification Gathering
Regression Analysis
Regulatory Compliance
Research
Optimization
Legal
Step Functions
System On A Chip
SystemVerilog
Translation
Performance Tuning
Licensing
Machine Learning (ML)
Machine Learning Operations (ML Ops)
Customization
DLP
Data Analysis
EDA
Encryption
Code Review
Continuous Delivery
Apache Velocity
RTL
Roadmaps
SAFE
IDE
Continuous Integration
Evaluation
GitHub
Privacy
PyTorch
Python
GitLab
HDL
Hardware Development
ISO/IEC 27001:2005
Leadership
PASS
Artificial Intelligence
Caching
Code Refactoring
Continuous Improvement
Verilog
Amazon EC2
Amazon S3
Amazon SageMaker
Amazon Web Services
Auditing

Job Details

Staff Machine Learning Engineer, LLM Fine-Tuning (Verilog/RTL Applications)

Level: Staff
Location: San Jose, CA
Cloud: AWS (primary: Bedrock + SageMaker)

Why this role exists

You will architect and lead privacy-preserving LLM capabilities that support hardware design teams working with Verilog/SystemVerilog and RTL artifacts. This includes code generation, refactoring, lint explanation, constraint translation, and spec-to-RTL assistance. You'll lead a small, high-leverage team focused on fine-tuning and productizing LLMs in a strict enterprise data-privacy environment.

You do not need deep RTL expertise to start; curiosity, LLM craftsmanship, and strong engineering rigor matter most. Exposure to HDL/EDA tooling is a plus.

Responsibilities

Technical Leadership & Roadmap

  • Own the end-to-end roadmap for Verilog/RTL-focused LLM capabilities, covering model selection, fine-tuning, evals, deployment, and continuous improvement.

  • Lead a hands-on team of applied ML engineers/scientists, unblock technically, review designs and code, and drive experimentation velocity and reliability.

Model Training & Customization

  • Fine-tune and customize models using modern techniques (LoRA/QLoRA, PEFT, instruction tuning, RLAIF/preference optimization).

  • Build HDL-aware evaluation workflows:

    • Compile/lint/simulate-based pass rates

    • Pass@k for code generation

    • Constrained decoding enforcing HDL syntax

    • Does-it-synthesize? checks
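To make the pass@k metric above concrete, here is a minimal sketch of the standard unbiased pass@k estimator (popularized by the Codex evaluation work): given n sampled completions per prompt of which c pass the compile/simulate checks, it estimates the probability that at least one of k random samples passes. The numbers in the example are illustrative, not from this role.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them passed
    the checks (e.g. compile + lint + simulate); returns the estimated
    probability that at least one of k random draws passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 completions sampled per prompt, 5 compiled and simulated cleanly.
print(round(pass_at_k(20, 5, 1), 4))   # 0.25 (the raw per-sample pass rate)
print(round(pass_at_k(20, 5, 10), 4))  # much higher with 10 attempts
```

The combinatorial form avoids the high variance of naively sub-sampling k completions from the n generated.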

Privacy-First AWS ML Pipelines

  • Design secure training & inference environments using AWS services such as:

    • Amazon Bedrock (incl. Anthropic models)

    • SageMaker or EKS + KServe/Triton/DJL for bespoke training

  • Implement strict privacy controls:

    • Artifacts in S3 with KMS CMKs

    • VPC-only infrastructure with PrivateLink (incl. Bedrock endpoints)

    • IAM least-privilege, CloudTrail auditing

    • Secrets Manager for credential handling

    • Full encryption in transit/at rest

    • No public egress for customer/RTL corpora
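As a sketch of the "S3 with KMS CMKs" control above: the helper below builds the request parameters that force SSE-KMS with a customer-managed key on an S3 upload. The bucket name, object key, and KMS ARN are hypothetical placeholders; the actual upload would go through boto3's `put_object`, which accepts these fields.

```python
def kms_put_object_params(bucket: str, key: str, kms_key_arn: str) -> dict:
    """Build S3 PutObject parameters that require server-side encryption
    with a specific customer-managed KMS key (all names are placeholders)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }

params = kms_put_object_params(
    "example-rtl-corpus",                           # hypothetical bucket
    "datasets/train.jsonl",                         # hypothetical key
    "arn:aws:kms:us-west-2:111122223333:key/abcd",  # hypothetical CMK ARN
)
# With boto3 (not imported here), the call would be roughly:
#   boto3.client("s3").put_object(Body=data, **params)
print(params["ServerSideEncryption"])  # aws:kms
```

Pairing this with a bucket policy that denies unencrypted PutObject requests turns the convention into an enforced control.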

Inference & Deployment

  • Stand up scalable, reliable LLM serving:

    • Bedrock model invocation where applicable

    • Low-latency self-hosted inference (vLLM/TensorRT-LLM)

    • Autoscaling and canary/blue-green rollouts
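One way to picture the canary rollout bullet above: route a fixed, deterministic slice of traffic to the new model version by hashing a stable request attribute. This is an illustrative sketch only (real routing would typically live in the serving gateway or mesh); the request-id format is hypothetical.

```python
import hashlib

def route_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically send ~canary_percent% of traffic to the canary
    model: hash the request id into a 0-99 bucket and compare against the
    rollout knob. Deterministic hashing keeps a given caller pinned to the
    same model version across retries."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

hits = sum(route_to_canary(f"req-{i}", 10) for i in range(10_000))
print(hits)  # roughly 10% of requests land on the canary
```

Ramping `canary_percent` from 0 to 100 while watching eval and latency SLOs gives a gradual blue-green-style cutover.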

Evaluation Culture & Tooling

  • Build automated regression suites running HDL compilers/simulators to measure correctness and detect hallucinations.

  • Track experiments and produce model cards using MLflow/W&B.
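A minimal sketch of one gate in such a regression suite: compile each generated module and count it as a pass only on a clean exit. It assumes `iverilog` is on PATH purely for illustration; any open-source or vendor compiler slots in the same way, and the pure classifier can be exercised without any HDL toolchain installed.

```python
import subprocess

def compile_check(src_path: str) -> bool:
    """Compile a generated Verilog file as a correctness gate
    (assumes an `iverilog` binary is available; illustrative only)."""
    result = subprocess.run(
        ["iverilog", "-o", "/dev/null", src_path],
        capture_output=True, text=True,
    )
    return classify(result.returncode, result.stderr)

def classify(returncode: int, stderr: str) -> bool:
    """A module counts as a regression pass only if the compiler exits
    cleanly and emits no error lines on stderr."""
    return returncode == 0 and "error" not in stderr.lower()

# The classifier alone is trivially testable:
print(classify(0, ""))                       # True
print(classify(1, "syntax error near ..."))  # False
```

Aggregating these booleans per prompt feeds directly into the pass-rate and pass@k metrics tracked alongside the model cards.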

Cross-Functional Collaboration

  • Work with hardware design teams, CAD/EDA, Security, and Legal to:

    • Prepare/anonymize datasets

    • Define acceptance gates

    • Meet licensing, compliance, and security requirements

Productization

  • Integrate models into engineering workflows: IDE plugins, CI bots, code review assistants, retrieval over internal HDL repos/specs, and safe function-calling.

Mentorship

  • Develop team capabilities in LLM training, reproducibility, secure pipelines, and research literacy.

Minimum Qualifications

  • 10+ years total engineering experience; 5+ years in ML/AI or large-scale distributed systems; 3+ years with transformers/LLMs.

  • Proven track record of shipping LLM-powered features and leading cross-functional technical initiatives at Staff level.

  • Deep, hands-on experience with:

    • PyTorch, Hugging Face Transformers/PEFT/TRL

    • Distributed training (DeepSpeed/FSDP)

    • LoRA/QLoRA, grammar-guided decoding

  • Strong AWS expertise:

    • Bedrock (model customization, Guardrails, Knowledge Bases, VPC endpoints)

    • SageMaker (Training/Inference/Pipelines)

    • S3, EC2/EKS/ECR, IAM, VPC, KMS, CloudWatch/CloudTrail, Step Functions, Secrets Manager

  • Strong Python engineering fundamentals (testing, CI/CD, observability, performance tuning).

  • Excellent technical communication and ability to set vision across teams.

Preferred Qualifications

  • Familiarity with Verilog/SystemVerilog/RTL workflows (lint, simulation, synthesis, timing closure, test benches).

  • Experience with static-analysis/AST-aware tokenization and grammar-constrained decoding.

  • RAG over code/spec repos; tool-use/function-calling for code transformation.

  • Inference optimization (TensorRT-LLM, KV-cache tuning, speculative decoding).

  • Experience with enterprise model governance and security frameworks (SOC2/ISO 27001/NIST).

  • Background in data anonymization, DLP scanning, and code de-identification.

What success looks like

90 Days

  • Stand up HDL-aware eval harness with compile/simulate checks.

  • Establish secure AWS training & inference environments (VPC-only, KMS encryption, no public egress).

  • Deliver initial fine-tuned model with measurable performance gains.

180 Days

  • Expand training coverage using Bedrock + SageMaker/EKS.

  • Add constrained decoding and retrieval over design specs.

  • Productionize inference with SLOs and rollout to pilot teams.

12 Months

  • Reduce RTL review/iteration cycles using measurable metrics: lint-clean time, defect reductions, suggestion acceptance rates.

  • Establish a stable MLOps pathway for continuous improvements.

Security & Privacy by Design

  • All sensitive data remains within private AWS VPCs with IAM-controlled access and CloudTrail auditing.

  • Bedrock access via VPC PrivateLink endpoints only.

  • Strict data minimization, tagging, retention, reproducibility, and DLP scanning.

  • Model cards, lineage, and evaluation artifacts for each release.

Tech Stack

Modeling: PyTorch, HF Transformers/PEFT/TRL, DeepSpeed/FSDP, vLLM, TensorRT-LLM
AWS/MLOps: Bedrock, SageMaker, ECR, EKS/KServe/Triton, MLflow/W&B, Step Functions
Platform/Security: S3 + KMS, IAM, VPC/PrivateLink, CloudWatch/CloudTrail, Secrets Manager
Bonus: HDL toolchains, vector stores (pgvector/OpenSearch), GitHub/GitLab CI

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.