You will join a globally distributed engineering organization of fewer than forty people as the tech lead and people manager for a scrum team. You'll be accountable for architecture and execution, write production code, and partner across Product, Security, and Platform/Reliability to ensure AI features are trustworthy in the real world (evaluation/monitoring, tenant isolation, permissioned tool use, cost controls, and incident-ready operations).
This is a hands-on technical leadership role. Expect to spend significant time designing and shipping production code (Python on AWS). If you are primarily a people manager and are not currently hands-on, this role will not be a fit.
What You’ll Do
Build & Run the AI Platform Layer (Hands-On)
· Lead architecture and delivery of our AI platform services in Python 3.12+ using proven service patterns & platforms (FastAPI, Uvicorn, Pydantic, SQLModel/SQLAlchemy) and production-grade API behavior.
· Own AWS runtime and deployment patterns for the platform: ECS Fargate (API + MCP services), Lambda (doc processing + knowledge ingestion), and event-driven integration via S3 Events and EventBridge.
· Establish “paved road” standards so teams ship safely: service templates, PR/review discipline, CI/CD and environment promotion using Terraform + GitHub Actions (OIDC to AWS), and Docker build practices (including multi-arch where required).
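To illustrate the "paved road" validation discipline the service templates standardize, here is a minimal, hypothetical sketch. The production stack uses Pydantic/SQLModel; this stands-alone version uses only the standard library, and the model and field names (ClaimRequest, tenant_id, amount_cents) are illustrative assumptions, not the actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the validation discipline a paved-road
# service template bakes in: requests are validated at the edge and
# fail fast with clear errors, instead of letting bad data propagate
# into downstream services. Pydantic provides this in the real stack.

@dataclass(frozen=True)
class ClaimRequest:
    tenant_id: str    # every request is scoped to a tenant
    claim_id: str
    amount_cents: int

    def __post_init__(self) -> None:
        if not self.tenant_id:
            raise ValueError("tenant_id is required")
        if not self.claim_id:
            raise ValueError("claim_id is required")
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")
```

The value of templating this is consistency: every service rejects malformed input the same way, so reviewers and on-call engineers see one pattern everywhere.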
Document Intelligence Pipelines (Warranty Docs → Structured Data)
· Own the end-to-end document processing pipeline (Amazon Textract, Claude Vision).
· Improve extraction quality using deterministic parsing/normalization and custom extractors (e.g., VIN, dates, currency, codes), with strong validation, traceability, and clear failure modes.
· Engineer for reliability and reprocessing: idempotency, bounded retries/timeouts, durable error handling, and controlled replay of failed/changed documents.
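The reliability properties above (idempotency, bounded retries, controlled replay) can be sketched in a few lines. This is an assumption-laden illustration, not the pipeline's real code: the in-memory ledger stands in for a durable store, and process_once/extract are hypothetical names.

```python
import time
from typing import Callable

# Sketch: an idempotency ledger so a replayed document is processed
# exactly once, plus bounded retries with exponential backoff so
# transient failures never become unbounded work.
_processed: dict[str, str] = {}  # doc_id -> result (stand-in for a durable store)

def process_once(doc_id: str, extract: Callable[[str], str],
                 max_attempts: int = 3, base_delay: float = 0.01) -> str:
    if doc_id in _processed:
        return _processed[doc_id]        # idempotent replay: return prior result
    last_error: Exception | None = None
    for attempt in range(max_attempts):  # bounded, not infinite, retries
        try:
            result = extract(doc_id)
            _processed[doc_id] = result
            return result
        except Exception as exc:         # durable error handling would record this
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"{doc_id} failed after {max_attempts} attempts") from last_error
```

Controlled replay of changed documents then reduces to deliberately evicting a document's ledger entry before reprocessing it.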
Retrieval & Knowledge Base Engineering (RAG That’s Measurable)
· Own the full lifecycle of Amazon Bedrock Knowledge Bases (multiple KBs such as policies/TSBs/procedures/codes): ingestion strategy, change control, and safe promotion across environments.
· Build evaluation and regression testing for retrieval quality (golden sets, automated checks, drift detection) and enforce quality gates so KB changes don’t silently degrade outcomes.
· Implement cost-aware, AWS-native retrieval using Bedrock RetrieveAndGenerate with vector storage in S3 Vectors, and track unit economics (latency and cost per workflow/document/claim).
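One way to make the golden-set quality gate above concrete is a recall@k check run before any KB promotion. This is a hedged sketch under assumptions: the retrieve() callable stands in for the real Bedrock retrieval integration, and the 0.9 threshold is illustrative.

```python
from typing import Callable

# Sketch of a golden-set regression gate for retrieval quality.
# Each golden query maps to the set of KB document ids a correct
# retrieval must surface; recall@k below the threshold blocks
# promotion so KB changes can't silently degrade outcomes.

def recall_at_k(golden: dict[str, set[str]],
                retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    hits, total = 0, 0
    for query, relevant in golden.items():
        top_k = set(retrieve(query)[:k])   # ids actually returned in the top k
        hits += len(top_k & relevant)
        total += len(relevant)
    return hits / total if total else 1.0

def quality_gate(score: float, threshold: float = 0.9) -> bool:
    # Promotion across environments proceeds only when the gate passes.
    return score >= threshold
```

Running this in CI against each environment's KB, and logging the score alongside latency and cost per query, is what turns "RAG that's measurable" into an enforced property rather than an aspiration.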
Agent Workflows & Guardrails (Production, Not Demos)
· Deliver agent orchestration on Bedrock AgentCore Runtime using LangGraph state machines (checkpointing, interrupts, human-in-the-loop steps) with predictable behavior and well-defined failure handling.
· Integrate tools/connectors via the MCP SDK (e.g., DMS connector, VIN decoder, OEM portal tools) with permissioned access, auditable tool calls, and strict boundaries.
· Standardize operational guardrails: CloudWatch logs/metrics (structured JSON logging), security-by-default (Cognito OIDC/PKCE, WAF, Secrets Manager, least-privilege IAM), and runtime discipline for ARM64 AgentCore containers (repeatable builds, including QEMU multi-arch in CI when needed).
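The structured JSON logging standard noted above can be sketched with the standard library alone. The context fields (tenant_id, tool_call) are hypothetical examples of the kind of per-request metadata worth attaching; the real services would define their own.

```python
import json
import logging

# Sketch: a formatter that emits every log line as one JSON object,
# so CloudWatch Logs Insights can filter on level, logger, message,
# and structured context fields passed via logging's `extra=`.

class JsonFormatter(logging.Formatter):
    CONTEXT_FIELDS = ("tenant_id", "tool_call")  # illustrative field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

Attached to a stream handler, this makes auditable tool calls a query away (e.g., filter on `tool_call`) rather than a grep through free-form text.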
The Profile We’re Looking For
You bring strong technical depth and pragmatic leadership, and you can mentor and develop talent. You are comfortable staying hands-on, and you can scale standards, systems, and a team over time — balancing rapid iteration with production discipline and cost awareness.
· 8–12+ years building and operating backend/platform systems in B2B SaaS.
· Proven hands-on technical leadership of a small team (4–6 engineers); comfortable being accountable for architecture and delivery.
· Mastery of Python in production: FastAPI/services, async patterns, workflow orchestration, test discipline, and observability.
· Strong AWS depth (Lambda, ECS/Fargate, S3, IAM, RDS/Postgres, CloudWatch/EventBridge) plus IaC (Terraform preferred).
· Direct experience shipping AI-enabled systems on AWS (Bedrock/RAG, document intelligence such as Textract or multimodal extraction, and evaluation/quality monitoring).
· Experience building production agent workflows with guardrails (permissions, auditability, cost controls, failure modes).
Education
Bachelor’s degree in Computer Science, Engineering, or a related field — or equivalent professional experience delivering and operating enterprise-grade software platforms. An advanced degree is a plus but not required.