Job Title: Senior AWS AgentCore Platform Engineer
Location: Reading, PA (Hybrid, 2-3 days a week in the office)
Job Type: Full-time
Interview process: Team Interview
Job Description
We are looking for a highly technical Lead Platform Engineer to architect the observability, cost governance, and security framework for our enterprise AI agent ecosystem. You will be responsible for ensuring that our agentic workflows, built on AWS Bedrock, AgentCore, and MCP servers, are scalable, observable, and cost-efficient.
The ideal candidate bridges the gap between traditional DevOps and the emerging world of LLMOps, with a deep focus on distributed tracing for non-deterministic AI workloads.
Requirements
Experience: 8+ years in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
Cloud Expertise: Deep proficiency in AWS (IAM, CloudWatch, Bedrock, Lambda).
Observability Tools: Proven experience with Dynatrace, Jaeger, or Honeycomb, and distributed tracing standards.
AI/LLM Interest: Familiarity with the LLM lifecycle, including prompt execution, token usage, and frameworks like LangChain or AgentCore.
Automation: Advanced experience with Terraform and CI/CD pipeline design.
Collaboration: Experience working in an Agile environment with integrated tools like Microsoft Teams and Confluence.
Job Responsibilities
- Observability
  - Assess CloudWatch, X-Ray, Bedrock logging, and AgentCore traces against agentic workflow requirements; produce a gap analysis and set up observability in Dynatrace
  - Design a post-deployment validation pipeline for agents and MCP servers (deployment health and tool registration checks)
  - Implement distributed tracing and structured logging for LLM decisions, tool selections, sub-agent calls, and MCP interactions
  - Evaluate LangFuse / LiteLLM proxy vs. AWS-native tooling; deliver a target-state observability architecture recommendation
- Cost Tracking & TCO
  - Extend the tagging taxonomy to cover agent runtimes, MCP servers, vector DBs, and Bedrock token consumption per namespace
  - Design a cost visibility model: aggregate agent, MCP, vector DB, and Bedrock token costs per team/department
  - Build CloudWatch (or equivalent) dashboards for per-team spend; configure AWS Budgets with alerting thresholds
  - Automate cost reports delivered via email / Microsoft Teams; implement anomaly detection rules
- Monitoring & Alerting
  - Define P1-P4 alerting rules: deployment failures, runtime errors, tool invocation failures, MCP connectivity issues
  - Integrate alert notifications with Microsoft Teams channels and email; route by resource ownership tags
  - Author runbooks linked to every alert; publish them in Confluence for developer self-service resolution
  - Evaluate AWS-native vs. third-party monitoring stacks; deliver a recommendation aligned to the observability architecture
- Security & Access Control
  - Assess the current IAM + tagging approach for multi-team isolation; identify scalability gaps and risks
  - Evaluate the Cedar policy engine (AgentCore) for fine-grained tool access control; document enterprise-scale gaps
  - Design a scalable ABAC-based identity model for multi-team isolation without IAM policy sprawl; deliver Terraform modules