Hi,
Please review the job description below and let me know if you are interested and available. Please also share your availability for a quick call.
Role: Senior AWS AgentCore Platform Engineer
Position Type: Contract-to-hire after an initial 6 months
Location: Reading, PA or Exton, PA (Hybrid, 2-3 days a week in the office)
Job Description:
1. Observability & Distributed Tracing
Gap Analysis: Assess AWS CloudWatch, X-Ray, Bedrock logging, and AgentCore traces against agentic workflow requirements; produce a comprehensive gap analysis and lead the setup of observability within Dynatrace.
Validation Pipelines: Design and implement post-deployment validation pipelines for agents and Model Context Protocol (MCP) servers, ensuring deployment health and successful tool registration.
Tracing & Logging: Implement distributed tracing and structured logging to capture LLM decision logic, tool selections, sub-agent calls, and MCP interactions.
Architecture Strategy: Evaluate LangFuse and LiteLLM proxies against AWS-native solutions; deliver a target-state observability architecture recommendation.
2. Cost Tracking & TCO (Total Cost of Ownership)
Taxonomy Expansion: Extend tagging taxonomy to capture costs across agent runtimes, MCP servers, vector databases, and Bedrock token consumption per namespace.
Cost Modeling: Design a granular cost visibility model to aggregate expenses for agents, MCPs, and LLM tokens by team and department.
Dashboards & Alerting: Build CloudWatch (or equivalent) dashboards for per-team spending; configure AWS Budgets with proactive alerting thresholds.
Automation: Automate cost reporting via email and Microsoft Teams, incorporating anomaly detection rules to identify spend spikes.
3. Monitoring & Incident Management
Alerting Framework: Define and implement P1-P4 alerting rules covering deployment failures, runtime errors, tool invocation failures, and MCP connectivity issues.
Incident Integration: Integrate alert notifications with Microsoft Teams and email, utilizing resource ownership tags for intelligent routing.
Operational Excellence: Author detailed runbooks for every alert; publish and maintain these in Confluence to facilitate developer self-service resolution.
Stack Evaluation: Compare AWS-native vs. third-party monitoring stacks to deliver a long-term recommendation aligned with the broader observability architecture.
4. Security & Governance
Risk Assessment: Evaluate current IAM and tagging strategies for multi-team isolation; identify scalability gaps and potential security risks.
Policy Engines: Assess the Cedar policy engine (AgentCore) for fine-grained tool access control and document gaps for enterprise-scale deployment.
Identity Architecture: Design a scalable Attribute-Based Access Control (ABAC) identity model to ensure multi-team isolation without IAM policy sprawl; deliver production-ready Terraform modules.
Isaac Rajiv
Kutir Corporation
Ph: