Costco IT is responsible for the technical future of Costco Wholesale, the third largest retailer in the world with wholesale operations in fourteen countries. Despite our size and explosive international expansion, we continue to provide a family, employee-centric atmosphere in which our employees thrive and succeed.
This is an environment unlike anything in the high-tech world, and the secret of Costco's success is its culture. The value Costco puts on its employees is well documented in articles from a variety of publishers including Bloomberg and Forbes. Our employees and our members come FIRST. Costco is well known for its generosity and community service and has won many awards for its philanthropy. The company joins with its employees to take an active role in volunteering by sponsoring many opportunities to help others.
Come join the Costco Wholesale IT family. Costco IT is a dynamic, fast-paced environment, working through exciting transformation efforts. We are building the next generation retail environment where you will be surrounded by dedicated and highly professional employees.
The Principal AI Engineer is the lead Architect and hands-on builder of our unified AI Platform as a Service. This role is responsible for transforming raw foundation model capabilities into a scalable, multi-tenant reasoning stack that empowers the entire enterprise to build, deploy, and manage semantic discovery, conversational intelligence, and autonomous agents. This role will balance 40% hands-on systems development with 60% platform strategy, personally coding the core orchestration engines, standardized capability servers, and universal trust guardrails. The mission is to provide a central 'AI Operating System' for the company, ensuring that specialized agents across different business units can communicate via inter-agent protocols, access grounded knowledge layers securely, and execute autonomous tasks within a governed, high-performance agent runtime environment.
If you want to be a part of one of the worldwide BEST companies "to work for", simply apply and let your career be reimagined.
ROLE
- Develops and implements an industry-leading, self-service AI platform. This platform will offer standardized blueprints for engineering teams to utilize, including "macro-agents" and "micro-tools."
- Develops and articulates the long-term, multi-year technical roadmap for the AI Platform, ensuring its capabilities are strategically aligned with the overarching business objectives.
- Develops and implements complex state graphs to manage edge cases and enable self-correction in autonomous planning.
- Leads the architecture and hands-on development of remote MCP servers and implements custom function calling to securely connect agents with sensitive enterprise data.
- Defines and implements the communication standards for agent-to-agent interactions to facilitate autonomous discovery and task hand-offs between agents developed by various business units.
- Ensures the agent identity layer is architected for granular permissioning and non-repudiation, specifically regarding every autonomous system action.
- Develops a unified knowledge layer for the platform, leveraging a semantic retrieval engine and multimodal grounding. This layer will serve as the single source of truth for all connected agents, providing "truth-as-a-service."
- Develops and implements a global memory bank architecture, leveraging a semantic retrieval engine and graph databases. This system will be essential for preserving context and capturing "institutional knowledge" from interactions over time.
- Develops the platform's trust layer by automating rapid evaluation pipelines. These pipelines will measure key metrics of success, cost, and safety for agentic behavior across all tenants.
- Ensures the agent runtime environment lifecycle is managed to guarantee high availability, session persistence, and global scalability for the company's digital workforce.
- Performs in-depth, rigorous code reviews with a specific focus on identifying and mitigating the unique failure modes inherent in agentic systems, such as state bloat, tool call hallucinations, and infinite loops.
- Implements advanced techniques, such as prompt caching and model routing, to optimize inference costs and latency.
- Serves as the Engineer's Engineer, providing mentorship to Senior and Staff Engineers in advanced techniques such as prompt engineering, model distillation, and agentic evaluation.
- Influences the roadmap for AI services by collaborating with Cloud and Infrastructure product teams to address enterprise-specific needs.
- Ensures the longevity, scalability, and quality of our systems through continuous improvement, comprehensive documentation, meticulous profiling, and significant performance enhancements.
REQUIRED
- 10+ years in Software Engineering, with at least 4 years in a Principal or Architect-level role.
- 2+ years specifically architecting LLM-based systems, with a proven track record of moving agentic projects into production at scale.
- 5+ years of experience developing within an agile methodology.
- Certified Google Cloud Professional Cloud Architect.
- Experience leading technical workstreams, translating business problems into AI-native architectures.
- Expertise in asynchronous orchestration frameworks (e.g., Python) and proficiency in statically typed systems (e.g., Java, Go, or Rust) is required to engineer high-concurrency agentic middleware using stateful graph orchestration (e.g., ADK or LangGraph) to power robust, autonomous reasoning engines.
- Expertise in cloud-native CLI tools and Infrastructure-as-Code frameworks for automating agentic infrastructure deployment. Proven track record of deploying and scaling containerized autonomous workloads using enterprise-grade container orchestration and serverless execution platforms.
- Experience managing high-scale distributed architecture, vector databases, graph databases, and structured data pipelines.
- Deep knowledge of stateful orchestration frameworks and multi-agent design patterns, with the architectural expertise to engineer custom reasoning engines and proprietary orchestration logic when off-the-shelf solutions reach their scaling or safety limits.
- Practitioner understanding of Chain-of-Thought, ReAct, Tree-of-Thoughts, and Self-Reflection architectures.
- Experience managing systems with millions of daily requests or handling multi-petabyte datasets.
- Proficiency in architecting semantic retrieval layers, attribute-aware discovery, and stateful persistence systems to provide high-fidelity long-term context for autonomous agents.
- Deep understanding of MCP, A2A, REST/gRPC APIs, OAuth 2.0 security, and function calling mechanics.
- Familiarity with design patterns and microservices-based architecture patterns.
- Mastery of distributed traceability, neural telemetry, and cognitive debugging suites to audit and visualize logic trajectories across complex inter-agent handoffs.
- Understanding of global AI regulations (e.g., EU AI Act) and how to translate them into technical guardrails.
- Strong verbal and written communication skills, with the ability to communicate with both technical and business audiences.
- Ability to work under pressure in a crisis with a strong sense of urgency.
- Responsible, conscientious, organized, self-motivated, and able to work with limited supervision.
- Detail-oriented, with strong problem-solving skills and the ability to analyze potential future issues.
- Able to support off-hours work as required, including weekends, holidays, and 24/7 on-call responsibilities on a rotational basis.
RECOMMENDED
- Bachelor's or Master's in Computer Science or Artificial Intelligence.
- PhD in AI, Distributed Systems, or Cognitive Science.
- Certified Google Cloud Professional Machine Learning Engineer or any agentic AI specialty certification focusing on multi-agent systems and autonomous reasoning.
- 3+ years of experience with distributed cache technologies.
- Experience with deploying and configuring Cloud Platform resources.
- Experience working in a retail ecommerce environment.
- Proficient in Google Workspace applications, including Sheets, Docs, Slides, and Gmail.
Required Documents
Cover Letter
Resume
California applicants, please click here to review the Costco Applicant Privacy Notice.
Pay Range: $160,000 - $230,000, Bonus and Restricted Stock Unit (RSU) eligible
We offer a comprehensive package of benefits including paid time off, health benefits - medical/dental/vision/hearing aid/pharmacy/behavioral health/employee assistance, health care reimbursement account, dependent care assistance plan, short-term disability and long-term disability insurance, AD&D insurance, life insurance, 401(k), stock purchase plan to eligible employees.
Costco is committed to a diverse and inclusive workplace. Costco is an equal opportunity employer. Qualified applicants will receive consideration for employment without regard to race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or any other legally protected status. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request to
If hired, you will be required to provide proof of authorization to work in the United States. Applicants and employees for this position will not be sponsored for work authorization, including, but not limited to, H-1B visas.