Infrastructure Development Engineer

Overview

Remote
Hybrid
$60 - $70 per hour
Contract - W2

Job Details



W2 Contract


Salary Range: $124,800 - $145,600 per year


Location: Austin, TX - Remote Role


Duties and Responsibilities:


Platform Reliability & Operations



  • Own end-to-end reliability for our AI Agent Platform across all environments (Dev, Staging, Production).

  • Maintain and optimize EKS clusters, databases, and LangGraph/LangSmith environments.

  • Implement and manage proactive monitoring, alerting, and tracing systems across platform components.

  • Drive root-cause analysis (RCA) and implement incident prevention automations.


Observability & Tooling



  • Deliver a unified observability strategy across services using logging, metrics, and distributed tracing.

  • Lead the migration from DataDog to Mosaic for dashboards and alerting.

  • Develop self-healing automation and smoke tests to validate post-deployment system health.

  • Ensure visibility into latency, availability, and error budgets (SLOs/SLIs).


Support & Incident Management



  • Own the AI platform support channel: triage issues, answer platform questions, and guide onboarding.

  • Provide L1/L2 triage during business hours; coordinate after-hours escalation with the core team.

  • Establish structured runbooks, escalation policies, and post-incident review processes.


Deployment & Environment Consistency



  • Standardize infrastructure and CI/CD practices across environments.

  • Partner with platform and ML engineers to streamline release pipelines, security policies, and service configurations.

  • Ensure consistent rollout of new features and agent services with minimal downtime.


Automation & Continuous Improvement



  • Develop Python or Go utilities to automate deployment, monitoring, and maintenance tasks.

  • Build tooling for alert correlation, system diagnostics, and capacity forecasting.

  • Continuously evaluate new tools and frameworks to improve operational efficiency.


Requirements and Qualifications:



  • 4+ years of experience as an SRE, DevOps Engineer, or Platform Engineer in cloud environments.

  • Deep expertise with Kubernetes (EKS/GKE), CI/CD pipelines, and infrastructure automation.

  • Proficiency with observability tools such as Grafana, Prometheus, DataDog, Splunk, or OpenTelemetry.

  • Experience in at least one modern programming language (Python, Go, or Rust).

  • Strong understanding of incident management, SLAs/SLOs, and post-mortem practices.

  • Excellent communication and collaboration skills; ability to work across platform, AI, and data teams.


Preferred Qualifications:



  • Experience operating AI/ML workloads (LangGraph, LangChain, or distributed compute systems like Ray).

  • Familiarity with LLM-based infrastructure and AI observability tooling.

  • Prior experience in managed service transitions or vendor-to-product operating model shifts.

  • Exposure to Azure or AWS cloud ecosystems, Terraform, and GitOps workflows (ArgoCD/Flux).



Bayside Solutions, Inc. is not able to sponsor any candidates at this time. Additionally, candidates for this position must qualify as W2 candidates.


Bayside Solutions, Inc. may collect your personal information during the position application process. Please refer to Bayside Solutions, Inc.'s CCPA Privacy Policy.


About Bayside Solutions