Site Reliability Engineer (SRE)

Overview

Remote
$DOE
Full Time
Contract - W2

Skills

AWS

Job Details

SRE Key Responsibilities

Design and manage multi-account AWS infrastructure (VPC, Route Tables, EC2, ECS, EKS 1.33, RDS, DynamoDB, ElastiCache Valkey, S3, Transit Gateway, Resource Access Manager, Lambda, CloudFormation, AWS Backup)

Configure load balancing and traffic management (ELB, NLB, Target Groups with gRPC, Route53, Global Accelerator, CloudFront)

Implement security and compliance controls (IAM, IAM Identity Center, SCP, GuardDuty, WAF, CloudTrail, ACM, Secrets Manager, Okta integration)

Manage Cloudflare infrastructure (Zero Trust, Argo Smart Routing, DNS, Workers, Load Balancer, Bot Management, WAF, Rules & Policies, Cache)

Manage S3 with Access Policies, Lifecycle Policies, S3 Storage Lens optimization, and cross-region replication

Operate messaging and notification services (SNS, SES, SQS)

Architect and manage multi-cluster EKS environments with HA and cross-region DR scenarios using Istio service mesh, Network Policies, Karpenter, HPA, KEDA, Argo CD, Argo Rollouts

Implement and maintain Argo CD for multi-cluster application management with HA and cross-region DR configurations

Configure Argo CD ApplicationSets for managing applications across multiple EKS clusters

Implement ECR with global cross-region replication for container image distribution and disaster recovery

Implement Aurora Global Database for cross-region DR, manage Aurora RDS (MySQL and PostgreSQL) and standalone MySQL/PostgreSQL instances for development

Design and maintain RDS cross-region replication, automated backups, failover strategies, and upgrade procedures

Establish and maintain DevOps practices including change management, release management and deployment strategies

Build resilient CI/CD pipelines with cross-region artifact replication, automated testing, and failover capabilities

Develop and maintain GitHub Actions shared internal workflows and reusable actions for standardized deployments

Implement change approval workflows, deployment gates, and release coordination processes

Implement Crossplane for automated feature environment creation, upgrades, and AWS resource provisioning

Deploy applications using Helm, Kustomize with overlay patches, Jsonnet, and Crossplane for infrastructure orchestration

Maintain platform operators (External DNS, External Secrets, Reloader) and custom CRDs

Build comprehensive observability stack and dashboards (Grafana, Thanos/Prometheus, Loki, Alertmanager, OpenTelemetry Alloy/Tempo/Beyla/Pyroscope)

Configure exporters (Blackbox, MySQL, Redis, YACE CloudWatch, Cloudflare, Node Exporter, Prometheus Pushgateway)

Support data platforms (Kafka/Kafka UI, MinIO, Airflow, JupyterHub, Dask, Superset, Imply, AWS Glue, Athena, QuickSight, Bedrock)

Optimize CI/CD with GitHub Actions, Actions Runner Controller (ARC), runs-on.com, GitHub Rulesets

Manage mobile app delivery pipelines (Unity Build Management, Fastlane, Google Play Developer, Apple Developer/Enterprise, Applivery)

Implement and maintain all infrastructure using Terraform/OpenTofu with Scalr, backporting existing resources into code

Automate operational tasks wherever possible; create comprehensive runbooks for non-automatable procedures

Conduct thorough post-mortem analysis after incidents, documenting learnings and implementing preventive measures

Drive cost optimization initiatives using S3 Storage Lens, CloudWatch metrics, rightsizing recommendations, and resource lifecycle management

Develop automation in Bash, Python, Go, C#/.NET (Unity Game Engine)

Maintain developer experience tooling (Backstage, ClickUp, Miro, shared GitHub Actions/workflows)

Integrate monitoring and alerting (PagerDuty, Cronitor, Wiz, CloudWatch)

Core Expertise:

Multi-account AWS architecture with Transit Gateway, Resource Access Manager, VPC design, and Route Tables

Kubernetes/EKS high availability with cross-region disaster recovery scenarios

Multi-cluster EKS management with service mesh (Istio), autoscaling (Karpenter, KEDA), GitOps (Argo CD)

Argo CD enterprise deployment for multi-cluster application management with HA and cross-region DR

Argo CD ApplicationSets, app-of-apps patterns with Helm, and cluster management strategies

ECR global cross-region replication strategies for container image distribution and DR

Cloudflare enterprise features (Zero Trust, Argo Smart Routing, DNS management, Workers, Load Balancer, Bot Management, Cache optimization, WAF Rules & Other Security Policies)

Aurora Global Database implementation and management for cross-region DR

Aurora RDS (MySQL and PostgreSQL engines) and standalone MySQL/PostgreSQL instance management

RDS cross-region replication, automated failover, disaster recovery, and version upgrade strategies

DevOps best practices including change management, release management, and deployment coordination

Resilient CI/CD pipelines with automated testing, cross-region artifact distribution, and failover

GitHub Actions shared workflows and reusable actions development for internal use

Crossplane for Kubernetes-native infrastructure provisioning, feature environment automation, and upgrade orchestration

Expert-level Terraform/OpenTofu with enterprise policy management (Scalr)

Infrastructure backporting and migration from ClickOps to IaC

Complete observability stack (Prometheus, Grafana, Loki, OpenTelemetry, distributed tracing)

Data pipeline orchestration (Kafka, Airflow) and analytics platforms (Superset, Imply)

GitHub Actions with self-hosted runners (ARC, runs-on.com)

Proficiency in Python, Bash, Go, and C#/.NET for automation development

Security implementations (IAM, SCP, Okta, WAF, GuardDuty, Wiz)

Mobile CI/CD (Unity, Fastlane, Apple/Google distribution, and Applivery during development)

Disaster recovery planning, testing, and automation (AWS Backup, cross-region strategies)

AI/ML infrastructure experience (AWS Bedrock)

Cost optimization strategies and QuickSight for AWS cost review

Post-mortem facilitation and blameless incident analysis

Runbook creation and maintenance for operational procedures

Technical Skills:

Container orchestration with advanced networking and progressive delivery

Infrastructure as Code and GitOps methodologies with automation-first mindset

Change management workflows, approval gates, and release orchestration

CI/CD pipeline design with automated testing, security scanning, and deployment strategies

Incident response, on-call management, post-mortem analysis, DR execution

Crossplane composition design and custom resource definitions

Custom CRD and operator development in Kubernetes

Event-driven architecture (Lambda, SQS, SNS, SES)

Real-time analytics and BI platforms

Developer portal management (Backstage)

Multi-region failover automation and orchestration

Cost analysis and optimization using native AWS tools

Automation of repetitive operational tasks

Technical documentation and runbook authoring

Database performance tuning and optimization (Aurora, MySQL, PostgreSQL)

Argo CD backup, restore, and disaster recovery procedures

Cloudflare Workers development & deployment using Wrangler

Soft Skills: Strong troubleshooting, cross-functional communication, self-directed, documentation-focused, cost-conscious, continuous improvement mindset


ACI (Advanced Computing International) is a global technology services, products, and solutions company focused on designing and delivering next-generation applications and digital experiences for businesses and consumers. We specialize in Big Data & Analytics, Digital Transformation, IT Service Management, Cognitive Solutions, Artificial Intelligence, IoT & Future Networks, DevOps, Enterprise Applications, Managed Infrastructure Services, and Industry-Specific Solutions.

Leveraging the insights gained from working on innovative solutions and disruptive technologies, ACI develops solutions that enhance business performance, accelerate product and application time-to-market, harmonize consumer experiences, and streamline business operations. ACI works with clients across different business sectors: Financial Services, Healthcare, Manufacturing, Hi-Tech, Media, Utilities, Public Sector, Retail, Telecom, E-commerce & Logistics, and Higher Education. ACI's core DNA is built on innovation and co-existence to build a collaborative ecosystem where companies and consumers win.
