Role: Platform Ops
Location: Berkeley Heights, NJ (Onsite, 5 days a week)
Job Description:
Candidates should have hands-on experience with AWS, Kubernetes, Terraform, and core cloud networking, as well as familiarity with modern AI-enabled engineering tools. These capabilities are critical to building and efficiently managing the platform.
The main responsibilities for this role include:
- Designing, building, and operating scalable platform infrastructure on AWS.
- Managing and supporting Amazon EKS clusters across production and non-production environments.
- Administering and optimizing AWS services such as DynamoDB, Amazon RDS, CloudWatch, Route 53, AWS Secrets Manager, and AWS Certificate Manager.
- Configuring Kubernetes components, including ingress controllers, namespaces, networking, and workload deployments.
- Implementing and maintaining ingress patterns using Kubernetes ingress controllers and the AWS Application Load Balancer (ALB).
- Managing DNS and hosted zones, and routing traffic with Route 53.
- Handling secrets, credentials, and secure application configuration using AWS Secrets Manager.
- Provisioning, renewing, and managing TLS/SSL certificates via AWS Certificate Manager for secure application and ingress communication.
- Developing, maintaining, and improving Infrastructure as Code with Terraform and Terraform Enterprise.
- Supporting AWS networking components, including VPCs, subnets, route tables, internet gateways, NAT gateways, and security groups.
- Configuring and troubleshooting network connectivity between applications, clusters, databases, and external endpoints using tools such as ping, curl, telnet, traceroute, nslookup, and dig, along with port-level validation.
- Reviewing and maintaining network access controls, security group rules, and routing paths to ensure secure and reliable communication.
- Troubleshooting platform, infrastructure, DNS, certificate, secret management, load balancing, and network-related issues across environments.
- Monitoring platform health, creating dashboards, defining alerts, and improving observability using CloudWatch.
- Partnering with development and security teams to enhance CI/CD, platform reliability, and compliance.
- Supporting incident response, root cause analysis, and operational readiness activities.
- Driving automation for platform operations, environment creation, upgrades, patching, and recovery procedures.
- Leveraging AI-assisted engineering tools for scripting, automation, troubleshooting, documentation, and operational efficiency.
- Evaluating opportunities for AI-powered tooling to improve platform support workflows, knowledge management, and incident response.
- Maintaining technical documentation, architecture diagrams, runbooks, and operational standards.