Job Title: Senior Java SRE
Location: McLean, VA (Hybrid/Initial Remote options depending on end-client)
Duration: 12+ Months
Job Description:
15+ years of experience with related tools and technologies.
Experience with the following skill sets:
o Java (JVM internals, tuning, microservices)
o AWS Cloud (EKS, EC2, IAM, VPC, RDS, CloudWatch)
o Kubernetes (CKA/CKS-level depth)
o Docker, Terraform
o CI/CD: GitLab CI/CD, Jenkins
o Streaming: Kafka, ksqlDB, Spark Streaming
o Service Mesh: Istio, Anthos Service Mesh
o Monitoring: Prometheus, Datadog, Splunk, Kiali
o OS & Scripting: Linux/Unix, Bash
o Programming: Python or Go
o Virtualization: VMware
o Networking & Performance: Nginx Controller, Seesaw, eBPF
o Experience supporting core banking, payment gateways, or trading platforms
o Exposure to high-frequency transaction systems
o Knowledge of regulatory audits and compliance controls
o Experience with zero-downtime deployments and disaster recovery strategies
The following certifications are mandatory:
o AWS Certified Solutions Architect - Professional or AWS Certified DevOps Engineer - Professional
o Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
Responsibilities:
Design, build, and operate highly available, fault-tolerant systems supporting core banking, payments, and trading platforms.
Lead SRE practices including SLIs, SLOs, error budgets, and reliability-driven engineering decisions.
Provide L3/L4 incident response, root cause analysis (RCA), and post-incident remediation for production systems.
Support and optimize Java-based microservices running on Kubernetes (EKS).
Implement and manage AWS-native services (EC2, EKS, RDS, DynamoDB, S3, IAM, CloudWatch).
Develop automation using Terraform for infrastructure provisioning and policy enforcement.
Manage Kubernetes networking, storage, and service mesh integrations including Istio / Anthos Service Mesh.
Implement advanced Kubernetes storage solutions using Portworx.
Architect and maintain enterprise-grade CI/CD pipelines using GitLab CI/CD, Jenkins, and cloud-native tooling.
Automate manual operational tasks using Python, Go, Bash, and infrastructure-as-code patterns.
Implement monitoring, logging, and alerting using Prometheus, Datadog, Splunk, Kiali, and custom dashboards.
Utilize eBPF for deep kernel-level observability and performance tuning.
Support real-time data platforms using Kafka, ksqlDB, Kafka Streams, and Spark Streaming.
Manage multi-cluster Kubernetes environments, including cluster federation.
Optimize system performance, scalability, and latency under high transaction volumes.
Enforce banking-grade security controls, IAM policies, secrets management, and least-privilege access.
Support environments aligned with SOC2, PCI-DSS, SOX, and internal banking security standards.
Provide 24/7 operational support, including rotational shifts, weekends, and on-call coverage across all U.S. time zones.