Production Engineer II - Java / SQL / Automation (AWS)
Alpharetta, GA (Onsite)
Role Summary
We are seeking a Production Engineer II to support and improve the reliability, performance, and day-to-day operations of critical production systems. This role combines hands-on troubleshooting with automation, strong SQL skills, and operational excellence practices. The ideal candidate is comfortable working in AWS environments, improving observability, reducing toil, and partnering with engineering teams to deliver stable, scalable services.
Key Responsibilities
Provide production support for Java-based services and data-driven applications, including incident triage, root-cause analysis, and remediation.
Build automation to reduce manual operational work (alerts, self-healing actions, runbooks, deployment checks, and routine maintenance).
Write and optimize SQL for troubleshooting, data validation, reconciliation, and performance analysis; partner with database teams as needed.
Improve service reliability through proactive problem management, capacity planning, and performance tuning.
Enhance observability (dashboards, logs, metrics, tracing) and improve alert quality to reduce noise and accelerate resolution.
Support release/change activities: deployment readiness, rollback planning, post-release verification, and incident prevention.
Participate in the on-call rotation and lead efforts to reduce recurring issues and mean time to restore (MTTR).
Document operational procedures and contribute to continuous improvement (postmortems, corrective actions, standards).
Required Qualifications
Bachelor's degree in Computer Science, Engineering, or equivalent experience.
2+ years of experience in production support, site reliability, or operations engineering for enterprise applications.
Strong Java fundamentals and ability to debug application behavior using logs, stack traces, and runtime metrics.
Strong SQL skills (joins, aggregations, query optimization) and experience working with relational databases in production contexts.
Demonstrated experience building automation using scripting (e.g., Python, Bash, PowerShell) and/or CI/CD pipelines.
Familiarity with operational best practices: incident management, problem management, change control, and post-incident reviews.
Ability to work under pressure, communicate clearly during incidents, and collaborate across teams.
Preferred Qualifications
Hands-on experience with AWS services such as CloudWatch, EC2, S3, RDS/Aurora, IAM, Lambda, Systems Manager, EKS/ECS, and/or SNS/SQS.
Experience with Infrastructure as Code (e.g., Terraform, CloudFormation) and configuration management.
Familiarity with API gateways, message queues/streams, and distributed system troubleshooting.
Experience improving operational KPIs (MTTR, change failure rate, availability, incident volume reduction).
Financial services or other regulated-industry experience.
What Success Looks Like
Reduces recurring incidents through automation and preventative fixes.
Improves observability and alerting to shorten time-to-detect and time-to-recover.
Delivers measurable operational excellence outcomes (higher stability, fewer escalations, smoother releases).
Working Model / On-Call
Participation in an on-call rotation is required.
Occasional after-hours support may be needed for high-severity incidents or planned changes.