Job Details
Key Responsibilities:
* Tech Lead - Lead and mentor a team of 15+ engineers for production support and platform stability.
* Spend more than 50% of engineering time on production support and deployment, followed by application development and migration work based on sprint scope.
* Manage pager duty rotations and ensure timely incident resolution.
* Provide Level 4 support (deep technical troubleshooting and fixes).
* Oversee rotational night shifts (approximately once every 2.5 months).
* Ensure compliance with SLAs and operational excellence for critical systems.
* Collaborate with stakeholders for platform strategy and migration planning.
* Drive Run-the-Engine development work and support enhancements.
* Prepare for and lead the platform migration phase in the third year.
* Monitor application performance, batch jobs, and system health across production and lower environments.
* Respond to incidents, alerts, and Sev1/Sev2 outages, and provide real-time support following the Client's Incident Management processes.
* Perform root cause analysis (RCA), create remediation plans, and ensure issues are permanently resolved.
* Support on-call rotations and pager duty responsibilities.
* Collaborate with development, SRE, and infrastructure teams to troubleshoot application, database, and integration issues.
* Execute deployments, configuration changes, and release support using CI/CD pipelines (OnePipeline preferred).
* Create/maintain operational dashboards, runbooks, SOPs, and automation scripts.
* Ensure compliance with the Client's technology and security standards.
Required Skills & Experience:
* Tech Lead - Ability to manage large teams (10-50 members) and complex platforms.
* Java Development and Site Reliability Engineering (SRE) expertise.
* Strong experience in production support and incident management.
* Hands-on experience with pager duty tools and support workflows.
* Excellent problem-solving and communication skills.
* Minimum 2 years of experience in similar roles.
* Strong experience in Unix/Linux, shell scripting, and troubleshooting distributed systems.
* Hands-on experience with AWS (CloudWatch, Lambda, EC2, S3, IAM, RDS, DynamoDB).
* Familiarity with Java-based applications, microservices, APIs, and log analysis (Splunk, CloudWatch Logs).
* Experience with CI/CD tools like Jenkins, OnePipeline, Git, and automated deployment strategies.
* Knowledge of incident management, problem management, and change management processes.
* Strong analytical skills and the ability to quickly diagnose complex issues.
Nice to Have:
* Experience with SQL, NoSQL, Kafka, or messaging systems.
* Knowledge of the Client's environment/tools (OneService, Hygieia, OnePipeline, Splunk dashboards).
* Basic understanding of Kubernetes, Docker, or SRE practices.
* Experience working in financial services or highly regulated environments.