Role Overview
We are seeking an experienced SRE/DevOps Architect to design and implement scalable, reliable, and secure architecture solutions that support business growth and ensure system resilience. The ideal candidate will integrate SRE and DevOps practices, drive automation, and build observability frameworks to enhance system availability and performance.
Key Responsibilities
Design and implement scalable, fault-tolerant architecture solutions that meet business objectives and performance goals.
Collaborate with cross-functional teams to embed SRE and DevOps principles into existing workflows.
Develop and maintain automation frameworks using tools such as Terraform, Docker, and Kubernetes to improve system reliability and reduce manual effort.
Monitor and optimize system performance and reliability using observability tools like Datadog, Prometheus, Grafana, Splunk, and the ELK Stack.
Ensure high availability through proactive monitoring, alerting, and incident response practices.
Implement CI/CD pipelines and manage code repositories using GitHub to streamline development and deployment processes.
Leverage AWS services to build scalable cloud-based infrastructure, ensuring optimal performance and cost efficiency.
Analyze and document system requirements, design patterns, and implementation details for ongoing maintenance and scalability.
Provide technical leadership and mentorship to team members, fostering a culture of knowledge sharing and continuous improvement.
Conduct architecture reviews to identify potential risks, implement mitigation strategies, and ensure adherence to best practices.
Collaborate with stakeholders to align technical strategies with business objectives and customer needs.
Stay current with emerging technologies and industry trends to continuously enhance architecture, automation, and observability capabilities.
Qualifications
Strong experience in Site Reliability Engineering (SRE) and DevOps, with proven success designing and deploying large-scale, cloud-native systems.
Proficiency in Java, AWS, Docker, Kubernetes, Terraform, GitHub, and observability tools such as Datadog, Splunk, Prometheus, Grafana, and the ELK Stack.
Expertise in infrastructure as code, automation, and monitoring-as-code frameworks to ensure system resilience and proactive issue detection.
Strong analytical and troubleshooting skills, capable of resolving complex infrastructure and application issues.
Excellent communication and collaboration skills, with the ability to work effectively across engineering, operations, and product teams.
Bachelor's degree in Computer Science, Information Technology, or a related field.
Preferred Certifications: AWS Certified Solutions Architect or equivalent credentials.
Demonstrated commitment to continuous learning, innovation, and adoption of emerging DevOps and SRE practices.