SRE Lead & Monitoring Consultant
Key Responsibilities
SRE Practice Development
• Assess operational maturity and build SRE transformation roadmap
• Establish SLOs, SLIs, and error budgets for critical services
• Design incident management processes and on-call strategies
• Implement chaos engineering and resilience testing
• Mentor teams on SRE principles and best practices
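As an illustration of the error-budget work mentioned above, the sketch below shows the arithmetic behind an availability SLO. It is a minimal example only: the function names, the 30-day window, and the sample numbers are assumptions made for illustration, not part of any existing system at the company.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window.

    slo_target is a fraction such as 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative value when the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical service with a 99.9% availability target over 30 days.
    print(round(error_budget_minutes(0.999), 1))
    # Hypothetical traffic: 1,000,000 requests, 250 of which failed.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

In practice the request counts would come from the monitoring stack rather than literals, and the remaining budget would feed release and on-call decisions.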
Monitoring & Observability
• Deploy and configure Datadog, Splunk, Grafana, and Prometheus
• Implement metrics collection, log aggregation, and APM
• Build custom dashboards and alerting configurations
• Set up anomaly detection and intelligent alerting
• Configure automated health checks and remediation
• Establish golden signals monitoring (latency, traffic, errors, saturation)
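For reference, the four golden signals named above can be derived from very simple inputs. The sketch below is a hedged illustration: the `Request` record, the `golden_signals` function, and the use of CPU utilization as the saturation measure are all assumptions for the example. A production setup would compute these inside Datadog, Prometheus, or a similar tool rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests: List[Request], window_seconds: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("at least one request is required")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer ceiling division.
    rank = (95 * len(latencies) + 99) // 100
    p95 = latencies[rank - 1]

    # Count server-side failures (5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,  # assumed to be a 0..1 fraction
    }


if __name__ == "__main__":
    # Hypothetical sample: 20 requests in a 10-second window, two of them 5xx.
    sample = [Request(latency_ms=10.0 * (i + 1), status=500 if i < 2 else 200)
              for i in range(20)]
    print(golden_signals(sample, window_seconds=10.0, cpu_utilization=0.6))
```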
Reliability & Compliance
• Conduct reliability reviews and performance optimization
• Design disaster recovery and failover procedures
• Implement security monitoring and audit logging
• Configure fraud detection and transaction monitoring
• Create runbooks and operational documentation
Required Qualifications
Experience:
• 7+ years in Site Reliability Engineering, DevOps, or infrastructure engineering.
• 3+ years in SRE leadership roles.
• Strong expertise in Java, Node.js, Kafka, AWS, and modern AIOps/observability practices.
• Experience implementing proactive monitoring and predictive alerting using AIOps platforms and machine-learning-driven insights.
• 3+ years of hands-on experience with Datadog, Splunk, Grafana, and Prometheus.
• Hands-on experience with Java and Node.js application architectures.
• Previous experience in fintech or regulated industries.
• Proven track record building SRE practices from scratch.
Technical Skills
• Deep understanding of SRE principles, error budgets, and SLO/SLI frameworks.
• Expertise with cloud platforms (AWS, Azure, or Google Cloud Platform).
• Proficiency with Kubernetes, Docker, and infrastructure as code (Terraform, Ansible).
• Strong programming/scripting skills (Python, Go, Bash).
• Experience with incident management and post-mortem culture.
• Knowledge of compliance requirements (SOC 2, PCI-DSS, ISO 27001).
Soft Skills
• Exceptional leadership and mentoring abilities.
• Strong communication and stakeholder management.
• Data-driven decision-making approach.
• Collaborative mindset with ability to drive cultural change.
Preferred Qualifications
• Cloud certifications (AWS, Google Cloud Platform, Azure) or Kubernetes certifications (CKA/CKAD).
• Experience with ELK stack.
• Background in cloud cost optimization.
• Multi-cloud or hybrid cloud experience.
Deliverables
• SRE maturity assessment and transformation roadmap
• Fully configured monitoring stack with Datadog, Splunk, Grafana, and Prometheus
• SLO/SLI definitions and error budgets
• Custom dashboards, alerting, and automated remediation
• Incident management framework and runbooks
• Chaos engineering test suite