Hi,
I hope you are doing well. This is Rajeev from FutureTech Consultants, LLC, and I have a job opening with our direct client.
Please review the job description below and let me know if you are interested. If so, please send me the latest copy of your resume.
Role: Cloud/Platform Operations Manager
Location: Atlanta / Roswell, GA (Onsite)
Duration: 6+ Months (Possible Extension)
Candidates must have cloud operations experience, have held a Manager title, and have the following skill set:
- Hands-on experience with AWS and/or Azure (multi-account, multi-region operations).
- Solid expertise with observability & monitoring tools (Datadog, Dynatrace, Splunk, Grafana, Prometheus, ELK/EFK).
- Familiarity with Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Strong understanding of SRE principles (SLIs, SLOs, error budgets, incident management frameworks).
Job Description:
- Rheem is seeking a Manager, Cloud Operations to lead, transform, and scale its digital operations landscape across CloudOps, SRE, NOC, Observability, AIOps, and MLOps.
- This individual will serve as the single point of accountability for operational stability and innovation, managing offshore teams while working closely with Rheem's U.S. digital leadership.
This role is not a steady-state manager position. The successful candidate will:
- Identify operational gaps.
- Suggest and implement best practices and tools.
- Introduce automation and innovation strategies.
- Guide daily deliverables for offshore teams.
- Demonstrate tangible business impact each quarter (improved uptime, reduced MTTR, cost savings, predictive alerting, etc.).
- The Manager will report to the Director of Digital Operations and act as Rheem's Cloud Operations Leader in practice.
Key Responsibilities
Operations Strategy & Governance
- Define the vision, strategy, and roadmap for CloudOps, Reliability, and Operational Excellence.
- Establish KPIs and OKRs aligned with Rheem's business goals (availability, MTTR, cloud cost per device, customer churn reduction).
- Deliver quarterly impact reports to business leadership showcasing operational improvements and ROI.
Cloud Operations & FinOps
- Own multi-region cloud operations across AWS and Azure platforms.
- Drive cost transparency and optimization via FinOps practices and dashboards.
- Build capacity and resiliency models for predictable operations.
- Conduct resiliency drills and game days to ensure high availability and compliance.
Site Reliability Engineering (SRE)
- Establish SLIs, SLOs, and error budgets to measure reliability.
- Build incident management playbooks and drive blameless postmortems.
- Proactively improve reliability through automation, self-healing, and continuous testing.
Network Operations Center (NOC) Modernization
- Transform NOC from alert-driven to predictive, AIOps-enabled operations.
- Consolidate monitoring tools and reduce alert fatigue with intelligent correlation.
- Ensure 24x7 support coverage through offshore team alignment and escalation management.
Observability & Telemetry
- Build a unified observability stack (logs, metrics, traces, RUM) leveraging OpenTelemetry.
- Enable business-oriented dashboards (device uptime, customer adoption, churn trends).
- Ensure end-to-end visibility from connected devices to cloud microservices to customer-facing apps.
AIOps & MLOps [optional]
- Deploy AIOps solutions for anomaly detection, predictive alerts, event correlation, and automated remediation.
- Operationalize ML models: rollout, monitoring, drift detection, rollback strategies.
- Showcase measurable value, e.g., warranty claim reduction, improved customer experience metrics.
Process Innovation & Automation
- Audit current toolchain and processes; identify redundancies, gaps, and opportunities for automation.
- Align with DevOps/SecOps to streamline release-to-operations handshakes.
- Drive Infrastructure-as-Code for operations (Terraform, Ansible, GitOps).
Team Leadership & Offshore Management
- Manage and mentor a distributed team (offshore + onsite), setting clear goals and accountability.
- Define roles, responsibilities, and shift structures for 24x7 global coverage.
- Build a culture of continuous improvement and operational excellence.
Compliance, Security & Risk
- Ensure Rheem operations align with compliance standards (SOC2, ISO, HIPAA where applicable).
- Own business continuity planning and disaster recovery testing.
- Proactively identify operational risks and mitigate before they impact business.
Business Alignment & Change Leadership
- Act as the voice of operations at business leadership tables.
- Translate technical improvements into business outcomes (lower churn, improved uptime, faster installs, fewer complaints).
- Champion a quarterly innovation agenda to showcase improvements in uptime, cost, and reliability.
Experience & Leadership
- 10+ years of experience in Cloud Operations, Site Reliability Engineering, or Digital Operations.
- Proven track record of owning operational outcomes (uptime, MTTR, cost optimization, observability).
- Experience managing offshore/global delivery teams with 24x7 coverage.
- Strong leadership presence: able to act as a change agent, operate autonomously, and deliver measurable outcomes without day-to-day direction.
Cloud & Technical Expertise
- Hands-on experience with AWS and/or Azure (multi-account, multi-region operations).
- Solid expertise with observability & monitoring tools (Datadog, Dynatrace, Splunk, Grafana, Prometheus, ELK/EFK).
- Familiarity with Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Strong understanding of SRE principles (SLIs, SLOs, error budgets, incident management frameworks).
Process & Governance
- Demonstrated ability to design and implement operations frameworks (Ops playbooks, NOC modernization, incident command systems).
- Knowledge of FinOps practices (cloud cost visibility, optimization, showback/chargeback).
- Experience ensuring compliance with SOC2, ISO, HIPAA or equivalent standards.
Soft Skills
- Excellent stakeholder communication skills, with the ability to link operational KPIs with business outcomes.
- Strong team leadership and mentoring skills, especially across distributed teams.
Nice-to-Have
- Exposure to AIOps platforms (Moogsoft, BigPanda, OpsRamp, ServiceNow AI modules).
- Experience with MLOps tooling (MLflow, Kubeflow, SageMaker, Azure ML) for model deployment and monitoring.
- Prior background in platform operations at a product/SaaS company (vs pure IT Ops).
- Experience leading automation-first initiatives (predictive alerts, self-healing infra, auto-remediation pipelines).
- Hands-on experience with CI/CD-to-Ops handshakes and change-impact assessments.
Cloud certifications:
- AWS Certified Solutions Architect / DevOps Engineer
- Microsoft Certified: Azure Administrator / Solutions Architect
- FinOps Certified Practitioner
Regards,
Rajeev Mudakala
Sr. Talent Acquisition Specialist
FutureTech Consultants, LLC
5655 Peachtree Parkway, Suite 212, Peachtree Corners, GA 30092