Overview:
- Remote
- Depends on Experience
- Full Time
Skills:
- Google Cloud Platform
- DevOps
- Kubernetes
- Machine Learning Operations (MLOps)
- Python
- Artificial Intelligence
- Apache Spark
- Grafana
- Docker
Job Details
About the Role:
As an AIOps Engineer, you will play a critical role in enhancing the reliability, efficiency, and performance of Google Cloud's vast and complex infrastructure.
You will leverage your expertise in Artificial Intelligence and Machine Learning to design, implement, and optimize intelligent automation solutions for IT operations, ultimately improving the experience for our global customers.
This position offers a unique opportunity to work on challenging problems at scale, contribute to the evolution of cloud operations, and collaborate with world-class engineers and researchers at Google.
Responsibilities:
- Design, develop, and implement AIOps solutions to automate routine operational tasks, detect anomalies proactively, and enable self-healing capabilities across Google Cloud infrastructure.
- Apply machine learning algorithms to large-scale operational data (logs, metrics, traces, events) to predict system failures, identify root causes, and optimize resource utilization.
- Build and maintain robust data pipelines for collecting, processing, and analyzing diverse IT operational data from various sources.
- Collaborate closely with Site Reliability Engineers (SREs), software developers, and infrastructure teams to integrate AIOps solutions into existing workflows and systems.
- Develop and implement monitoring and alerting systems that leverage AI-driven insights to ensure the reliability, availability, and performance of cloud services.
- Contribute to the continuous improvement of AIOps platforms and tools, staying current with industry trends and advancements in AI/ML, cloud computing, and IT operations.
- Troubleshoot and resolve complex platform-related issues, ensuring minimal impact on critical AI/ML operations and customer services.
- Generate reports and visualizations to provide actionable intelligence and communicate insights to stakeholders.
Minimum Qualifications:
- 5+ years of experience in platform engineering, DevOps, Site Reliability Engineering (SRE), or IT operations, with a focus on automation and system reliability.
- Strong programming skills in Python, Go, Java, or C++.
- Experience with cloud platforms (e.g., Google Cloud Platform, AWS, Azure) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with data streaming and processing frameworks (e.g., Apache Kafka, Apache Spark) and IT monitoring tools (e.g., Prometheus, Grafana, Splunk, ELK Stack).
- Understanding of machine learning algorithms and concepts, with practical experience in applying them to operational data for anomaly detection, predictive analytics, or root cause analysis.
- Excellent problem-solving, analytical, and communication skills.
- Ability to work collaboratively in a fast-paced, dynamic environment.
Preferred Qualifications:
- Experience with MLOps practices, including model deployment, evaluation, and lifecycle management in production environments.
- Familiarity with large-scale distributed systems and microservices architectures.
- Knowledge of AI/ML frameworks such as TensorFlow, PyTorch, or scikit-learn.
- Experience in building self-healing systems and implementing automated remediation workflows.
- Google Cloud certifications (e.g., Professional Cloud Architect, Professional Data Engineer, Professional Machine Learning Engineer).