Role: AIOps Engineer
Location: Fremont, CA (Onsite role)
Mandatory Skills: Splunk; PowerShell or Python scripting; alerts and log monitoring; Confluence and SharePoint
Summary: Data Ops Engineer responsible for proactively monitoring, supporting, and ensuring the reliability of data pipelines and related infrastructure in an Azure-based ecosystem.
Experience Requirements:
5+ years in IT Operations, Data Engineering, or related fields.
Experience in Azure Data Services, ETL/ELT processes, and ITIL-based operations.
2+ years in AIOps implementation, monitoring, and automation.
Skill Requirements:
Basic understanding of Azure Data Services (ADF, Synapse, Databricks).
Experience in monitoring alerts from data pipelines (Azure Data Factory, Synapse, ADLS, Microsoft Fabric, etc.).
Familiarity with ETL/ELT concepts, data validation, and pipeline orchestration.
Experience in identifying failures in ETL jobs, scheduled loads, and streaming data services.
Hands-on experience with IT monitoring tools (e.g., Splunk, Azure Monitor, Dynatrace).
Skilled in creating and updating runbooks and SOPs.
Familiarity with data refresh cycles and the differences between batch and streaming processing.
Familiarity with ITIL processes for incident, problem, and change management.
Strong attention to detail, ability to follow SOPs, and effective communication for incident updates.
Solid understanding of containerized services (Docker/Kubernetes) and DevOps pipelines (Azure DevOps, GitHub Actions), with a focus on data-layer integration.
Proficiency in Jira, Confluence, and SharePoint for status updates and documentation.
Understanding of scripting (PowerShell, Python, or Shell) for basic automation tasks (see the log-scanning sketch after this list).
Ability to interpret logs and detect anomalies proactively.
Analytical thinking for quick problem identification and escalation.
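To illustrate the level of scripting expected, the following is a minimal Python sketch of a basic log-scanning automation: it counts ERROR lines per component and flags anything above a threshold. The log path, line format, and threshold are hypothetical placeholders, not taken from any specific pipeline.

```python
#!/usr/bin/env python3
"""Minimal log-scanning sketch. The log path, line format, and
threshold below are hypothetical placeholders."""
import re
import sys
from collections import Counter

LOG_PATH = "pipeline.log"   # hypothetical log file
ERROR_THRESHOLD = 5         # flag components logging more errors than this

# Assumed line format: "2024-01-01T00:00:00 LEVEL component message"
LINE_RE = re.compile(r"^\S+\s+(?P<level>\w+)\s+(?P<component>\S+)\s+.*$")

def scan(path: str) -> Counter:
    """Count ERROR lines per component."""
    errors = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m and m.group("level") == "ERROR":
                errors[m.group("component")] += 1
    return errors

if __name__ == "__main__":
    counts = scan(LOG_PATH)
    noisy = {c: n for c, n in counts.items() if n > ERROR_THRESHOLD}
    if noisy:
        # In practice this would raise a ticket or notify the on-call team.
        print(f"ALERT: components above error threshold: {noisy}", file=sys.stderr)
        sys.exit(1)
    print("Health check passed: no component above error threshold.")
```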
Preferred:
Exposure to CI/CD for data workflows and to real-time streaming (Event Hubs, Kafka).
Understanding of data governance and compliance basics.
Experience with anomaly detection, time-series forecasting, and log analysis (a sketch of a simple detector follows this list).
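As an illustration of the preferred anomaly-detection skill, here is a minimal Python sketch of rolling z-score detection on a metric series (for example, pipeline run durations). The sample data, window size, and threshold are illustrative assumptions, not values from any real system.

```python
"""Rolling z-score anomaly detection sketch; all values are illustrative."""
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from
    the mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append((i, series[i]))
    return anomalies

if __name__ == "__main__":
    # Hypothetical pipeline run durations in seconds; the 180s spike is flagged.
    durations = [60, 62, 59, 61, 63, 60, 58, 62, 61, 60, 59, 61, 180]
    print(rolling_zscore_anomalies(durations))  # -> [(12, 180)]
```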
Key Responsibilities:
Monitor and support data pipelines on Azure Data Factory, Databricks, and Synapse.
Perform incident management and root-cause analysis for L1 issues, escalating as needed.
Surface issues clearly and escalate them to the appropriate SME teams so they can be fixed at the root, rather than applying repeated short-term fixes.
Identify whether issues are at pipeline level, data source level, or infrastructure level and route accordingly.
Document incident resolution patterns for reuse.
Acknowledge incidents promptly and route them to the correct team.
Execute daily health checks, maintain logs, and update system status in collaboration tools (see the health-check sketch at the end of this section).
Work strictly according to the SOPs documented by the team.
Maintain and update SOPs, runbooks, and compliance documentation.
Update system health status in Confluence or SharePoint every 2 hours during the shift.
Update incident status every 4 hours for P1/P2 tickets.
Complete service tasks within SLA to keep queues clear.
Ensure compliance with enterprise data security, governance, and regulatory requirements.
Collaborate with data engineers, analysts, DevOps/SRE teams, and business teams to ensure reliability and security.
Implement best practices in ML operations and productionization.
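For the daily health checks and periodic status updates above, a scheduled script along these lines is typical. This is a hedged sketch only: the check functions are placeholders (a real version would query Azure Monitor or the relevant service APIs), and the local log file stands in for a Confluence or SharePoint update.

```python
"""Daily health-check sketch; all checks and outputs are placeholders."""
from datetime import datetime, timezone

def check_adf_pipelines() -> bool:
    # Placeholder: a real check would query recent Azure Data Factory runs.
    return True

def check_synapse_jobs() -> bool:
    # Placeholder: a real check would inspect Synapse scheduled job status.
    return True

CHECKS = {
    "Azure Data Factory pipelines": check_adf_pipelines,
    "Synapse scheduled jobs": check_synapse_jobs,
}

if __name__ == "__main__":
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [f"System health status as of {timestamp}:"]
    for name, check in CHECKS.items():
        lines.append(f"  {name}: {'OK' if check() else 'FAILED'}")
    report = "\n".join(lines)
    print(report)
    # A real runbook step would post this to Confluence or SharePoint;
    # appending to a local file is a stand-in for that update.
    with open("health_status.log", "a", encoding="utf-8") as fh:
        fh.write(report + "\n")
```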