Work Location: Remote
Engineer/SME
Key Activities:
Unified Monitoring Framework: Build an integrated solution leveraging existing tools (Splunk, Dynatrace, security platforms) and local logs from Windows/Linux servers, infrastructure components (storage arrays, SAN switches, network devices), databases (Oracle, SQL Server, MySQL, MongoDB), backup systems (Rubrik, Data Domain, Infinibox), compute nodes (Dell servers), VMware environments, IBM Power/AIX, and IBM LinuxOne.
Automated Alerting & Proactive Response: Develop intelligent alerting mechanisms and automated remediation workflows to reduce manual intervention and accelerate incident resolution.
Data Integration & Gap Closure: Aggregate and normalize data from multiple sources, including platform tools and local logs, to fill visibility gaps and provide actionable insights.
Dashboard Development: Create a common GUI-based dashboard for real-time monitoring, alerting, and reporting across all infrastructure layers.
Skills & Tools: Utilize Ansible, Python, PowerShell, shell scripting, and GUI development to deliver scalable automation solutions.
Business Impact:
Improved Reliability: Proactive detection and automated remediation reduce outages and service degradation.
Operational Efficiency: Significant reduction in manual monitoring and troubleshooting efforts.
Enhanced Security & Compliance: Centralized visibility into logs and alerts enables faster response to security events.
Scalability: A common framework supports growth and increasing complexity without proportional increases in headcount.
Job Skill Requirements and Experience:
Core Technical Skills:
Automation & Scripting:
Proficiency in Python, Ansible, PowerShell, and shell scripting (Bash/Korn).
Ability to develop automation workflows for monitoring, alerting, and remediation.
Monitoring & Logging Tools:
Hands-on experience with Splunk, Dynatrace, and other enterprise monitoring platforms.
Familiarity with log aggregation and parsing from multiple sources (OS, applications, infrastructure components).
Infrastructure Knowledge:
Strong understanding of Linux (RHEL) and Windows Server environments.
Exposure to VMware, IBM Power/AIX, and IBM LinuxOne systems.
Knowledge of storage arrays, SAN switches, network switches, and IP traffic monitoring.
Experience with backup platforms (Rubrik, Data Domain, Infinibox).
Familiarity with database systems (Oracle, SQL Server, MySQL, MongoDB).
GUI Development: Ability to build dashboard interfaces for real-time monitoring and alerting (for example, with Python web frameworks such as Flask or Django, or similar).
Additional Skills:
Data Integration: Ability to aggregate and normalize data from multiple sources for unified alerting.
Security & Compliance Awareness: Understanding of security logs and compliance requirements for infrastructure monitoring.
Problem-Solving & Creativity: Ability to identify gaps in current monitoring and design innovative solutions.
Experience:
5+ years in infrastructure automation or systems engineering roles.
Proven track record in building automation frameworks and monitoring solutions.
Experience working in large-scale, distributed environments with global teams.
Prior involvement in proactive alerting and automated remediation projects is highly desirable.