Job Title: System Administrator / System Engineer / Infrastructure Engineer
Location: Redmond, WA
Duration: 24 Months
Job Description
Role Summary
Responsible for 24x7 monitoring, incident management, and operational support of a large-scale hybrid infrastructure including servers, virtualization platforms, storage systems, network devices, and applications. Ensure high availability, performance, and reliability across all environments (Prod, DR, Non-Prod).
Key Responsibilities
Infrastructure Monitoring & Operations
- Monitor 1200+ servers (Windows/Linux), virtualization platforms (VMware, Nutanix), and web servers for performance and availability.
- Oversee storage systems (PB-scale: Quantum, Isilon, NAS, SAN), ensuring uptime and capacity health.
- Monitor network infrastructure (1200+ devices), including switches, routers, firewalls, VPN tunnels, WAPs, and ISP circuits.
- Monitor and act on incidents and requests related to the infrastructure and tools hosted in the environment.
Incident & Event Management
- Perform L1/L2 triage for alerts, incidents, and outages across infrastructure and applications
- Ensure timely incident resolution, escalation, and communication as per SLAs
- Correlate alerts across tools to identify root causes and reduce noise
Application & Service Monitoring
- Monitor 50+ applications across multiple environments (Prod, DR, UAT, Dev)
- Track service health, availability, and dependencies (web, middleware, backend systems)
Capacity & Performance Management
- Track utilization trends across compute, storage (multi-PB), and network
- Proactively identify bottlenecks and recommend optimization
Change & Release Support
- Support infrastructure and application deployments, patches, and maintenance activities
- Validate system health pre/post changes
Disaster Recovery & Resilience
- Support DR readiness for large-scale storage and application environments
- Participate in DR drills and failover validation
Reporting & Documentation
- Maintain operational dashboards, runbooks, and incident reports
- Provide daily/weekly health and SLA reports
Required Skills
Strong knowledge of:
- Windows & Linux server administration (basic troubleshooting, L1 and L1.5)
- Virtualization: VMware & Nutanix (L1 and L1.5)
- Storage systems: SAN/NAS, Isilon, Quantum or similar PB-scale storage
- Networking fundamentals: TCP/IP, DNS, VPN, Firewalls, Load Balancers (F5) (L1 and L1.5)
- Experience with monitoring tools (New Relic, Splunk, Nagios, Zabbix, Dynatrace, SCOM, etc.)
- Understanding of ITSM tools (ServiceNow preferred) for incident, change, and problem management
- Rubrik backup management tool
Operational Skills
- Incident management and escalation handling in 24x7 environments
- Strong troubleshooting and analytical skills
- Ability to correlate infrastructure, network, and application issues
- Strong communication and coordination skills
- Ability to work under pressure in critical outage scenarios
- Good documentation and reporting skills
Preferred Qualifications
- ITIL Foundation certification
- Experience in large-scale enterprise or MSP environments
- Exposure to cloud or hybrid environments (AWS/Azure) is a plus.
Shift Requirement
- 24x7 rotational shifts (including weekends and on-call support)