Job Details
At least 8 to 10+ years of experience required.
L2 (Tier 2 Operations)
Primary Monitoring & Incident Response
Provide 24/7 monitoring of Azure infrastructure (compute, network, storage) using tools such as Azure Monitor, Splunk, DynaTrace, and custom dashboards.
Respond to alerts and triage P1/P2 escalations via ServiceNow war rooms, performing initial diagnosis and remediation where possible.
Adhere to incident, change, and exception management processes.
Capacity & Availability Management
Identify scaling opportunities for virtual machines or services as required, and identify zone-redundancy patterns for performance.
Keep track of capacity forecasts and proactively identify performance bottlenecks.
Backup & Restore Operations
Execute frequent backups (Azure Backup, NetApp Snapshots) and perform basic restore tasks to ensure business continuity.
Conduct routine backup verifications/tests to confirm data integrity.
Access & Permissions Management
Maintain Azure/NetApp file shares, setting up and adjusting access controls and AD group permissions according to organizational policy.
Perform periodic identity and access reviews to enforce the principle of least privilege.
Logging & Metrics Oversight
Oversee monitoring agents (e.g., Splunk, DynaTrace, Azure Alerts, SystemPulse), ensuring they are up to date and generating the right alerts/metrics for L2 to act upon.
Collaborate with L3 to fine-tune alert thresholds and logging when chronic issues emerge.
Basic Performance Testing
Execute routine performance checks (e.g., load or stress tests) in coordination with L3 teams when potential service degradation is suspected.
Document and escalate consistent performance anomalies.
SKILL SET & STAFFING CONSIDERATIONS
Comfortable reading and troubleshooting logs/metrics (Splunk, DynaTrace, Azure Monitor).
Familiar with Azure Backup services, basic restore procedures, and file share permissions.
Proficiency in ticketing systems (ServiceNow) and in collaborating with other technical teams on escalations.
Sufficient knowledge to follow runbooks and standard operating procedures (SOPs).
Documentation of standard operating procedures and IaC changes should be continuously updated in a central repository (e.g., Git repos).
Familiarity with Epic implementations (on-prem / cloud).
L3 (Tier 3 Operations)
Advanced Provisioning & OS Management
Use Golden Images for VM provisioning, manage OS patching, and ensure updates are tested in dev/test environments before production rollout.
Oversee Active Directory and DNS changes for large-scale or critical deployments.
System Refreshes, DR, & High-Severity Issue Handling
Coordinate system refreshes, restore tests, and DR failovers, especially for Epic or other mission-critical applications.
Own P1/P2 escalations when L2 cannot resolve them, and lead major incident war rooms (root-cause analysis, post-incident reviews).
Azure Policies, Security & Network
Enforce Azure Policies and RBAC; manage vulnerability scans (Microsoft Defender or other tools) and patch any discovered weaknesses.
Update and maintain firewall rules, NSGs, or other network security baselines.
Migrations & Decommissioning
Oversee more complex migrations between environments or Azure regions (sometimes involving re-platforming or re-architecting).
Perform advanced data snapshot validations and coordinate system retirement/decommission tasks.
Resiliency Testing & Audits
Plan and execute chaos engineering exercises, including rollback or failback scenarios.
Conduct regular security audits, ensuring that any deviations from compliance are documented and remediated.
Mentoring & Training Responsibilities
Deliver quarterly (or more frequent) training sessions to L2 on new tools, processes, or changes to the environment.
Act as the final escalation point for complex or unknown technical issues.
SKILL SET & STAFFING CONSIDERATIONS
Deep knowledge of Azure services, including network configuration, VM management, storage, AD, DNS, and security controls.
Ability to architect and troubleshoot large, complex environments both manually and with automated tools.
Strong scripting or automation capabilities (PowerShell, Azure CLI) for large-scale patching or configuration updates.
Experience in incident management, root-cause analysis, and producing postmortem reviews.
Familiarity with Epic implementations (on-prem / cloud).