Job Details
Title: NOC Engineer
Location: Remote
Key Responsibilities:
● Monitor public cloud infrastructure (compute, storage, networking, and
Kubernetes clusters) using observability tools like Prometheus, Grafana, and internal
dashboards.
● Identify, triage, and respond to real-time alerts and incidents to prevent or minimize
customer impact.
● Perform first-level troubleshooting of system issues, including host failures, degraded
services, and latency incidents.
● Escalate critical issues to CloudOps Engineering, Network Infrastructure, or Security
teams following predefined runbooks and escalation paths.
● Maintain clear documentation of incidents, resolutions, and system changes in the
ticketing system (e.g., Jira, PagerDuty, or internal tooling).
● Write and update operational playbooks to standardize response procedures for cloud
infrastructure issues.
● Collaborate in post-incident reviews with the Network Infrastructure and CloudOps teams
to identify root causes and help implement long-term fixes.
Qualifications:
● 2+ years of experience in a NOC, cloud operations, or system monitoring role, preferably
in a public cloud or SaaS environment.
● Strong understanding of Linux systems, networking concepts (TCP/IP, DNS, VPN, BGP),
and system administration basics.
● Experience working with Juniper and Arista network equipment, including basic
configuration and troubleshooting.
● Familiarity with container orchestration and cloud-native tools (e.g., Kubernetes, Docker)
is a plus.
● Excellent troubleshooting skills and ability to work calmly in high-pressure, time-sensitive
situations.
● Strong communication skills with the ability to write clear incident reports and Cloud
Operations playbooks.
● Experience with services (e.g., Droplets, VPCs, Load Balancers, Spaces)
is highly preferred.
Preferred Qualifications:
● Certifications in Juniper (e.g., JNCIA, JNCIS) or Cisco (e.g., CCNA) technologies.
● Familiarity with Infrastructure-as-Code tools (e.g., Terraform) and CI/CD pipelines.
● Prior experience in high-availability cloud environments and large-scale incident
management.