Job Summary
We are seeking a skilled Datadog Implementation Engineer to design, implement, and manage end-to-end observability and monitoring solutions across on‑premises and cloud-based infrastructure. The role will focus on building robust monitoring for infrastructure, applications, and services, with a strong emphasis on SAP systems, services, and background jobs. The ideal candidate will enable proactive monitoring, rapid incident detection, and performance optimization through Datadog.
Key Responsibilities
Datadog & Observability Implementation
Design and implement Datadog observability solutions including Infrastructure Monitoring, APM, Logs, Network Monitoring, and Synthetic Monitoring.
Configure Datadog Agents for on‑premises, hybrid, and cloud environments (AWS, Azure).
Create and maintain custom dashboards, monitors, alerts, and SLOs aligned with business and operational KPIs.
Implement tagging strategies, metrics normalization, and alert tuning to reduce noise and improve signal quality.
Infrastructure & Cloud Monitoring
Monitor servers, VMs, containers (Docker/Kubernetes), databases, and middleware across environments.
Enable monitoring for Linux and Windows systems, storage, networking, and virtualization platforms.
Integrate Datadog with cloud-native services such as EC2, Azure VMs, RDS, Azure SQL Database, load balancers, and Kubernetes clusters.
SAP Monitoring & Integration
Implement monitoring for SAP systems including:
SAP ECC
SAP HANA databases
SQL Server databases
SAP application servers
Monitor SAP services, processes, and background jobs for availability, performance, and failures.
Enable alerting for critical SAP job failures, long-running jobs, system bottlenecks, and resource constraints.
Integrate SAP monitoring data into Datadog dashboards for unified observability.
Logs, APM & Troubleshooting
Configure log ingestion, parsing, and correlation across infrastructure and applications.
Implement APM tracing for supported applications and services to identify latency and performance issues.
Support root cause analysis (RCA) during incidents using metrics, logs, and traces.
Collaborate with application, SAP, infrastructure, and cloud teams to resolve performance and availability issues.
Automation, Security & Best Practices
Automate monitoring deployment using Infrastructure as Code (IaC) and configuration management tools (Terraform, Ansible, scripts).
Integrate Datadog with ITSM and incident management tools (ServiceNow, PagerDuty, Opsgenie, etc.).
Follow observability best practices for scalability, security, and cost optimization.
Document monitoring standards, runbooks, and operational procedures.
Required Skills & Qualifications
Technical Skills
Strong hands-on experience with Datadog implementation and administration
Experience monitoring on‑premises, cloud, and hybrid environments
Knowledge of SAP architecture, SAP services, and background job monitoring
Experience with Linux/Unix and Windows systems
Understanding of cloud platforms: AWS, Azure, or Google Cloud Platform
Experience with containers and orchestration: Docker, Kubernetes
Working knowledge of:
Metrics, logs, traces, and SLO/SLA concepts
APIs and scripting (Python, Bash, PowerShell)
CI/CD and automation tools