SRE Engineer Data Analytics

Overview

On Site
$60/HR on W2
Full Time

Skills

GC
EDA
Scalability
Core Data
Operational Excellence
FOCUS
Analytics
Continuous Integration
Continuous Delivery
GitHub
Jenkins
Provisioning
Terraform
Operational Efficiency
Optimization
Test Plans
Performance Testing
Incident Management
ITIL
IT Service Management
ServiceNow
Change Management
Root Cause Analysis
Documentation
Service Level
Budget
AppDynamics
Dashboard
Dynatrace
Kibana
Data Analysis
Performance Tuning
Computer Cluster Management
Workflow
Orchestration
Data Flow
Access Control
Regulatory Compliance
TLS
SSL
Computer Science
DevOps
Reliability Engineering
IaaS
Microsoft Azure
Python
Scripting
Bash
Configuration Management
Ansible
Docker
Kubernetes
Linux
Computer Networking
TCP/IP
DNS
Load Balancing
Cloud Computing
Amazon Web Services
Remote Desktop Services
Amazon RDS
NoSQL
Database
Databricks
Informatica
Microsoft Power BI
Communication
Management

Job Details

Looking for candidates local to Washington, DC. W2 only.

Visa: GC-EAD, EAD

The Site Reliability Engineer (SRE) for Data Analytics is a critical mid-level role focused on applying robust SRE and DevOps principles to ensure the stability, performance, and scalability of our client's core data platforms. This role will drive operational excellence by automating CI/CD pipelines and infrastructure (IaC), leveraging advanced observability tools like Dynatrace, and leading incident response for key systems including Databricks, Informatica, and Power BI. The successful candidate will have 2-4 years of experience, a passion for automation, strong cloud skills (AWS/Azure), and a dedicated focus on maintaining high service reliability (SLIs/SLOs) for a critical Data & Analytics ecosystem in a fast-paced environment in the DC area.

Responsibilities

Deployment & Automation
  • Implement and maintain CI/CD pipelines using tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
  • Automate infrastructure provisioning and management using Infrastructure-as-Code (IaC) with Terraform, CloudFormation, or AWS CDK.
  • Develop robust automation scripts and self-service tooling to minimize toil and enhance operational efficiency.
Capacity, Performance & Cost Optimization
  • Lead and implement operational cost optimization initiatives across cloud infrastructure and data platforms.
  • Configure, maintain, and tune auto-scaling policies and performance thresholds.
  • Develop and execute Resiliency Test plans and provide critical support for Performance testing efforts.
Incident Management & SRE Principles
  • Serve as a production on-call responder, employing strong troubleshooting skills to quickly resolve complex incidents.
  • Proficiently utilize ITIL framework concepts and ITSM tools (e.g., ServiceNow) for incident and change management.
  • Develop high-quality Root Cause Analysis (RCA) documentation and Knowledge articles to prevent recurrence.
  • Implement and enforce SRE principles, including the definition and tracking of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Observability & Monitoring
  • Manage and leverage advanced observability platforms (Dynatrace preferred, AppDynamics, ELK, etc.).
  • Implement distributed tracing with accurate context propagation across data services and applications.
  • Optimize monitoring queries and configure actionable dashboards, alerts, and anomaly detectors using tools like Dynatrace and Kibana.
Data Analytics Platform Reliability
  • Manage Databricks clusters and data pipelines, covering reliability, performance tuning, and access control.
  • Maintain Informatica workflow orchestration, connector reliability, and error handling for critical data flows.
  • Manage Power BI gateway health and access control, and ensure reliable data refresh processes.
Security & Compliance
  • Manage service accounts, access permissions, and roles following the principle of least privilege.
  • Create, deploy, and manage digital certificates and TLS/SSL configurations.
  • Execute effective remediation tasks and respond to security incidents as part of the operational team.
Qualifications

Education & Experience
  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 2 to 4 years of hands-on experience in a DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure role.
  • Practical, working experience with major cloud platforms, specifically AWS and Azure.
Technical Skills
  • Mid-level proficiency in Python or other scripting languages (e.g., Bash, Go) for automation tasks.
  • Mid-level proficiency with Configuration Management tools, including Ansible.
  • Strong knowledge of containerization technologies (Docker, Kubernetes/ECS).
  • Solid understanding of Linux systems and networking fundamentals (TCP/IP, DNS, Load Balancing).
  • Working knowledge of relational, cloud-native (e.g., AWS RDS), and NoSQL database technologies.
  • Direct hands-on experience supporting and maintaining data platforms like Databricks, Informatica, or Power BI is highly desirable.
Professional Attributes
  • Excellent written and verbal communication skills, with a proven ability to document complex systems.
  • Demonstrated ability to work independently, manage shifting priorities, and drive initiatives to completion.
  • Availability for on-call duties and for work outside standard business hours as required to support a 24/7 production environment.

This is a high-priority, proactive requisition.

Background Check : No

Drug Screen : No
