Platform Reliability Engineer

Overview

On Site
Hybrid
Full Time
Part Time
Accepts corp to corp applications
Contract - W2
Contract - Independent
Contract - Permanent
100% Travel

Skills

Azure
GCP
IaC
Platform
Reliability

Job Details


Role: Platform Reliability Engineer

Location: Dallas, TX (On-site)

Core Responsibilities

    • Implement automated testing frameworks for infrastructure changes
    • Build APIs for common platform operations, with authentication and authorization
    • Create sandbox environments with guardrails for experimentation
    • Develop templated project provisioning workflows
    • Create runbooks for common failure scenarios
    • Implement automated scaling policies based on usage patterns
    • Set up cost anomaly detection alerts
    • Implement lifecycle policies for cost-effective data storage
    • Build example implementations and starter kits

Technical Skills

    • Google Cloud Platform Services: Expert in BigQuery, Dataproc, Dataflow, Data Fusion, Pub/Sub, Cloud Storage, and Cloud Run
    • Infrastructure as Code: Advanced Terraform skills, including module development
    • Containers & Orchestration: Deep experience with Docker, Kubernetes, and GKE
    • CI/CD: Proficient with Cloud Build and GitLab CI/CD pipelines
    • Data Security: Strong understanding of data governance, security, and compliance
    • Data Processing: Expert in data pipeline design using Apache Spark and Apache Beam
    • Machine Learning Ops: Working knowledge of Vertex AI, AI Platform, and TensorFlow/PyTorch deployment
    • Streaming: Experience with Kafka and Pub/Sub architectures
    • API Design: RESTful API design with authentication and authorization patterns
    • Workflow Management: Experience with Astronomer/Airflow for orchestration
    • Data Transformation: dbt implementation experience
    • GitOps: Proficient with GitOps principles and tools (ArgoCD, Flux)

Soft Skills

    • Strategic thinking and architectural vision
    • Advanced communication and stakeholder management
    • Ability to balance technical excellence with business needs
    • Strong mentorship and knowledge-sharing capabilities
    • Collaborative approach to architectural decisions

Certifications

    • Google Cloud Professional Cloud Architect (Required)
    • Google Cloud Professional Data Engineer (Required)
    • Google Cloud Professional Security Engineer (Recommended)
    • Kubernetes CKA/CKAD (Recommended)