Role: Data Engineer/Analyst
Location: Irving, TX (Day 1 onsite)
Duration: 12+ Months
Role Overview
We are seeking a highly skilled Data Engineer to lead the architecture, development, and optimization of our end-to-end data pipelines. A primary focus of this role is driving our on-premises-to-cloud migration strategy, ensuring a smooth transition of legacy systems into a modern Google Cloud Platform (GCP) ecosystem. You will be responsible for deep-dive data analysis, building robust ETL/ELT processes, and delivering actionable insights through advanced reporting.
Key Responsibilities
· Design and implement high-throughput streaming architectures using Apache Kafka to capture and process event-driven data (see the pipeline sketch after this list)
· Lead the end-to-end migration of complex datasets and workloads from Hadoop, Hive, and Teradata to GCP
· Build and maintain robust ETL/ELT pipelines using Python and PySpark, ensuring seamless integration of both batch and streaming data
· Conduct deep-dive SQL analysis to validate data quality, consistency, and integrity throughout the migration lifecycle and across JSON-based data structures, identifying patterns and troubleshooting performance bottlenecks in distributed systems
· Collaborate with stakeholders to deliver advanced reporting, turning raw streams into actionable dashboards and performance metrics
· Develop automated ETL workflows that transform raw data into structured formats for business intelligence and executive reporting
· Mentor junior engineers and communicate technical concepts clearly to non-technical stakeholders, fostering strong cross-team collaboration
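For candidates wondering what this looks like in practice, here is a minimal, illustrative PySpark Structured Streaming sketch of the kind of pipeline described above: reading events from Kafka and landing them in GCS. The broker address, topic, bucket paths, and event schema are hypothetical placeholders, not our actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical event schema; the real schema depends on the producing system.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
])

# Requires the spark-sql-kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-events-to-gcs").getOrCreate()

# Read the raw event stream from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers the payload as bytes; cast to string and parse the JSON body.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Land parsed events in GCS as Parquet for downstream BigQuery loads.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "gs://example-bucket/events/")
    .option("checkpointLocation", "gs://example-bucket/checkpoints/events/")
    .start()
)
query.awaitTermination()
```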
Technical Stack & Tools
· Cloud Platform: Google Cloud Platform (GCP) – specifically BigQuery, Dataflow, Cloud Functions, GCS, and Cloud Composer.
· Data Processing: Python (Expert), PySpark, Apache Spark.
· Streaming & Messaging: Apache Kafka (Real-time architecture design).
· Legacy Ecosystems: Hadoop (HDFS), Hive, and Teradata.
· Data Languages: Advanced SQL (Optimization, Window Functions, Performance Tuning).
· Data Formats: Handling and parsing complex JSON and semi-structured data (see the flattening sketch below).
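As a concrete taste of the JSON handling listed above, the short sketch below flattens nested, semi-structured records into tabular form with PySpark. All field names and storage paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-orders").getOrCreate()

# Hypothetical nested order records with an array of line items.
orders = spark.read.json("gs://example-bucket/raw/orders/")

# Explode the items array into one row per line item, then project the
# nested fields into flat, BI-friendly columns.
flat = (
    orders
    .withColumn("item", explode(col("items")))
    .select(
        col("order_id"),
        col("customer.customer_id").alias("customer_id"),
        col("item.sku").alias("sku"),
        col("item.quantity").alias("quantity"),
        col("item.unit_price").alias("unit_price"),
    )
)

flat.write.mode("overwrite").parquet("gs://example-bucket/curated/order_items/")
```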
Required Qualifications & Experience
· Bachelor’s degree and five or more years of work experience
· Hands-on experience with Apache Kafka to design and implement real-time streaming architectures
· Proven experience leading large-scale migration strategies, successfully moving complex workloads and multi-terabyte datasets from Hadoop, Hive, and Teradata environments to GCP
· Proven experience with Google Cloud Platform (BigQuery, Cloud Functions, Dataflow, and GCS)
· Mastery of advanced SQL to perform deep-dive analysis, troubleshoot performance bottlenecks in distributed systems, and verify data integrity during migration phases (see the validation sketch after this list)
· Expert proficiency in Python and PySpark for building scalable ETL/ELT pipelines that seamlessly unify batch and streaming data
· Strong background in Hadoop, Hive, and Teradata to facilitate smooth legacy transitions
· Expertise in handling and parsing JSON data for large-scale ingestion, with schemas optimized for downstream consumption and cloud-native storage
· Comprehensive understanding of end-to-end data lifecycles, from raw ingestion to the delivery of actionable dashboards and performance metrics
· Demonstrated ability to act as a bridge between technical and non-technical stakeholders, with a focus on mentoring junior talent and fostering a collaborative engineering culture
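To illustrate the kind of SQL-driven validation referenced above, the sketch below uses the BigQuery Python client and a window function to flag duplicate rows after a migration. The project, table, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured

# Window-function check: count rows whose business key appears more than once
# in the migrated table, a common post-migration integrity validation.
sql = """
SELECT COUNT(*) AS duplicate_rows
FROM (
  SELECT
    ROW_NUMBER() OVER (
      PARTITION BY customer_id, order_id   -- hypothetical business key
      ORDER BY load_timestamp DESC
    ) AS rn
  FROM `example-project.migrated.orders`   -- hypothetical table
)
WHERE rn > 1
"""

result = client.query(sql).result()
for row in result:
    print(f"Duplicate rows detected: {row.duplicate_rows}")
```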
Preferred Qualifications
· Relevant cloud certifications (e.g., Google Cloud Professional Data Engineer or AWS Certified Data Engineer).
· Experience using AI-assisted development tools (e.g., GitHub Copilot, Gemini) to accelerate delivery cycles and optimize code performance.