Data Engineer

Hybrid in Irving, TX, US • Posted 11 hours ago • Updated 10 hours ago
Full Time
On-site
$125,000 - $140,000/yr
Fitment

Job Details

Skills

  • Data Engineer
  • AWS
  • Databricks
  • PySpark

Summary

Experience: 8 - 10 years 
 
Job Title: Databricks Data Engineer
 
We are seeking a highly skilled and motivated Databricks Certified Engineer to design, build, and optimize scalable data pipelines and ETL workflows using the Databricks Data Intelligence Platform. The ideal candidate will be responsible for writing robust Python and Spark code, ensuring data quality, and implementing data governance across cloud environments (AWS, Azure, or Google Cloud Platform). This role requires expertise in large-scale data processing, data warehousing principles, and cloud-native solutions.
Roles & Responsibilities:
Pipeline Development: Design, build, and maintain scalable ETL/ELT data pipelines using PySpark, Delta Lake, Auto Loader, and Databricks Workflows.
Data Transformation & Processing: Design and process batch and streaming data to support the Medallion Architecture (Bronze, Silver, Gold layers).
Data Governance & Security: Implement access controls and data masking policies using Unity Catalog to secure Personally Identifiable Information (PII) and ensure compliance.
Performance Tuning: Optimize Spark jobs, troubleshoot memory bottlenecks, and adjust cluster configurations for cost and compute efficiency.
Proactive Risk Identification: Proactively identify and address underlying data complexities, hidden challenges, and potential risks within data pipelines and the Databricks ecosystem, ensuring robust, secure, and efficient data solutions.
Cross-Functional Collaboration: Partner with Data Scientists and Analysts to curate datasets, support machine learning models (MLflow), and provide integrated reporting.
Develop and maintain comprehensive documentation for data pipelines, data models, and ETL processes.
Participate in code reviews to maintain high-quality code standards.
Troubleshoot and resolve issues in data pipelines and Databricks clusters.
Qualifications:
 
Primary Skill Set:
  • Databricks Platform Expertise: In-depth knowledge of the Databricks Data Intelligence Platform, including notebooks, Delta Lake, MLflow, Unity Catalog, Auto Loader, and Databricks Workflows.
  • Databricks Certification: Relevant Databricks certification (Associate or Professional level) validating foundational or advanced skills in the platform.
Secondary Skill Set:
  • PySpark: Strong proficiency in developing complex data transformations and analytics using PySpark.
  • Apache Iceberg: Experience with Apache Iceberg for open table format management.
Programming Languages:
  • Python: Expert-level proficiency in Python for data manipulation, scripting, and application development.
  • SQL: Advanced proficiency in SQL for data querying and manipulation.
  • Shell Scripting: Experience with shell scripting for automation and job orchestration.
Cloud Platforms: Hands-on experience with Databricks deployed on major cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
Big Data Concepts: Deep understanding of distributed computing, data warehousing principles, ETL/ELT processes, and data modeling.
Good to Have Skills
DevOps Basics: Familiarity with CI/CD tools (e.g., Databricks Asset Bundles, GitHub Actions, GitLab) and orchestration tools like Apache Airflow.
Data Warehousing: Knowledge of Hive for data storage and querying.
Container Orchestration: Familiarity with Kubernetes for deploying and managing containerized applications.
Version Control: Experience with Git or other version control systems.
Databricks Certification Levels
Depending on seniority, candidates may possess different levels of Databricks credentials:
Associate Level: Validates foundational skills in writing Spark code, building SQL queries, and utilizing the Databricks workspace.
Professional Level: Validates advanced skills for production environments, focusing on complex streaming workloads, CI/CD, data governance (Unity Catalog), and high-level performance optimization.
 
Job Title: PySpark Data Engineer
 
 
We are seeking a highly skilled and motivated Data Engineer to play a pivotal role in designing, building, and optimizing our next-generation scalable data pipelines. This position requires expertise in processing massive datasets using cutting-edge technologies like Apache Spark, PySpark, and Hive within a dynamic cloud environment. Your primary objective will be to ensure the utmost data reliability, speed, and efficiency, providing a robust foundation for downstream business intelligence and advanced analytics initiatives.
 
Roles & Responsibilities:
Data Pipeline Development & Maintenance: Design, build, and maintain highly scalable and efficient ETL/ELT data pipelines utilizing PySpark and Spark SQL for complex data transformations.
Cloud Data Infrastructure Management: Deploy, manage, and scale critical data infrastructure components on leading cloud platforms such as Amazon Web Services (AWS) (e.g., EMR, Glue), Microsoft Azure (e.g., Databricks, Synapse), or Google Cloud Platform (GCP).
Data Warehousing & Storage Optimization: Strategically manage data layout, partitioning, and indexing within Apache Hive and various cloud data lake solutions to optimize performance and accessibility.
Performance Tuning & Optimization: Proactively identify and resolve performance bottlenecks in Spark jobs, using the Spark UI for in-depth analysis, managing data skew, and optimizing memory utilization.
Diverse Data Integration: Develop robust solutions for ingesting high-volume and diverse datasets from both structured relational databases and unstructured flat files into our data ecosystem.
Automated Workflow Orchestration: Implement and manage automated data workflows using industry-standard scheduling tools like Apache Airflow or platform-native schedulers, ensuring timely and reliable data delivery.
Strategic Collaboration: Partner closely with data scientists, business analysts, and cross-functional enterprise teams to translate complex business requirements into technically sound and efficient data solutions.
 
Qualifications:
 
Big Data Frameworks Expertise: Demonstrated high proficiency in Apache Spark architecture, including a deep understanding of drivers, executors, and Directed Acyclic Graphs (DAGs).
Advanced Programming: Exceptional coding skills in Python and extensive experience with the PySpark API for developing intricate data transformations and processing logic.
Querying & Schema Management: Strong command of HiveQL and ANSI SQL, coupled with expertise in data partitioning techniques and effective schema definition.
Optimized Storage Formats: In-depth understanding and practical experience with optimized big data storage file formats such as Parquet, ORC, and Avro.
Cloud Ecosystem Development: Hands-on development experience utilizing cloud-native big data utilities (e.g., AWS EMR, Azure Databricks) within major cloud platforms.
Data Warehousing Fundamentals: Solid foundation in Dimensional Data Modeling, including Star and Snowflake schemas, and practical experience with Data Lakes concepts and implementation.
Preferred Qualifications
CI/CD & DevOps Automation: Experience with Continuous Integration/Continuous Deployment (CI/CD) practices and automation tools like Git, Jenkins, or Ansible.
NoSQL Database Integration: Exposure to and experience with NoSQL databases such as HBase, Cassandra, or MongoDB.
Professional Cloud Certifications: Relevant professional cloud certifications (e.g., AWS Certified Data Engineer, Microsoft Certified: Azure Data Engineer Associate) are highly valued.
 
Job Title: AWS Data Engineer
 
We are seeking a highly skilled and motivated AWS Certified Engineer to design, build, and optimize scalable data solutions within the Amazon Web Services (AWS) ecosystem. The ideal candidate will have strong expertise in big data processing using PySpark and a deep understanding of data warehousing concepts, including Hive and modern table formats like Iceberg. This role involves developing, deploying, and managing robust, efficient, and secure data pipelines and analytics solutions on AWS, leveraging core networking and compute services.
 
 
Responsibilities:
AWS Solution Design & Implementation: Design, develop, and deploy scalable and cost-effective data solutions on AWS, leveraging services such as S3 (for data lakes), EC2, EMR, Glue, Athena, Lambda, Redshift, and Kinesis.
Data Pipeline Development: Build and maintain robust ETL/ELT data pipelines using PySpark for data ingestion, transformation, and loading into various data stores, including those utilizing open table formats like Iceberg.
Big Data Processing: Develop and optimize big data processing jobs using PySpark on AWS EMR or AWS Glue, handling large datasets efficiently and integrating with Iceberg table formats.
Data Warehousing: Design, implement, and manage data warehousing solutions, including schema design, data modeling, and query optimization, with a focus on Hive and modern data lake table formats like Iceberg for historical data and analytical queries.
Cloud Infrastructure & Networking: Implement secure and robust cloud infrastructure components, including VPCs, subnets, routing, and security groups, to ensure proper connectivity and isolation for data solutions.
Containerized Workloads: Design, deploy, and manage containerized data processing applications on Amazon Elastic Kubernetes Service (EKS).
Performance Tuning & Optimization: Optimize AWS resources and big data applications (Spark, Hive, Iceberg) for performance, cost, and efficiency.
Data Governance & Security: Implement best practices for data security, access control, and compliance within AWS, including IAM policies, S3 bucket policies, and encryption.
Monitoring & Troubleshooting: Set up monitoring, alerting, and logging for data pipelines and AWS infrastructure; troubleshoot and resolve issues promptly.
Automation: Develop and maintain automation scripts using Python and shell scripting for infrastructure provisioning, deployment, and operational tasks.
Collaboration: Work closely with data scientists, analysts, and other engineering teams to understand data requirements and deliver reliable data solutions.
 
Qualifications:
 
AWS Certification: Hold at least one AWS certification (e.g., AWS Certified Solutions Architect Associate, AWS Certified Data Analytics Specialty, AWS Certified Developer Associate).
AWS Services Expertise: Hands-on experience with key AWS services for data processing and storage including:
Storage & Compute: S3 (for data lakes), EC2
Data Processing: EMR, Glue, Athena, Lambda
Networking: VPC, Subnets, Routing, Security Groups
Containerization: EKS
 
Big Data Processing: Strong proficiency in PySpark for developing complex data transformations and analytics.
Data Lake Table Formats: Practical experience with Apache Iceberg for managing and querying data lakes.
Data Warehousing: In-depth knowledge and practical experience with Apache Hive for data storage, querying, and schema management.
Programming Languages:
Python: Expert-level proficiency in Python for scripting, data manipulation, and AWS automation (Boto3).
Shell Scripting: Proficient in shell scripting for automation and operational tasks.
Database & SQL: Strong SQL skills for data querying and manipulation.
Data Concepts: Solid understanding of ETL/ELT processes, data modeling, distributed computing, and data governance.
Good to Have Skills
Container Orchestration: Experience with Kubernetes for deploying and managing containerized applications.
CI/CD: Experience with CI/CD tools and practices (e.g., AWS CodePipeline, GitHub Actions, GitLab CI) for automating deployment of data solutions.
Orchestration: Experience with workflow orchestration tools like Apache Airflow.
Version Control: Proficient in using Git for source code management.
Other Big Data Technologies: Exposure to other big data technologies like Apache Kafka, Flink, or Presto.
Certifications
AWS Certified Solutions Architect Associate/Professional
AWS Certified Data Analytics Specialty
AWS Certified Developer Associate
 
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10508656
  • Position Id: 9004994