Lead Big Data/Data Processing/PySpark

  • Chicago, IL
  • Posted 15 hours ago | Updated 15 hours ago

Overview

On Site
$45 - $50 per hour
Contract - W2
Contract - Independent
Contract - 12 Month(s)

Skills

Amazon S3
Amazon Web Services
Analytical Skill
Big Data
Cloud Computing
Collaboration
Data Flow
Data Modeling
Data Processing
Data Quality
Database
Databricks
Documentation
ELT
Extract, Transform, Load (ETL)
Google Cloud Platform
Innovation
Machine Learning (ML)
Management
Microsoft Azure
Performance Tuning
PySpark
Quality Assurance
Reporting
Snowflake Schema
Star Schema
Streaming
Testing

Job Details

Employment: W2
Minimum experience: 8+ years
Job Description:
  • Data Pipeline Development: Design, develop, test, and deploy robust and scalable data pipelines using PySpark for data ingestion, transformation, and loading (ETL/ELT) from various sources (e.g., S3, ADLS, databases, APIs, streaming data). A minimal ETL sketch appears after this list.
  • Big Data Processing: Utilize PySpark to process large datasets efficiently, handling complex data transformations, aggregations, and data quality checks.
  • Performance Optimization: Optimize PySpark jobs for performance, efficiency, and cost-effectiveness, identifying and resolving bottlenecks (a tuning sketch appears after this list).
  • Data Modeling: Collaborate with data architects and analysts to design and implement efficient data models (e.g., star schema, snowflake schema, data vault) for analytical and reporting purposes (a star-schema sketch appears after this list).
  • Cloud Integration: Work with cloud platforms (AWS, Azure, Google Cloud Platform) and their respective big data services (e.g., AWS EMR, Azure Databricks, Google Cloud Platform Dataflow/Dataproc) to deploy and manage PySpark applications; a strong understanding of the medallion architecture is expected.
  • Collaboration: Work closely with data scientists, machine learning engineers, and other stakeholders to understand data requirements and deliver solutions that meet business needs.
  • Testing and Quality Assurance: Implement comprehensive unit, integration, and end-to-end tests for data pipelines to ensure data accuracy and reliability (a test sketch appears after this list).
  • Monitoring and Support: Monitor production data pipelines, troubleshoot issues, and provide ongoing support to ensure data availability and integrity.
  • Documentation: Create and maintain clear and concise documentation for data pipelines, data models, and processes.
  • Innovation: Stay up-to-date with the latest advancements in big data technologies, PySpark, and cloud services, and recommend new tools and approaches.
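
For the Data Pipeline Development bullet, a minimal PySpark ETL sketch; the S3 paths, column names (event_id, event_ts), and quality rules are illustrative assumptions, not details from this posting:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("events-etl").getOrCreate()

    # Extract: read raw JSON events from S3 (bucket and layout are assumed).
    raw = spark.read.json("s3a://example-bucket/raw/events/")

    # Transform: type the timestamp, derive a partition date, and apply
    # basic data-quality rules (non-null keys, no duplicate events).
    clean = (
        raw.withColumn("event_ts", F.to_timestamp("event_ts"))
           .withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("event_id").isNotNull())
           .dropDuplicates(["event_id"])
    )

    # Load: write partitioned Parquet to the curated zone.
    (clean.write.mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3a://example-bucket/curated/events/"))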
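
For the Performance Optimization bullet, a sketch of two common PySpark tuning moves, broadcasting a small dimension table and repartitioning on the write key; the table layouts and paths are again assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("events-tuning").getOrCreate()

    events = spark.read.parquet("s3a://example-bucket/curated/events/")
    users = spark.read.parquet("s3a://example-bucket/curated/dim_users/")

    # Broadcast the small dimension table so the join avoids a full shuffle.
    enriched = events.join(F.broadcast(users), "user_id", "left")

    # Repartition by the write key before writing to avoid many small files.
    (enriched.repartition("event_date")
             .write.mode("overwrite")
             .partitionBy("event_date")
             .parquet("s3a://example-bucket/marts/events_enriched/"))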
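
For the Data Modeling bullet, a sketch of deriving a star schema (one dimension table, one fact table) from a flat curated table; the source table and its columns are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("star-schema").getOrCreate()

    orders = spark.read.parquet("s3a://example-bucket/curated/orders/")

    # Dimension: one row per customer, keyed by the natural key customer_id.
    dim_customer = (
        orders.select("customer_id", "customer_name", "region")
              .dropDuplicates(["customer_id"])
    )

    # Fact: one row per order, carrying foreign keys and additive measures.
    fact_orders = orders.select(
        "order_id", "customer_id", "order_date", "quantity", "amount"
    )

    dim_customer.write.mode("overwrite").parquet("s3a://example-bucket/marts/dim_customer/")
    fact_orders.write.mode("overwrite").parquet("s3a://example-bucket/marts/fact_orders/")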
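
For the Testing and Quality Assurance bullet, a sketch of a pytest unit test that exercises a pipeline transformation against a local SparkSession; the dedupe_events function and its rules are invented for illustration:

    import pytest
    from pyspark.sql import SparkSession, functions as F

    @pytest.fixture(scope="session")
    def spark():
        # Local session so the test suite runs without a cluster.
        return (SparkSession.builder.master("local[1]")
                .appName("pipeline-tests").getOrCreate())

    def dedupe_events(df):
        # Transformation under test: drop null keys, then duplicates.
        return df.filter(F.col("event_id").isNotNull()).dropDuplicates(["event_id"])

    def test_dedupe_events_removes_nulls_and_duplicates(spark):
        df = spark.createDataFrame(
            [("e1", "click"), ("e1", "click"), (None, "view")],
            ["event_id", "event_type"],
        )
        assert dedupe_events(df).count() == 1
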
Thanks & Regards
B. Koushik
Talent Acquisition
Direct: +1
Phone: +1 ext: 229
I can be reached between 9:00 AM and 5:30 PM EST