Data Engineer Python 9462891
The Research and Early Development (gRED) Early Clinical Development Informatics (ECDi) department is seeking an experienced Data Engineer to design, develop, and optimize ETL/data pipelines that support a variety of machine learning, predictive analytics, systems, and BI solutions, advancing the organization's goal of digitizing and optimizing clinical trials.
This individual will work within ECDi's Information Management Office (IMO).
The role requires cross-functional interaction with Data Management Leads, Predictive Analytics Analysts, Artificial Intelligence Scientists, and Information Technology teams across multiple projects to implement data solutions in ECDi's data lake and data warehouse, gCORE.
The hallmark of a great candidate is the ability to translate the unique needs and requirements of a diverse set of stakeholders across both data lake and data warehouse use cases, paired with an eagerness to solve complex data challenges by selecting the best-fit solution.
Must be self-motivated, passionate about data management and analytics, and able to extrapolate customer needs with minimal direction.
Responsibilities:
Understand the current-state data landscape, use cases, and the existing data lake and data warehouse setup
Work with Business Analysts, Data Analysts, Data Scientists, and AI Engineers to identify infrastructure and data roadmap needs, and propose the appropriate strategy in partnership with other IMO engineers
Assemble large, complex data sets in a format fit for each use case
Architect, develop, and optimize ETL pipelines using Python, Spark, EMR, Docker, and Airflow
Develop and optimize big data pipelines for data scientists (requires a basic understanding of data science concepts and ML)
Write generic Python/PySpark modules for processing data from various sources (XML, Parquet, CSV, relational)
Perform hands-on physical and logical database design and modeling in the context of data warehousing (currently using AWS Redshift)
Perform hands-on infrastructure design of ECDi's AWS data lake and data warehouse environment (gCORE), including continuous exploration and recommendation of new technologies and best practices
Research and recommend new, innovative methods and systems to manage data for business improvement
Participate in internal governance to drive the data quality business cycle and roadmap
Qualifications:
5+ years of programming experience (including functional programming); must be advanced in Python
3+ years of experience designing, building, and maintaining production data pipelines and/or data warehouses
Demonstrable experience working with different database types, including columnar, relational (SQL), and graph-based stores, and the ability to select the right tool for the job
Experience building and optimizing big data pipelines using Spark
Experience with AWS cloud services: S3, EC2, EMR, RDS, Redshift, Lambda, and EKS
Solid understanding of how to design robust data workflows including optimization and user experience
Strong analytical and problem-solving skills
Excellent oral and written communication skills
Able to work in teams and collaborate with others to clarify requirements
Strong coordination and project management skills to handle complex projects
Experience developing and working with XML, JSON, and external web services
Clinical drug development domain knowledge
Experience working with clinical and biomedical data types (clinical patient data, omics, imaging, etc.)
Competencies in applied statistics to solve business needs
Knowledge of industry data standards used in drug development, particularly in clinical development
Bachelor's or Master's degree in Computer Science or Software Engineering