**** Please note that candidates/consultants need to be on our W2; we cannot work on C2C for this position. ****
Title: Sr. Software Engineer (Data Engineer)
Duration: 6 Months
Client: Mayo Clinic
Req ID: 37363153
Remote
Scope: The resources will support an engineering team tasked with building a research data platform that will ingest research-generated data and make it discoverable.
Data Engineering Skills & Experience:
-Create, verify, and maintain data replication scripts
-Create, verify, and maintain data validation, processing, and ingestion pipelines
-Deploy and automate the execution of data replication scripts and data pipelines in cloud infrastructure
-Create and maintain data catalogs that describe datasets and their contents (e.g., files, file types, tables/views, columns, fields, etc.)
-Create, verify, and maintain dashboards and reports that characterize ingested datasets
-Create, verify, and maintain data validation scripts/APIs that verify the production dataset contains the correct number of samples/records, expected values/fields/columns are populated, and values are of the correct data type, format, and range
-Deploy and automate the execution of data validation scripts/APIs
-Create and maintain user documentation (dataset descriptions, tutorials, code examples, etc.)
-Define entitlements, user groups, roles, and permissions utilized to grant access to datasets
Programming Languages:
Primary pipeline development language will be Python.
Some data types and formats may require the use of other languages (e.g., Java, R, etc.) because the libraries/frameworks/SDKs needed to work with those data types and formats are not available in Python.
Operating Systems:
Primary operating system for data pipeline execution will be Linux, with data pipelines packaged, deployed, and run as containers.
Data source systems could be Windows or Linux based.
Infrastructure:
Primary data platform and data pipeline execution infrastructure will be hosted on Google Cloud Platform (GCP) utilizing cloud-native technologies (e.g., Google Cloud Storage, BigQuery, Google Batch, Dataflow, Cloud SQL, etc.).
Data will be replicated from various on-premises sources that include laboratory instruments, network shared drives, and Windows desktops attached to instruments.
Development Tools:
Sprints, features, and tasks will be managed in Azure DevOps.
Code will be managed and versioned in Azure DevOps-based Git repositories.
Code will be compiled, packaged, and deployed utilizing Azure DevOps build pipelines.
Data pipelines will be packaged, deployed, and run in Docker containers.
Docker containers will be stored and versioned in Google Cloud Artifact Registry repositories.
Veracode will be utilized to scan source code for vulnerabilities and Prisma Cloud will be utilized to scan containers.
The standard integrated development environment will be JetBrains (PyCharm, IntelliJ, etc.) or VS Code.
Preferred Candidates:
-Experience working on healthcare, life science, or scientific research projects
-A degree or domain knowledge in a life-science-related field (biochemistry, genetics, biology, etc.)
-Experience with Google Cloud Platform-based infrastructure and services
100% remote.
Mayo will provide equipment.
Education: Bachelor's Degree in Computer Science/Engineering or related field with 5 years of experience as noted below; OR an Associate's degree in Computer Science/Engineering or related field with 7 years of experience.