Role: Scientific Data & Knowledge Engineer (Knowledge Graph Engineer)
Location: Remote
Role Overview
The Scientific Data & Knowledge Engineer is a specialist role at the intersection of data engineering, semantic technologies, and scientific domain knowledge. This individual is responsible for maximising the value of scientific data assets over their lifetime - acting as a translator between domain experts in R&D and the technical data systems that underpin research and discovery.
Working closely with Product Managers and R&D Subject Matter Experts, this role defines the language of science in data - through data models, ontologies, and controlled vocabularies - and ensures that scientific knowledge is structured, indexed, and interoperable across data products. The engineer serves as the voice of the Knowledgebase, championing the value and long-term usability of data assets.
Key Responsibilities
Metadata Harmonisation & Curation
- Lead metadata harmonisation, curation, and large-scale dataset ingestion workflows.
- Design and implement structured, auditable data transformations ensuring traceability and reproducibility.
- Develop and maintain schema-driven automation pipelines (e.g., JSON Schema) to enforce data quality and consistency.
Ontology & Semantic Standards
- Perform ontology alignment and entity normalisation using services such as the Ontology Lookup Service (OLS).
- Develop and maintain vocabularies, ontologies (e.g., RAO), and controlled terminologies in collaboration with scientific SMEs.
- Apply semantic web technologies including RDF/OWL triple stores, SHACL, and LinkML for knowledge representation.
- Leverage knowledge graph and semantic query capabilities (e.g., Neo4j, GraphDB, SPARQL) where applicable.
Data Engineering & Pipeline Delivery
- Engineer robust API and ETL pipelines for scientific data ingestion, transformation, and delivery (e.g., FastAPI, PostgreSQL).
- Implement URI generation strategies and graph embedding machine learning pipelines.
- Execute data engineering workloads on cloud infrastructure, primarily Google Cloud Platform (GCP), including BigQuery and Google Cloud Storage (GCS).
- Adopt Infrastructure as Code (IaC) practices for scalable and repeatable platform deployment.
Collaboration & Knowledge Translation
- Partner with Product Managers and R&D scientists to translate complex scientific concepts into robust, fit-for-purpose data models.
- Act as the authoritative voice of the Knowledgebase - ensuring interoperability, reusability, and long-term value of data assets.
- Contribute to and champion data governance standards, documentation, and best practices across the organisation.
- Engage proactively with cross-functional stakeholders to align scientific terminology with technical data product requirements.
Technical Skills & Requirements
Languages & Query
- Python
- SQL
- SPARQL
- Scala (advantageous)
Semantic Technologies
- RDF / Triple Stores
- OWL
- SHACL
- LinkML
- Ontologies (RAO)
Platforms & Infrastructure
- Google Cloud Platform / BigQuery
- Google Cloud Storage
- Infrastructure as Code
- ETL Processes
Tools & Technologies
- GitHub / GitLab
- Apache Jena
- Protégé
- Jira / Confluence
- FastAPI
Data Engineering Competencies
- Metadata harmonisation and large-scale structured data ingestion.
- URI generation and entity resolution at scale.
- Graph embedding and machine learning pipeline integration.
- Knowledge graph construction and semantic query optimisation.
Qualifications & Experience
Essential
- Degree in Computer Science, Bioinformatics, Information Science, or a related scientific/technical discipline.
- Demonstrable experience in data engineering with a focus on scientific or research data environments.
- Hands-on expertise with semantic web technologies: RDF, OWL, SPARQL, ontology development and alignment.
- Proficiency in Python and SQL; experience with Scala is advantageous.
- Practical experience with cloud platforms, particularly Google Cloud Platform, BigQuery, and related data services.
- Strong understanding of metadata standards, data harmonisation, and controlled vocabulary management.
- Experience working with knowledge graph technologies (Neo4j, GraphDB, Apache Jena, or similar).
Desirable
- Experience within a pharmaceutical, life sciences, or research-intensive organisation.
- Familiarity with domain-specific ontologies such as RAO, ChEBI, or similar scientific vocabularies.
- Working knowledge of LinkML and SHACL for schema definition and validation.
- Exposure to MLOps practices and graph embedding or ML pipeline delivery.
- Experience with Protégé for ontology authoring and management.