Overview: At a high level, they are automating workflows in various areas of their Media Operations. As part of media campaigns, they pay Google and Meta to bring traffic to their site and need a way to track the effectiveness of that traffic. Google and Meta offer options for advertisers to send back feedback on users (i.e. how "useful" was this user). This is easier for an ecommerce company to measure: a user spent $x on its products after being directed from Google/Meta and thus was or was not useful. RVO has a much less clear-cut means of providing this feedback. They essentially need to determine whether the user who was directed to them has the particular disease for which the advertisement was shown (was it relevant to this specific user?), and they use the measured prevalence of the disease to determine whether the campaign is effective.
Their audience quality data comes from a third party and arrives in PDFs, emails, and other unstructured sources that are very messy. They are developing an AI-automated solution to process and ingest that data into a tabular format in Databricks (part 1). They are then building classic data science models to determine what users are doing on their site at an aggregate level, including a model that compares on-site behavior to audience quality (part 2).
Their tech stack is: Databricks for housing and processing data; AWS for foundation models and AI orchestration (Lambdas); Python; Jira; Asana; and Cursor or Claude Code for code development/enhancement. Snowflake is used in other parts of the org but not for this project.
Data engineering experience is not required. It may be good to have given the nature of the agents they are building, but the data engineering side will build out the actual schemas and the fact and dimension tables in Databricks, and will help them understand the partitioning and indexing.
Candidates should have worked with large datasets (billions of rows, roughly 100s of GBs to several TBs of data).
The immediate goal is to build an automated process, using AI-native tools, that can process the messy data from unstructured formats and load it into Databricks tables. The future state would be one where they could go directly from a PDF, extract the necessary data, and load it into a table.
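A minimal sketch of one step in that process: checking model output before it is loaded into a table. It assumes the extraction model returns one JSON object per document. The field names (`campaign_id`, `disease_prevalence_rate`, `audience_size`), the `AudienceQualityRecord` type, and the `parse_extraction` helper are hypothetical and only for illustration; they are not the team's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudienceQualityRecord:
    # Hypothetical fields; the real schema is owned by the data engineering side.
    campaign_id: str
    disease_prevalence_rate: Optional[float]  # fraction in [0, 1], None if absent
    audience_size: Optional[int]


def parse_extraction(raw_json: str) -> AudienceQualityRecord:
    """Validate one model response and coerce it into a typed record."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    campaign_id = data.get("campaign_id")
    if not campaign_id:
        raise ValueError("campaign_id is required")

    # Optional numeric fields: coerce when present, keep None when missing.
    rate = data.get("disease_prevalence_rate")
    if rate is not None:
        rate = float(rate)
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"prevalence rate out of range: {rate}")

    size = data.get("audience_size")
    if size is not None:
        size = int(size)

    return AudienceQualityRecord(str(campaign_id).strip(), rate, size)


# Example with a made-up response string.
record = parse_extraction(
    '{"campaign_id": " C-001 ", "disease_prevalence_rate": "0.12", "audience_size": 5000}'
)
```

Rejecting malformed responses at this boundary keeps bad rows out of the tables and gives the agent a clear signal to retry or flag the document for review.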
They want to deploy on-site and in real time, which will involve integration with the engineering team and the in-house tech stack. This role will not be responsible for any platform components or for building APIs to deploy the models into production. The goal is to get to a deployable model, so the role is more Data/AI Science in that sense, but they do not want code that lives only in notebooks; it should be production-grade code.
Background will be a blend of agentic AI experience and traditional data science.
AI agent will need to:
Extract key fields from the unstructured sources (PDF, email, attachment, etc.), such as audience quality metrics, disease prevalence rates, and campaign performance data, and convert them into a structured schema (Databricks tables).
Standardize and clean data: normalize formats across document types, resolve naming inconsistencies, and handle missing data/missing fields.
Enrich data: join extracted data with campaign data, Google/Meta data, and audience demographic data from third parties.
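The standardize and enrich steps above can be sketched in plain Python. Everything named here is an assumption for illustration: the alias map, the field names, the rule that bare numbers above 1 are percentages, and the sample rows are all made up.

```python
from typing import Optional

# Hypothetical alias map: different document types name the same field differently.
FIELD_ALIASES = {
    "campaign": "campaign_id",
    "campaign id": "campaign_id",
    "prevalence": "disease_prevalence_rate",
    "disease prevalence (%)": "disease_prevalence_rate",
}


def parse_rate(value) -> Optional[float]:
    """Convert '12%', '0.12', or 12 into a fraction; return None when missing."""
    if value is None or str(value).strip() in {"", "N/A", "-"}:
        return None
    text = str(value).strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    number = float(text)
    # Assumption: bare values above 1 are percentages, not fractions.
    return number / 100.0 if number > 1 else number


def standardize(raw: dict) -> dict:
    """Rename known aliases to canonical names and normalize the rate field."""
    row = {}
    for key, value in raw.items():
        name = key.strip().lower()
        row[FIELD_ALIASES.get(name, name)] = value
    row["disease_prevalence_rate"] = parse_rate(row.get("disease_prevalence_rate"))
    return row


def enrich(rows: list, campaigns: dict) -> list:
    """Left-join extracted rows to campaign metadata keyed by campaign_id."""
    return [{**row, **campaigns.get(row.get("campaign_id"), {})} for row in rows]


# Made-up sample data from two differently formatted documents.
extracted = [
    {"Campaign ID": "C-001", "Prevalence": "12%"},
    {"campaign": "C-002", "Disease Prevalence (%)": "N/A"},
]
campaigns = {"C-001": {"platform": "Google"}, "C-002": {"platform": "Meta"}}
clean = enrich([standardize(r) for r in extracted], campaigns)
```

At the data volumes mentioned for this role, the same logic would more likely be expressed as DataFrame operations in Databricks; plain Python is used here only to keep the sketch self-contained.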
Must Have:
Python
Agentic AI experience
LLMs (Bedrock foundation models)
Data science modeling experience
Databricks or Snowflake
Experience working with large datasets (100s of GBs to TBs, billions of rows)
AWS (Lambdas for orchestration)
Nice to Have:
Document AI/OCR familiarity
Experience in data engineering - unstructured data processing, data cleaning, standardization, joining/enriching datasets