Role: AWS Cloud Data Engineer
Location: Boston, MA (Hybrid Onsite)
Industry: Financial Services
About the Role
We are seeking a highly skilled Cloud Data Engineer to design, build, and optimize a modern, scalable Legal Data Lakehouse platform. Operating within State Street's Global Technology Services, you will combine deep knowledge of AWS cloud services with high-performance Databricks capabilities to ingest, model, and secure complex enterprise data (including contracts, litigation matters, eDiscovery datasets, and global regulatory feeds).
This role is central to establishing a single, highly governed, audit-ready source of truth that powers legal operations, compliance analytics, and emerging generative AI/ML use cases across our global footprint.
Key Responsibilities
Data Lakehouse Engineering & Architecture:
- Design, build, and maintain enterprise-grade, custom data pipelines utilizing Databricks (PySpark, Spark SQL, and Scala) on AWS infrastructure.
- Implement and manage a multi-layered Lakehouse architecture (Bronze, Silver, and Gold zones) to curate unstructured contract text, semi-structured logs, and highly structured transactional tables.
- Architect robust end-to-end data ingestion frameworks supporting high-throughput batch and near real-time data flows from on-premises systems and third-party legal platforms.
Cloud Infrastructure & Platform Optimization:
- Utilize the broad suite of AWS services (including but not limited to S3, Lambda, Glue, EMR, Athena, EC2, and CloudWatch) to support and optimize distributed storage and compute infrastructure.
- Conduct advanced performance tuning on large-scale Apache Spark workloads, optimizing partitioning, indexing, caching strategies, and Databricks cluster utilization to control cloud run costs.
- Automate deployment configurations, orchestrate multi-dependency workflows (via Databricks Jobs/Workflows, Airflow, or Autosys), and build containerized solutions using Docker.
Data Governance, Security & Compliance:
- Enforce strict, fine-grained access controls, row/column-level security, and data classification strategies using Databricks Unity Catalog integrated with AWS IAM and enterprise identity providers.
- Ensure all data pipelines and lakehouse layers remain strictly compliant with global data privacy regulations (e.g., GDPR) and rigid internal financial audit standards.
- Implement end-to-end data lineage tracking, validation frameworks, and automated reconciliation routines to preserve absolute data integrity for legal and regulatory reporting.
Downstream Integration & Innovation:
- Collaborate with business analysts and legal operations to expose curated datasets via secure APIs and optimized connectors.
- Enable seamless consumption of financial and legal analytics through integration with visualization tools such as Power BI and automation platforms such as Power Apps and Power Automate.
- Support data readiness for advanced AI/ML models, contract intelligence tools, and eDiscovery search workflows.
Required Skills & Qualifications
Core Technical Skills:
- Databricks & Spark: 3+ years of deep, hands-on experience building, scheduling, and debugging data pipelines on Databricks utilizing PySpark, Scala, or Spark SQL.
- AWS Cloud Suite: Extensive knowledge of AWS core services, with deep familiarity across object storage (S3), serverless compute (Lambda), data cataloging/ETL (Glue), access management (IAM), and encryption (KMS).
- Data Modeling: Strong proficiency in relational database design, data warehousing structures, schema evolution, and performance tuning, along with open table formats (e.g., Delta Lake, Apache Iceberg).
- Programming & Scripting: Strong coding skills in Python and advanced SQL are mandatory.
- CI/CD & DevOps: Proven familiarity with version control (Git) and standard automated deployment workflows.
Domain & Professional Value-Adds:
- Regulated Industries: Experience in Financial Services, Asset Management, or handling highly sensitive, audit-driven data environments is highly preferred.
- Legal Data Concepts: Familiarity with legal data constructs such as contract clauses, corporate matter management, or metadata extraction is a significant advantage.
- Ownership Mindset: Excellent communication skills, with a track record of collaborating across global, distributed engineering and business architecture teams.