Staff Data Engineer - Data Lake
At H1, we are driven by the mission to make the best healthcare information universally accessible. We leverage data and AI to unlock medical insights, improve patient outcomes, and accelerate equitable drug development. Learn more about us at h1.co.
Our Data Engineering team is responsible for developing and delivering our most critical asset: our data. We manage thousands of global data sources, ensuring accuracy, normalization, and timely delivery to keep pace with real-world changes and expanding market demands.
As a Staff Data Engineer on the Data Lake team, you will play a pivotal role in shaping the architecture, scalability, reliability, and long-term direction of our core data platform. This role is designed for a highly technical engineer eager to move onto an Engineering Manager track while remaining deeply hands-on.
The Data Lake is the foundation of H1’s platform, ensuring the validation, accuracy, standardization, and quality of data that powers all downstream products and teams. You will help lead the evolution of this platform while supporting and mentoring a growing team of engineers.
What You'll Do:
- Architect, build, and scale distributed ETL/ELT pipelines and large-scale ingestion frameworks across structured and unstructured healthcare datasets.
- Lead the evolution of H1’s Data Lake architecture, focusing on scalability, observability, reliability, and cost optimization.
- Own and improve data quality, validation, normalization, and standardization workflows for thousands of global data sources.
- Design and optimize batch and near real-time data processing frameworks using cloud-native distributed systems.
- Optimize distributed compute and storage systems, including Spark workloads, query performance, partitioning strategies, and infrastructure efficiency.
- Drive improvements in monitoring, governance, operational excellence, and production reliability across the platform.
- Troubleshoot complex production data and infrastructure issues across distributed systems.
- Partner closely with Product, Infrastructure, Security, Compliance, and downstream engineering teams to ensure scalable and secure data delivery.
- Mentor engineers through technical leadership, architecture reviews, and best practices.
- Help define technical roadmap priorities and contribute to long-term platform strategy and execution planning.
- Support production operations, incident response, and platform health as part of overall ownership of the Data Lake ecosystem.
About You:
You are a highly technical data engineer who thrives in lean, high-ownership environments and enjoys solving complex distributed systems challenges. You are excited by the opportunity to influence technical direction, mentor engineers, and grow into broader engineering leadership responsibilities while remaining hands-on.
- Deep experience designing and scaling distributed data platforms and large-scale pipelines in cloud-native environments.
- Expertise in building reliable, observable, and maintainable data systems supporting critical business and analytics workloads.
- Strong expertise in distributed processing, performance optimization, and modern data architecture patterns.
- Comfortable leading technical initiatives and influencing architecture decisions across teams.
- Communicate effectively with both technical and non-technical stakeholders.
- Enjoy mentoring engineers and raising the engineering bar.
- Energized by ownership, autonomy, and solving ambiguous technical challenges.
Requirements:
- 8+ years of experience in data engineering, software engineering, or related fields with significant experience building and scaling distributed data platforms.
- Demonstrated technical leadership experience, with interest in or experience with mentoring and leading engineers.
- Strong proficiency in Python (PySpark), Java, Scala, or similar programming languages.
- Advanced SQL expertise, including performance tuning and optimization across large datasets.
- Deep experience with Apache Spark and cloud-native big data platforms, preferably within AWS environments (EMR, Glue, S3, Athena, Redshift, or similar).
- Experience designing and scaling modern cloud-native data lake architectures and large-scale ingestion frameworks.
- Experience with orchestration and workflow management tools such as Argo, Airflow, or similar.
- Strong understanding of distributed storage systems, partitioning strategies, and file formats (Parquet, Avro, ORC).
- Experience with Docker, Kubernetes, and modern containerization technologies.
- Experience implementing monitoring, observability, and data quality frameworks in production environments.
- Experience with large-scale data cleaning, parsing, normalization, and validation workflows (preferred).
- Experience with healthcare, life sciences, publication, or large-scale entity-resolution datasets (preferred).
- Exposure to ML/AI-driven data enrichment, parsing, or validation workflows is a plus.
- Experience using AI-assisted coding tools (e.g., GitHub Copilot, Claude Code) to accelerate development is encouraged.
Compensation:
- $170,000 to $190,000 per year, based on experience, plus stock options.
H1 Offers:
- Comprehensive health insurance options and generous paid time off.
- Pre-planned company-wide wellness holidays.
- Retirement options.
- Health & charitable donation stipends.
- Impactful Business Resource Groups.
- Flexible work hours and remote work opportunities.
- The opportunity to work with leading biotech and life sciences companies in an innovative industry with a mission to improve healthcare globally.
H1 is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive workplace. We provide reasonable accommodation to applicants with disabilities. We may use AI tools to assist in the hiring process, but final decisions are made by humans.