AI Insights & Summary
Reddit is seeking a Staff Site Reliability Engineer to play a pivotal role in enhancing the reliability and scalability of its advertising platform. This is a high-impact position for an experienced SRE looking to provide technical leadership and shape the future of critical revenue-generating systems. With the flexibility to work remotely from the UK, the Netherlands, or Ireland, this role offers a unique opportunity to contribute to a globally recognized platform while enjoying a comprehensive benefits package and a culture that values diversity and inclusion.
About Reddit
Reddit is a dynamic "community of communities" platform, hosting authentic conversations across a vast range of shared interests. With millions of daily active visitors and over 100,000 active communities, Reddit is a significant source of online information and a growing advertising powerhouse.
The Ads Organization
The Ads organization powers Reddit's advertising platform, enabling advertisers to connect with engaged communities and driving Reddit's business growth. The reliability of these systems is paramount to advertiser success, revenue generation, and user experience.
The Ads Reliability Team
This team partners closely with Ads Engineering to enhance reliability, scalability, operational excellence, and developer productivity across Reddit's advertising ecosystem.
The Role: Staff Site Reliability Engineer
We are looking for a Staff Site Reliability Engineer to provide technical leadership for reliability initiatives across the Ads organization and help shape the future of Ads infrastructure at Reddit.
What You'll Do
- Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
- Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
- Drive architecture reviews and influence technical decisions impacting critical revenue-generating systems.
- Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.
- Participate in on-call rotations, lead complex incident investigations, and coordinate cross-functional response efforts during major production events.
- Identify systemic reliability risks and drive long-term solutions that improve platform resilience.
- Establish reliability metrics around advertiser-critical user journeys such as campaign creation, ad delivery, auction participation, reporting, attribution, and billing.
- Mentor engineers and provide technical leadership across multiple teams.
- Influence roadmap planning and ensure reliability considerations are incorporated into product and infrastructure investments.
What We're Looking For
- 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
- Strong experience supporting high-traffic, user-facing production environments.
- Deep understanding of distributed systems, networking, Linux systems, and cloud-native architectures.
- Experience designing highly available systems with strong operational and reliability practices.
- Strong understanding of observability systems including metrics, logging, tracing, and alerting.
- Good programming skills in languages such as Go, Python, or similar.
- Experience improving reliability through SLOs, automation, incident management, and performance optimization.
- Demonstrated ability to troubleshoot complex issues across a modern distributed system stack.
- Strong collaboration and communication skills with the ability to influence technical direction across teams.
Nice to Have
- Experience supporting advertising technology platforms or other large-scale revenue-critical systems.
- Deep understanding of reliability challenges associated with ad-serving, real-time auctions, budget pacing, campaign delivery, measurement, attribution, or billing systems.
- Experience operating high-QPS, low-latency services where latency directly impacts business outcomes.
- Experience establishing reliability programs that deliver meaningful, measurable business outcomes.
- Experience with Kubernetes, cloud infrastructure, and large-scale distributed systems.
- Familiarity with Kafka, ClickHouse, Spark, Flink, BigQuery, or similar large-scale data platforms.
- Experience partnering with Product, Data Science, and Ads Engineering organizations.
- Experience supporting machine learning inference or recommendation systems at scale.
Location
- Flexible-first workforce: remote from anywhere in the UK, the Netherlands, or Ireland.
Benefits
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support.
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Private Medical, Dental, and Vision Benefits
- Personal Retirement Savings Account with matching contribution
- Cycle to Work and Tax Saver schemes
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
AI and Data Privacy Notice
In select roles and locations, interviews may be recorded, transcribed, and summarized by AI. You will have the opportunity to opt out prior to any scheduled interviews. Personal information collected includes Identifiers, Professional and Employment-Related Information, Sensory Information (audio/video recording), and other shared information. This data is used to evaluate your application and will not be sold or disclosed for marketing purposes. Recordings will be deleted promptly after a hiring decision. Please refer to the Candidate Privacy Policy for more details.