AI Insights & Summary
Docker, a cornerstone in developer tooling, is revolutionizing software development with its platform and embracing the AI-driven future. This Staff Engineer role represents a significant opportunity to shape the foundational infrastructure that powers millions of developers worldwide. By driving self-service systems, enhancing multi-region capabilities, and pioneering AI-assisted operations, you will play a pivotal role in building a more reliable, secure, and efficient platform. If you are a seasoned infrastructure leader passionate about enabling developer productivity and are excited by the prospect of defining the next generation of cloud platforms, this is a compelling career move.
Staff Engineer, Platform
Docker is seeking a highly experienced Staff Engineer to lead the development and evolution of its internal platform. This role is critical in building robust, scalable, and self-service infrastructure that empowers hundreds of engineers across the company.
About Docker
Docker is a globally distributed, remote-first company trusted by over 20 million monthly users and serving billions of container image pulls. We build the essential tools that define how software is created and delivered, placing us at the forefront of AI's impact on software development. Our platform supports hundreds of engineers and handles high-scale production traffic and data transfer daily.
The Challenge
Our platform has grown rapidly, and the current priority is to transition from expert-driven support to "paved roads": self-service systems with clear ownership, safe defaults, and strong guardrails. The goal is to create a platform that teams trust implicitly, allowing them to focus on their products rather than our infrastructure. Key objectives include reducing the time to provision new global regions or application environments from days to hours, establishing a robust multi-region network architecture, and implementing a trusted testing and continuous deployment flow.
Responsibilities
As a Staff-level engineer, your impact will be measured by your leverage and technical leadership. You will:
- Translate ambiguous infrastructure problems into actionable proposals, driving them through RFCs and cross-team architecture reviews.
- Design and implement self-service capabilities and platform APIs (primarily in Go) for provisioning, deployment, observability, and day-2 operations.
- Establish delivery standards using Terraform, GitOps with Argo CD, progressive rollouts, and comprehensive testing, including building a missing continuous deployment flow.
- Enhance multi-tenant EKS foundations for reliability, security, scale, and cost-efficiency, including Envoy Gateway ingress and multi-region connectivity.
- Improve SLOs, alerting, and incident response on Grafana Cloud to enhance production stability and reduce reliance on heroics.
- Contribute to AI-assisted operations by shaping the role of AI in areas like alert enrichment, incident context gathering, runbook-assisted diagnosis, and onboarding assistants.
AI-Assisted Operations
We are investing in AI-driven workflows to reduce operational toil, ensuring they are safe, auditable, and human-reviewed. Your role will involve shaping the implementation of AI for:
- Alert enrichment and incident context gathering.
- Runbook-assisted diagnosis and remediation recommendations.
- Onboarding and readiness assistants.
On-Call
This role includes participation in the on-call rotation. As a Staff engineer, you will also focus on improving the on-call experience through better alerts, runbooks, reduced toil, and blameless postmortems.
Qualifications
- 8+ years of professional software engineering experience in backend, infrastructure, or platform engineering.
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
- Strong software engineering skills in Go or a similar language.
- Proven track record of designing, shipping, and operating cloud services or infrastructure platforms.
- Deep expertise in at least one of: Kubernetes, networking, cloud platforms, reliability engineering, or developer platforms, with solid Linux, networking, and production-ops fundamentals.
- Experience setting technical direction and leading cross-team alignment.
- Clear written and verbal communication skills for remote collaboration.
Nice to have: EKS, ingress/CNI/service-mesh experience; observability tools (OpenTelemetry/Prometheus/Grafana); CI/CD and progressive delivery (Argo CD, canaries); experience leading cross-team migrations or adoption programs.
What to Expect
- First 30 Days: Build context, meet teams, ship your first change, shadow on-call.
- First 90 Days: Own a strategic platform problem, lead an improvement from design to production.
- One Year Outlook: Lead a major cross-team initiative, such as self-service provisioning or multi-region networking foundations, establishing durable patterns for service development and operation.
Perks & Benefits
- Freedom & flexibility in work schedule.
- Designated quarterly Whaleness Days and end-of-year break.
- Home office setup support.
- 16 weeks of paid parental leave (after 6 months).
- Technology stipend ($100 USD net/month).
- Generous PTO plan.
- Training stipend for conferences and courses.
- Equity in a growing startup.
- Docker swag.
- Medical benefits, retirement, and holidays (vary by country).
- Remote-first culture.