AI Insights & Summary
Runware is at the forefront of building the API layer for next-generation AI products, offering a unique opportunity for a Staff/Senior DevOps Engineer to shape the infrastructure behind a high-performance AI inference platform. This role is ideal for seasoned engineers who thrive at the intersection of bare-metal, GPUs, networking, and automation, and who are passionate about building resilient, scalable systems that power cutting-edge AI applications. If you're looking to make a significant impact on a remote-first, fast-paced team and contribute to the rapid evolution of AI products, this is an exceptional career move.
Staff/Senior DevOps Engineer - AI Inference Platform
Runware is seeking a Staff/Senior DevOps Engineer to join our remote-first team and play a critical role in building, operating, and scaling the infrastructure for our global AI inference platform. You will be instrumental in enhancing the speed, resilience, and operational efficiency of our systems, supporting the rapid growth of AI products.
About Runware
Runware provides the API layer for the next generation of AI products, offering fast, reliable, real-time inference across thousands of models through a single, flexible API. Our platform enables customers to build and scale media generation products with improved performance, reduced costs, and simplified operations. We are built on a foundation of speed, reliability, and GPU scale.
About the Role
As a Staff/Senior DevOps Engineer, you will design, build, and operate the systems powering real-time AI inference across large-scale GPU fleets and a global production platform. This role goes beyond traditional DevOps and sits at the intersection of bare-metal infrastructure, GPUs, networking, automation, observability, and high-performance distributed systems. Your contributions will directly influence our ability to launch new models, scale customer traffic, recover from failures, and deliver low-latency AI experiences.
What You'll Do
- Build and scale the infrastructure for real-time AI inference across GPU fleets, bare-metal servers, and containerized production systems.
- Evolve the platform towards more elastic, on-demand infrastructure capable of rapid scaling with customer traffic and model demand.
- Enhance system speed, reliability, and resilience by optimizing critical paths including request entrypoints, inference services, queues, storage, load balancers, and networking.
- Automate infrastructure operations, covering provisioning, configuration, CI/CD, deployment safety, progressive rollouts, and rapid rollbacks.
- Develop the observability backbone for a high-performance AI platform, providing signals for early issue detection, capacity understanding, and proactive problem resolution.
- Take a leading role in production operations, incident response, debugging, and post-incident improvements that strengthen the platform.
- Improve infrastructure security and compliance through patching, secrets management, access controls, hardening, auditability, and repeatable operational processes.
Requirements
- Strong experience as a DevOps Engineer, SRE, Infrastructure Engineer, Platform Engineer, or similar, with a proven track record of running production systems at scale.
- Deep Linux knowledge and confidence in debugging real-world production issues across networking, storage, performance, services, and system behavior.
- Hands-on experience building automation, Infrastructure-as-Code, CI/CD pipelines, and deployment workflows.
- Experience operating high-availability, low-latency, or high-throughput platforms where reliability and performance are critical.
- Strong networking fundamentals (TCP/IP, DNS, load balancing, routing, firewalls, proxies, TLS, HTTP).
- A calm, pragmatic approach under pressure, with strong communication, good judgment, and a bias toward automation.
Bonus Points
- Experience operating GPU infrastructure for AI/ML inference (e.g., NVIDIA drivers, CUDA, container runtimes, GPU monitoring, capacity planning, workload isolation).
- Familiarity with inference serving and optimization frameworks (e.g., vLLM, TensorRT, Triton).
Benefits
- Generous paid time off (vacation, sick days, public holidays).
- Meaningful stock options.
- Remote-first setup, allowing you to work from anywhere we can employ you.
- Flexible hours outside of core collaboration blocks.
- Paid family leave (maternity, paternity, caregiver).
- Twice-yearly company retreats in inspiring locations.