Site Reliability Engineer (SRE)
Company Description
Experian is a global data and technology company dedicated to powering opportunities for people and businesses worldwide. We are innovators in lending practices, fraud prevention, healthcare simplification, marketing solutions, and the automotive market, leveraging our unique combination of data, analytics, and software. We also assist millions in achieving their financial goals by saving them time and money.
Operating across diverse markets including financial services, healthcare, automotive, agribusiness, and insurance, Experian invests in its people and advanced technologies to unlock the power of data. As a FTSE 100 Index company listed on the London Stock Exchange (EXPN), we employ 22,500 people across 32 countries, with our corporate headquarters in Dublin, Ireland. Learn more at experianplc.com.
Job Description
We are looking for a Site Reliability Engineer to enhance the reliability and performance of our business-critical systems. Reporting to our Head of SRE, you will concentrate on AWS cloud infrastructure, DevOps tooling, and core SRE practices within a distributed production environment.
Main Responsibilities:
- Leadership & Strategy
  - Define and implement SRE best practices across the organization.
  - Apply proven expertise in production support, engineering, disaster recovery (DR), automation, and cloud operations.
  - Mentor and guide a team of SREs, fostering their professional growth.
  - Collaborate with senior stakeholders to align reliability goals with business objectives.
- Reliability & Performance
  - Establish Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for critical services and ensure adherence.
  - Drive initiatives to improve system resilience and reduce operational toil.
  - Design systems that detect and remediate issues without manual intervention, including self-healing systems and runbook automation.
  - Use chaos engineering tools such as Gremlin, Chaos Monkey, and AWS FIS to simulate outages and improve fault tolerance.
- Incident Management
  - Serve as the primary point of escalation for critical production issues and lead major incident response, root cause analysis, and postmortems.
  - Conduct detailed post-incident investigations to identify underlying causes, document findings, and share learnings to prevent recurrence.
  - Implement preventive measures and continuous improvement processes.
- Observability
  - Champion monitoring, logging, and alerting strategies using tools such as Prometheus, Grafana, ELK, and AWS CloudWatch.
  - Build real-time dashboards to visualize system health and reliability metrics.
  - Configure intelligent alerting based on anomaly detection and thresholds.
  - Integrate metrics, logs, and traces to enable root cause analysis and reduce Mean Time to Resolution (MTTR).
  - Apply knowledge of AIOps or ML-based anomaly detection to manage reliability proactively.
- Collaboration
  - Work closely with development teams to integrate reliability into application design and deployment.
  - Promote a culture of shared responsibility for uptime and performance across engineering teams.
Qualifications:
- Deep expertise in AWS services.
- Advanced knowledge of monitoring and observability tools.
- Strong leadership capabilities with a focus on setting clear direction, aligning team efforts with organizational goals, and maintaining high levels of motivation and engagement across the team.
- Excellent communication skills, with the ability to articulate complex ideas, solutions, and feedback clearly to both technical and non-technical stakeholders. Adept at managing conflict constructively and facilitating consensus.
- Proven track record of building secure, mission-critical, high-volume transactional web-based software systems, preferably in regulated environments such as the finance and insurance industries.
- Hands-on technologist with software development experience and experience leading an SRE team.
Additional Information:
- Hybrid working: 2 days a week in our Nottingham Office.
- Compensation: Great compensation package and discretionary bonus.
- Core benefits: Pension, Bupa healthcare, Sharesave scheme, and more.
- Annual Leave: 25 days plus 8 bank holidays and 3 volunteering days. Additional annual leave can be purchased.
Experian's culture and people are important differentiators. We focus on what matters: DEI, work/life balance, development, authenticity, collaboration, wellness, reward & recognition, volunteering, and more. Experian's people-first approach is award-winning, including World's Best Workplaces™ 2024 (Fortune Top 25), Great Place to Work™ in 24 countries, and Glassdoor Best Places to Work 2024. Explore Experian Life on social media or our Careers Site to learn more.
Experian is proud to be an Equal Opportunity and Affirmative Action employer. Innovation is integral to Experian’s DNA and practices, and our diverse workforce drives our success. Everyone can succeed at Experian and bring their whole self to work, irrespective of their gender, ethnicity, religion, colour, sexuality, physical ability, or age. If you require accommodation for a disability or special need, please inform us at the earliest opportunity.
Experian Careers - Creating a better tomorrow together