Building More Scalable and Reliable eCommerce Sites With SRE

If you want to delight customers, SRE is not optional; it is a business imperative.

SRE, or Site Reliability Engineering, is a software engineering approach to IT operations that ensures scalable and highly reliable software systems. Websites—such as the one you’re reading right now—have to be reliable and consistent to provide the user experience customers want, which is why SRE is so important to delighting customers.

I see SRE as the thread that connects everything we do from an eCommerce ecosystem standpoint. It spans hundreds of applications, thousands of services and many different boundaries. It ensures that if anything goes wrong in our ecosystem, the frameworks, instrumentation, monitoring and alerting are in place to notify the teams that need to react as quickly as possible. That, in turn, reduces the mean time to resolve (MTTR) service incidents.
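As a rough illustration of the metric, MTTR can be computed as the average gap between when an incident is detected and when it is resolved. The sketch below assumes incidents are recorded as (detected, resolved) timestamp pairs; the function name and the sample timestamps are made up for illustration and are not real incident data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the time between detection and resolution across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Illustrative data only: two hypothetical incidents.
incidents = [
    (datetime(2021, 1, 5, 9, 0), datetime(2021, 1, 5, 9, 30)),    # 30 minutes
    (datetime(2021, 2, 1, 14, 0), datetime(2021, 2, 1, 15, 30)),  # 90 minutes
]
print(mean_time_to_resolve(incidents))  # 1:00:00
```

Faster alerting shortens the gap in each pair, which is how direct notification of the owning team pulls the average down.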

Using SRE, we have substantially reduced downtime for our Global eCommerce Experience and Commerce Platform experiences and lowered MTTR across the commerce ecosystem. For example, we had 100 percent availability during our five-day Black Friday/Cyber Monday holiday period in 2021, a time when customers rely on our sites to be there for them.

An Immune System for IT Operations

Our SRE efforts began with a push to improve site reliability, both by making sure our eCommerce systems were reliable and by achieving faster incident response to reduce downtime. We set an initial goal of 99.9 percent availability, which allows about 8.77 hours of downtime a year, and looked for what we could do differently.
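The arithmetic behind that figure is simple enough to sketch. The snippet below assumes a year of 365.25 days (8,766 hours), which is what yields roughly 8.77 hours at 99.9 percent; the function name is illustrative, and the other targets are included only for comparison.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, averaging in leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")
```

Each additional "nine" cuts the budget by a factor of ten, so moving from 99.9 to 99.99 percent leaves under an hour of downtime per year.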

A key strategy was to move away from our traditional, siloed operation support process, in which various levels of support engineers (L1, L2, etc.) responded to incidents for individual services that were not connected. Support teams generally learned of issues through customer complaints. Investigating the problem, determining the cause and then fixing the issue could go through several service levels and take far too much time. We also lacked a cohesive view of the customer experience across our organization and beyond.

As a result, we began to explore an SRE methodology. Using third-party vendor tools and customizations, we built instrumentation and dashboards to form a single pane of glass. We reduced L1 and L2 support and had dashboards alert our engineering teams directly to reduce the number of hand-offs.

Once we were able to flag incidents in real time, we looked at whether there were infrastructure issues that could be healed automatically, without human intervention. The team built out technologies, processes and systems to self-heal where possible. If a virtual machine went down, for example, we created a solution to restore it automatically.
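The general shape of a self-healing check can be sketched as follows. This is a minimal illustration, not the tooling described here: the health check, restart and escalation steps are passed in as hypothetical callables, and the retry limit is an assumed parameter.

```python
from typing import Callable


def self_heal(
    name: str,
    is_healthy: Callable[[str], bool],
    restart: Callable[[str], None],
    escalate: Callable[[str], None],
    max_restarts: int = 3,
) -> str:
    """Try to restore an unhealthy machine; alert a human only if that fails."""
    if is_healthy(name):
        return "healthy"
    for _ in range(max_restarts):
        restart(name)
        if is_healthy(name):
            return "healed"
    escalate(name)  # automation gave up, hand off to the owning team
    return "escalated"


# Simulated environment: the machine recovers after the second restart.
state = {"restarts": 0}


def fake_is_healthy(name: str) -> bool:
    return state["restarts"] >= 2


def fake_restart(name: str) -> None:
    state["restarts"] += 1


paged: list[str] = []
result = self_heal("vm-01", fake_is_healthy, fake_restart, paged.append)
print(result, paged)
```

Keeping escalation as the last step means engineers are alerted only for the failures automation cannot resolve on its own.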

As our team created an SRE network, we linked into numerous other IT functions. The application services and systems that come together to provide our customer experience are distributed. That means they’re owned by multiple different organizations. So, while I own the experience, other teams own the platform, the underlying services, the infrastructure, databases, etc. Anything can go wrong at any point in any of these areas. SRE’s number one job is to make sure that our mechanisms and instrumentation are in place to alert the right team, as fast as possible, to reduce the downtime.

Think of SRE as being similar to the body’s immune system. Whether you have a cut or a virus, it basically detects the change and responds to it. SRE monitoring looks for changes that cause disturbances across an organization’s entire ecosystem and reacts to them. Just like an immune system would do, once we identify changes that cause disturbances, we learn from those outages and implement things that will proactively prevent them.

The other thing that SRE does is manage our infrastructure, automation and migration. For example, SRE is responsible for moving our workloads. Last year, we moved everything from our old data center to our modern software-defined data center. We are talking about thousands of virtual machines, hundreds of applications and hundreds of microservices. And SRE did that with zero downtime. 

SRE Benefits and Insights

Several years into our SRE transformation, we have reached 99.9 percent availability and, in a lot of cases, 99.95 percent. Overall, we achieved our highest availability ever, our highest customer satisfaction ever and our lowest MTTR. The move has also helped us increase the website's speed because we migrated to a much better infrastructure. And, of course, we have better alerting and can find and fix issues faster. We are still striving to improve some self-healing capabilities.

These are all strong reasons to pursue SRE, but it does come with challenges.

Interested in adopting SRE in your organization? The biggest hurdle you will likely face will be cultural, not technical. SRE means your organization is less reliant on traditional support, which requires shifting away from longstanding IT roles and skillsets. Our L1 and L2 critical operation team members are empowered by monitoring and observability and have refocused on building software and tools to give insight to our SRE team.

As with most transitions, you need to begin your SRE effort by setting goals across the organizations in your ecosystem. Will you strive for 99.9 percent availability or 99.99 percent? You then need specialized site reliability engineers for this unique practice. From there, you need to establish and empower SRE leaders to drive the change, which will require executive support.

And you have to do all these things without losing empathy for people who have helped you get so far with traditional support. You need to drive this change with the understanding that you are offering reskilling options and taking those teams along with you on this journey.

But the results are worth it. SRE not only makes the systems we rely on more resilient, it also helps us get ahead of potential issues in the future. SRE makes us better every day, creating more delighted customers and building lifetime value with them. To learn more about our SRE practice inside Dell, read Keeping Our Sites Up and Running with SRE.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.