Today my colleague and I will talk about how to prepare for Black Friday. I would imagine most of you have never heard of ASOS, so I'm going to talk about who we are. We were established in 2000 and are currently a top online fashion retailer. We are based in London and have roughly 4,000 staff. We build our own website, our mobile applications for both Android and iOS, and the supporting services in-house. In terms of customer traffic, mobile accounts for the majority, over 50 percent of all requests. Our mission is to be the number one fashion destination for twenty-somethings. For as long as I can remember, almost seven years, we have grown 25 to 30 percent every year, and last year we turned over almost $3 billion. So before you think "ha, Black Friday is easy to cope with," the next couple of slides will illustrate the scale of volume that we operate at. We currently have roughly 87,000 products live on our website and add about 4,000 every week. We have 16 million active customers, and in April we served just over 9 billion requests in total. To make Black Friday a bit tougher to prepare for, we have a landscape of six engineering teams responsible for 40 services, and each service is made up of numerous components. Each team owns its services all the way from design through to production support, and releases several times a day; in 2018 so far we have done just over 1,000 releases. Given this level of change, understanding the landscape is critical to verifying where we are. The Black Friday weekend accounts for about ten percent of our yearly revenue, so to say it's important to us is a bit of an understatement. A technical failure for a handful of hours is quite expensive: being offline for approximately an hour on Black Friday costs about $4 million, and that doesn't count the damage to our reputation or the cost of lost customers. The graph on the slide shows that growth over the years.
On Black Friday in 2018, we served 1.9 million orders. Our product API peaked at four and a half thousand requests per second, and pages were viewed four and a half million times. So how do we prepare for Black Friday? Handling such a load doesn't happen by chance; a lot of planning and effort goes into it. We run a distributed estate in Azure across multiple regions, and services scale independently and deploy independently. Distribution also helps us in other ways. First, it improves customer experience: simply serving from somewhere close to the customer means spending less time travelling up and down the wire. More important for us is high availability: if one region fails, we can divert traffic to another region. If you ignore everything else we talk about today, don't ignore this: monitor everything. Have your applications instrumented and alerting in place, with your best people ready to work on issues. Design to be fault tolerant and test it; you don't want to find out on Black Friday that your system can't cope with the load, because at that point you are blind. Distributed systems are quite different from monolithic data-center applications. In a monolithic application, failures tend to be catastrophic in nature; in a distributed system, you always have a constant background of failures and network errors. This further raises the bar for your monitoring and alerting stack, because you might not be aware of the number of customers impacted by small issues unless your stack is up to spec. When you design your applications, don't just consider the happy path: treat failure as the more important set of test conditions. Figure out your operational levers. For example, if your service depends on another service running smoothly, you are going to need a lever in place: instead of calling the service directly, you can put a message on a queue for the service to pick up later. Your operational levers need to be tested, and you need to decide up front whether triggering them is automatic or manual. You also need to verify your health probes and that traffic routes as expected.
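The "operational lever" described above, falling back from a direct service call to a message on a queue, can be sketched as follows. This is a minimal illustration, not ASOS's implementation; the names `submit`, `call_downstream`, and `work_queue` are all hypothetical, and the downstream dependency is simulated as permanently unavailable so the fallback path is exercised.

```python
import queue

# Hypothetical names for illustration; the talk does not name these components.
work_queue: "queue.Queue[dict]" = queue.Queue()

def call_downstream(payload: dict) -> None:
    """Stand-in for a synchronous call to a dependency; simulated as
    always down here so that the fallback path runs."""
    raise TimeoutError("downstream service unavailable")

def submit(payload: dict, lever_enabled: bool = True) -> str:
    """Try the direct call; if it fails and the operational lever is
    enabled, put the message on a queue so a worker can process it
    once the dependency recovers."""
    try:
        call_downstream(payload)
        return "processed"
    except TimeoutError:
        if not lever_enabled:
            raise
        work_queue.put(payload)  # deferred: a worker drains this later
        return "queued"
```

The point the speakers make is that the lever itself, here the `lever_enabled` flag and the queue behind it, must be rehearsed under load, not just written.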
For example, if SQL Azure starts failing, can you reroute traffic to a different region? As mentioned previously, it is vital that you understand your application's state at all times and can answer questions such as: can it scale? Is it resilient? The only way to do this is to continuously test your applications. This slide shows the high-level ASOS architecture. Starting from the top, it shows the mobile and web channels that customers interact with. Next you have the digital services, which provide functionality to the customer-facing channels, such as payment processing and retrieving product details. Some digital services utilize third-party offerings for functionality. Integration platforms integrate with our enterprise systems and other third parties, and additionally have the responsibility of publishing data for other systems to make use of. Okay, this diagram shows a typical ASOS microservice in a bit more detail. Services run on Service Fabric clusters today and will ultimately end up in a container-based world, in time. Storage is SQL Azure, Cosmos DB, Azure Storage, or a combination of all of them, to meet each service's needs; additionally, services typically publish events via Service Bus. As you can see, each service is responsible for its own logging and monitoring stack. At the time we built this, the Azure offering was pretty weak; that has changed, so now we're moving to Application Insights. I want to point out that none of these services work in isolation: each one is called by something, or calls something else, or both in most cases. Okay. In this slide, I want to get across how complex service interactions can be and, for example, how we handle failure. Diagram one shows our bag service and its interactions. The bag service is responsible for storing the items customers add to their bag, in addition to managing state during checkout: delivery addresses, delivery options, and any vouchers the customer has used.
As you can see, the bag service depends on a large number of other services, and considerable time is spent on how we keep the bag service available if any of those dependencies suffer. A typical interaction: the customer selects the product they want to add to their bag; this invokes the bag service, which retrieves the product details from the product service, and once the product details have been retrieved, the bag service calls the stock service to reserve the stock for the customer. The product service is one of our most utilized services, so to reduce the load and protect the bag service from degradation, we can skip the stock reservation for the customer. This does degrade the experience, because the customer will think they have something added to the bag when that part of the order may later be cancelled, but we find that preferable to not taking orders at all.
>>Okay. Thanks for listening to me; now over to Cat, who is going to talk about how we validate this in detail.
>>Thanks. Hello everyone. Four years ago we were at the beginning of our journey of replatforming our website to a microservice architecture. Following that, I set about building an in-house performance testing capability. I was responsible for ensuring ASOS had feasible environments in place and tests to support a high level of change across all engineering teams. We have matured significantly over the last three years into a small permanent team. Back to Black Friday: how do I sleep at night knowing a major sale is looming? Because I know that we're ready, and ready at any time. If we decided to have a flash sale tomorrow, each team would know how many cloud service instances they need to scale out to and what database size they need to scale up to. We achieve this by testing at full peak volume and on peak infrastructure every single week.
I would be able to report when we last rehearsed each of our operational levers, based on schedules prioritized against each of our core components, and I know that any major change brings its own specific non-functional requirements. I also know that the teams that develop the components for the website are responsible for supporting those components in production, and they have a strong desire to sleep at night too. Therefore they test early and often; even minor changes can be performance tested within a team's pipeline before they reach the shared integration environment. I am here to explain how we have gone about this. Firstly, I will explain at a very high level what our performance testing capability looks like. Starting on the right, in production: production is a great load test environment. Our customers are putting load on it all the time, 24/7, every day of the year, and we can use this to get real performance and utilization metrics. We use a wide variety of tools, including technology-specific ones, to gain insight into how services and apps are performing, and that information is fed back into the engineering life cycle. Outside of production, we have a full-scale, production-like environment where we run load tests using full customer journeys. This environment is on the release path for our digital teams. To do this testing, we need: a test environment that fully represents production; a model that tells us how customers use our site, with user journeys captured in a workload model; test scripts that accurately reflect that model, so we can execute tests at the volumes required for specific test requirements; and finally, but importantly, a monitoring solution that gives feedback on customer experience, utilization, and application behavior under load. As this is one huge shared environment, we also have to test earlier in the pipeline, before this environment and production.
It would be impossible to do all our performance testing this late without earlier testing owned by the development teams, so that's what we do. My team works with all our development teams to build performance engineering capability so they can complete realistic service-level testing using the same tools and a dedicated environment. We will come back to what teams do at service level in a later slide, but first I'm going to expand on each of these. Firstly, our test environment. This diagram represents a content delivery network in front of web apps and APIs which contain the logic. Our test environment has to be as close to production configuration as possible, to be sure that testing doesn't give false positives or negatives. Consider scaling: it mirrors production's scaling for the scenario on any given day. We don't rely on autoscaling as a general rule for all cloud services, because of the time it takes to initiate a new service instance; traffic can spike faster than we could scale to support. This will likely change. All of our services run in two regions, in northern Europe and the US. To be sure there is full regional resilience, each region has to be able to support full load, so we test full peak traffic through a single region, as shown on the previous slide. For most teams, the builds in our environment are in line with production. As a safety net, we also have a job that runs nightly to catch out-of-date builds, comparing what's in production with what's in test. Configuration changes should also be made through the same platform and deployment definitions used for tests, for example when initiating a kill switch in a resilience scenario or when scaling out in a scalability test. Additionally, we have many complex considerations designed into our tests, whether prerequisite steps or programmed effects on parameters to ensure the right ratios. On the environment side, we sync data from production to our test environment to keep it up to date.
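The nightly safety-net job described above, catching test builds that have drifted from production, amounts to comparing two version maps. A minimal sketch, with a hypothetical `find_drift` helper and made-up service names; the real job presumably reads versions from deployment APIs rather than dictionaries:

```python
def find_drift(prod: dict, test: dict) -> dict:
    """Return services whose deployed version differs between
    production and the test environment (or is missing from test)."""
    return {
        svc: {"prod": ver, "test": test.get(svc)}
        for svc, ver in prod.items()
        if test.get(svc) != ver
    }

# Example: the stock service in test is one build behind production.
drift = find_drift(
    prod={"bag": "1.2.0", "stock": "3.0.1"},
    test={"bag": "1.2.0", "stock": "3.0.0"},
)
```

Anything the job reports would then be redeployed so the test environment keeps giving trustworthy results.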
At the moment, while we still have some legacy estate, this is quite painful, but in the future we are hoping to be fully automated. We have many third parties that we interact with; for example, we use third parties for customer logins and data tracking, and we have many delivery partners. Our third parties have to be able to cope with peak load on their infrastructure too. As you can imagine, this environment is a costly asset, so we scale it down when we can. In the future we will be able to stand up test environments on a daily basis and tear them down at the end; we still have some work to do to get there. Next, our workload model. We need to understand how our customers use our site, in peak conditions and in normal conditions, so that we can reflect this in our tests. We construct the model from a combination of analytics: page-level hits and the mix of journeys. This informs the lower-level workload models teams use for their service-level tests, and allows us to verify that our tests ramp appropriately. We also use the data to identify heavily used endpoints. We will have spikes of up to 50 orders per second that we need to ensure we can support. Financial growth is 25 percent year on year, but Black Friday peaks are 40 to 50 percent bigger every year, and we need to ensure that we can scale to support that; we add another 30 percent headroom on top. You might wonder why an online retailer needs a complex workload model: surely a customer goes to the site, adds an item to the bag, and checks out? But we have different clients, iOS and Android apps as well as our web apps, and they hit our endpoints differently. We have eight different sites (UK, US, Germany, et cetera), many different languages and currencies. Just to get to a product page, a customer could browse by category, jeans or whatever they like, or land on it directly from a Google search. Conversion rates differ too; during our sales periods they are typically a lot higher, so that means a higher order rate. Basket size is different as well.
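The capacity target implied by those growth numbers is simple arithmetic worth making explicit. A sketch, using the figures from the talk (40 to 50 percent year-on-year peak growth plus 30 percent headroom); the function name is mine, not ASOS's:

```python
def target_peak(current_peak: float, yoy_peak_growth: float = 0.5,
                headroom: float = 0.3) -> float:
    """Load-test target for next peak: this year's peak grown by the
    Black Friday year-on-year rate, plus extra headroom on top."""
    return current_peak * (1 + yoy_peak_growth) * (1 + headroom)

# e.g. a 50 orders/sec spike today -> 50 * 1.5 * 1.3 = 97.5 orders/sec target
```

So a system that comfortably handles today's spikes still needs to demonstrate roughly double that rate before the next peak.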
Different customers check out with different numbers of items in their basket, and that significantly affects our downstream calls and calculations. We have many different payment methods, served by different microservices. We also have additional microservices that handle discounts and subscriptions, and they have sales and non-sales periods too. So what do we do with all this information? First, scripting, using Visual Studio web tests. Visual Studio isn't fully featured out of the box for this kind of load testing, and it has had its technical challenges, but it is incredibly powerful, and we have invested significant effort to customize it into the solution we now have in place. We started out on an industry-standard cloud-based tool, but it wasn't able to work with our workload model. With Visual Studio, we can implement a new workload model simply by lining it up with a CSV file. You can create additional plug-ins to cover scripting and reporting requirements that don't come out of the box, such as pass/fail metrics in reports. One of the main benefits is being able to share tools, techniques, and experience with our development teams within the Visual Studio IDE. The other main consideration for our scripts is using a realistic spread of data, specified in our workload model, to get realistic cache behavior and repeatability. The next component is load test scenarios. Scenarios are designed to answer these key questions: as a developer, how will my build perform in production when I release it? How will our website perform? How much load can our end-to-end solution support? What will we not be able to scale ourselves out of? And what happens if such-and-such a component fails, and how will that affect the customer journey? To make sure everyone understands the availability of the environment, we have a schedule to cover all of these scenarios. Four days a week we run a normal-day load test; this is for deployment rehearsals if teams need them.
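A CSV-driven workload model of the kind described above boils down to weighted journey selection. Here is a minimal sketch; the journey names and weights are invented for illustration, not ASOS's actual mix:

```python
import random

# Hypothetical journey mix in the style of a CSV-driven workload model.
WORKLOAD_MODEL = [
    ("browse_category", 55),   # weight ~ share of traffic
    ("search_to_product", 30),
    ("add_to_bag", 10),
    ("checkout", 5),
]

def pick_journey(rng: random.Random) -> str:
    """Select the next virtual user's journey with probability
    proportional to its weight in the model."""
    journeys, weights = zip(*WORKLOAD_MODEL)
    return rng.choices(journeys, weights=weights, k=1)[0]
```

Run over thousands of virtual users, this reproduces the traffic shape from analytics, which is what makes the test's cache behavior and downstream call ratios realistic.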
With powerful monitoring capability alongside the Visual Studio metrics, teams can deploy a release to our test environment and immediately get feedback on any potential impact to our customers; I'll talk more about this in the monitoring solution slides. Every week we run a full peak load test to ensure that none of the smaller changes going to production have impacted our ability to support the next Black Friday. And then once a month we run a scalability test, up to double our expected peak load; this is to find the points in the solution that we can't scale ourselves out of, and it gives us plenty of time to resolve future issues. There is not much point going beyond that load, because with the level of change that we have, by the time we get to those volumes the solutions would have changed significantly. We then have a schedule of prioritized resiliency scenarios that we work through nightly; I'll go into this in more detail in the next few slides. On top of our normal schedule, we sometimes have to accommodate larger-scale changes in the test environment. In an ideal world all of our changes would be small and incremental, but we can't always follow that model, so we have to handle these on an ad hoc basis like everything else. We tend to keep them to a minimum and isolate them as much as possible. Let's look at resiliency testing in more detail. I will follow up on the earlier example in our customer journey. In a normal scenario, the customer arrives at the product page, selects the product, size, and color, and clicks to add to bag. The bag service makes the call to reserve that stock; then the customer continues shopping, goes to checkout, and places the order. In the order workflow, the order service places a message on a queue for the stock API to commit the stock for the order. Now, the stock database only has one active region; if we depended on a synchronous stock call, during an outage no customers would be able to add stock to their bag, meaning we would not be able to take orders.
In the event of a stock database issue or stock API issue, we have implemented two mechanisms. If the initial stock reservation call fails, whether it times out or returns some other error, the bag service will retry once; if that fails, the customer journey will respond with a successful call, but without a reservation. This does not interrupt the customer experience, but the item may not be available by the time the order is placed and the database is back online; that would result in the cancellation of that part of the order, but it is far better than sending our customers away. Second, we have an operational lever: if the stock service is struggling to service requests, we can initiate a kill switch so the bag service stops calling it. This avoids punishing an already struggling service with retries. There is a series of tests that need to happen every so often to ensure we remain operational at all times, but especially at peak, when what can go wrong will go wrong. Our approach is to plan out the steps in detail with the expected results, get the development teams to support and conduct the scenario, and note down the actual behavior, the components involved, and any additional observations. The added benefit of this approach is that it ensures our teams are familiar with support procedures. I am going to use the next few slides to quickly illustrate how we measure the impact of these test steps. Further along our roadmap is a chaos approach: this will involve disabling or degrading components of the solution while at load and at random, rehearsing our ability to notice, respond to alerts, and resolve the underlying cause. So this is a screen grab from our monitoring solution during the stock resiliency test I have just described. Here we can see the exceptions on the top right and execution time bottom left. You can see the stock database failure was initiated at 12:15: exceptions start rising in the bag and orders APIs, and in the bottom right we see slow request execution times within the stock service, going up to a minute.
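The two protection mechanisms just described, one retry followed by a degraded success, plus an operational kill switch, can be sketched together. This is an illustrative reconstruction, not ASOS's code; `StockClient`, `reserve_fn`, and the simulated outage are all hypothetical:

```python
class StockClient:
    """Sketch of the bag service's stock-reservation protections:
    one retry on failure, then succeed without a reservation; plus a
    kill switch that skips the stock call entirely."""

    def __init__(self, reserve_fn, kill_switch: bool = False):
        self.reserve_fn = reserve_fn    # call into the stock service
        self.kill_switch = kill_switch  # operational lever

    def reserve(self, sku: str) -> dict:
        if self.kill_switch:
            # Don't hammer a struggling dependency with calls/retries.
            return {"sku": sku, "reserved": False}
        for _attempt in range(2):       # initial call + one retry
            try:
                self.reserve_fn(sku)
                return {"sku": sku, "reserved": True}
            except TimeoutError:
                continue
        # Degraded mode: let the customer carry on; that part of the
        # order may be cancelled later if the stock is unavailable.
        return {"sku": sku, "reserved": False}

def stock_service_down(sku: str) -> None:
    """Simulates the stock service being offline."""
    raise TimeoutError
```

Note that both paths return success to the caller; the degradation is invisible to the customer until (and unless) the order is later trimmed.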
And then the kill switch was initiated in the bag service at 12:30: in the box on the left we see a drop in calls to the stock API, and top right a drop in exceptions in bag and orders. This is all as expected. The next slide gives a representation of our customer experience, showing transaction failures at the top and response times at the bottom, as far as the test users are concerned. We can see low failure rates throughout. We see a jump in response times, up to 500 milliseconds, while the stock database is offline and the retry is being made from the bag API, and then we see the recovery once the kill switch has been triggered. In the portal we have demonstrated that our solution is resilient in these situations and the customer journey is largely unaffected. I have added this slide to demonstrate how we can overlay deployments on our other graphs. Each of the red lines shows a deployment; you can highlight one and see the updated build version. What this demonstrates is the rate of change that we have going into the environment: we have to maintain the release path even while we're running a critical scenario. We can see that measuring our customer experience and analyzing our system behavior, utilization, and delay gives us visibility of the effects of the load itself, of deployments, and of component failure. If we didn't have this monitoring capability, we would have no feedback on how well the system was coping, and we would not be able to identify scale, utilization, and performance issues. When we launched the replatformed website, we implemented a search and analytics solution for web application logs; we use these to help us investigate specific application errors when we run at load, and also to gather output statistics for the workload model. It's important for us to have a view of end-user performance across the entire user journey and tool set. Where we can, we push all metrics into a single store and use that data.
This includes our Visual Studio test metrics, application and infrastructure metrics from our cloud services and databases, as well as deployment information. Teams are gradually migrating to Application Insights and working towards consolidating across non-production and production platforms, using Insights and dashboards all configured through our deployment pipeline. That covers our end-to-end test capability at quite a high level. Since we started replatforming the website, my team and I have been working with the engineering teams to develop performance engineering capability within each one. What does this mean, and why are we trying to achieve it? We can't support such a high level of change if performance testing sits only on a path that is shared by so many teams: our environment would have very low availability, would support a low rate of change, and shared performance issues would be costly to fix. Non-functional testing must be made as important as functional testing. Each of our engineering teams is responsible for the full solution life cycle: performance and resiliency, as well as monitoring and alerting, need to be considered throughout the life cycle, in requirements and design, through development and test, and into operations. We want teams to be empowered to completely test their solutions from all angles. We are embedding strong performance principles into early testing. This needs the same fundamentals as before: a good environment that's representative of what they need, a workload model that represents the load they will have in production, scripts and scenarios that follow that model, and a monitoring solution to give feedback. What we're trying to achieve in our test environment is early feedback on the performance and scalability of new components, feedback on degradation of existing components, and having engineering teams implementing and owning performance at service level with minimal reliance on a central function.
We have a way to go on that journey, but we are making significant progress and are actively onboarding teams. So what is the goal, and what does this look like? Where it involves new components, teams have to consider: what load is expected at each endpoint? What are the target response times, and what are the availability requirements? We design for resiliency and performance; for example, we want to ensure that calls to services or third parties are efficient, with minimal chatter and minimal message size. We have to consider regional aspects: how many regions should we run in, how do we maintain integrity during an outage and manage the customer journey, and what logging does the team need to address production issues? We ship as soon as possible, implementing new features behind flags where necessary, and implement the full pipeline at the beginning, with performance testing embedded. When we're looking at what it looks like in terms of the pipeline: an engineer makes a code change, compiles, and gets feedback on performance concerns from a static analysis tool, then tests and checks in. On the build server, the unit tests run again and the build is deployed to the test environment. In the test environment, as well as the functional tests, performance tests are run at the service level, scoped to that service. This will be at full peak load on full production scaling; it must identify performance degradation as well as outright failure. The same tests will also be executed overnight, but likely at peak load and for an hour or longer. When determining whether the build can proceed through the path to live, the results of the latest test run, the build run, and the validation run determine the pass or fail status. A fail will require manual intervention and investigation. Monitoring and feedback from production then feed back into requirements. We're refining this approach with the teams to maintain a continuous delivery model, and we can roll it out to the teams as they become ready.
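The pass/fail gate at the end of that pipeline can be sketched as a small decision function. This is a hypothetical illustration, not ASOS's gating logic; the threshold values echo the 200 ms figure mentioned later in the talk but are otherwise invented:

```python
def gate(latest_run: dict, max_p95_ms: float = 200.0,
         max_error_rate: float = 0.01) -> str:
    """Decide whether a build may proceed towards production based on
    the latest performance run. Thresholds are illustrative only."""
    if (latest_run["p95_ms"] > max_p95_ms
            or latest_run["error_rate"] > max_error_rate):
        return "fail: manual investigation required"
    return "pass"
```

A "fail" here doesn't auto-reject the build; as the talk says, it routes the result to a human for investigation.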
Let's look at the steps of the journey and see where we are now. Our approach has been to gradually shift our load testing efforts left in the pipeline, giving teams earlier feedback on performance. The first step is for teams to run overnight performance tests against their own component, with its dependencies, in a production-like environment. This is in place for a large portion of our teams now. My team provides training, support, and advice on how to go about it, along with best practice, tips, and tricks. If you have functional testing against an API, it's a short hop to create and run similar performance tests, and you can leverage the value from the tests you already have. Teams can use the workload model for their tests, translating functional tests into Visual Studio web tests that use the same calls but with a realistic set of data and scenarios. They then chart test metrics across runs over time; the example here is response times, but you can look at any other low-level or high-level metrics. The next step in shifting left is to execute these same tests after every build that's deployed to the team's own test environment, which a handful of teams now have in place; this provides fast feedback into the pipeline. One of the key challenges is that interpretation of test results can be quite a heavy manual process and is a bit of a black art. The challenge has been determining what should trigger an alert and what makes a pass or a fail. We configure alerts based on threshold values: for example, sampled response times over 200 milliseconds will result in an alert, or error counts over a configured limit; other metrics can be configured too. We also want to flag performance degradation for a particular run even if it hasn't breached a specific threshold. Tracking performance over time while heavily relying on automated feedback means we need to be able to rely on our metrics, whether they be Visual Studio response times, application errors or warnings, or system utilization metrics. So we analyze the gradient of change over the ten previous runs, and if it falls outside of the allowed zones, it triggers an alert.
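The gradient-over-the-last-ten-runs check described above can be sketched with a least-squares slope. This is my reconstruction of the idea, not ASOS's implementation; the `max_slope` zone boundary is an invented example value:

```python
def slope(values: list) -> float:
    """Least-squares slope of a metric over successive runs
    (x = run index, y = metric value)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def degrading(run_history: list, max_slope: float = 5.0) -> bool:
    """Alert if the trend over the last ten runs climbs faster than
    the allowed zone, even if no absolute threshold was breached."""
    return slope(run_history[-10:]) > max_slope
```

The point of the trend check is to catch a metric that is still under its absolute threshold but creeping upward release after release.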
Please excuse the — we're translating the alert logic into test failures, being careful to avoid false negatives; that allows a fully automated pipeline. I have to point out that this is really just finger-on-the-pulse analysis: it covers basic checks against components. There are scenarios that require more specific tests to be run, including spike and resiliency tests and capacity calculations. And it doesn't replace end-to-end testing, which identifies issues in service integration at load: what may seem like a small increase in your service's load may greatly increase the load on a downstream system. A key goal is to shift left so we can consistently conduct load and resiliency testing as early as possible; this will allow us to achieve confidence through end-to-end testing at multiple levels. That wraps it up for this session. If you want a highly scalable and resilient website that can support a large volume of sales, you have to build these principles into every slice of your development life cycle and into each one of your development teams: test early, test continuously, and test everything that may be thrown at you. Whatever testing you do in the earlier stages, you're never going to get full confidence until you test the full end-to-end solution at scale and try to break it. Thank you very much for your time. We'll hang around here to answer questions at the end if anyone has any.