Performance engineering, the Restaurant analogy




Can you help ‘tune’ this restaurant?





 
Recently, a performance testing COE that I looked after went through a massive expansion, and we had to find a lot of resources in a short time. Along with experienced resources, we also looked at hiring some fresh engineers and training them in performance testing. The question arose as to what criteria we should use in screening the candidates. After some deliberation, I came up with the following.

 
1. High energy and motivated

2. Passion for technology

3. Quest for knowledge

4. Analytical mindset

My able colleagues assisted me in rigorous interviews. I focused on evaluating the analytical mindset of the candidates. There was some disappointment in the process. I was expecting the candidates to floor me with fresh, invigorating ideas and analytical capabilities, but it happened far too rarely for my liking. It’s my conjecture that social media is taking a toll on mental development by taking away free time that we might otherwise spend on reflection and contemplation.

The following problem statement evolved as a way to gauge the analytical capabilities of the candidates. I developed the problem statement and solution a little further for its pedagogical value.

A restaurant serves breakfast to customers who are primarily en route to the office. Though the restaurant is open from 7 AM to 9 AM, most of the customers tend to come between 8:00 and 8:30 AM. They are in a real rush and tend to be impatient. Sometimes they leave while waiting in line, sometimes they leave without picking up their order even after paying, and sometimes they leave meals unfinished. The restaurant, for its part, fails to keep up its service level during that period. The seating area is messy, wrong orders are served, the wash area runs out of soap, etc. The restaurant is not happy about it and calls this ‘candidate’ for consultation and help. Its objective is not to lose any revenue. The candidate is to be paid for one month of work to study the problem and come up with possible solutions.

The problem described is very much akin to measuring and improving the performance of an application under peak load. I expected the candidates being interviewed to showcase their approach and possible solutions to this problem. We later used this analogy to introduce performance testing and tuning concepts in detail, using easy-to-understand real-world issues such as those faced by a busy restaurant.
Approach
There is little difference between the ways performance issues are approached for a restaurant or an application. In this case the inception point is the reporting of a perceived issue. We would first like to start with the business expectation and define the expected restaurant or application service level. I usually tempered the enthusiasm of the candidates I interviewed by stating that before we jumped into various solutions we had to define the problem and the approach. Before proceeding to solutions like adding space or serving staff, it is useful to understand why the problem happens. For all we know, it might be the master chef who wants to taste and smell every plate that goes out, or the pesky manager who wants a report every few minutes from everyone, or the clunky utensils used for cooking. The possibilities are endless.








 
Solution
The table below illustrates the congruence between the issues faced by restaurant owners and software application owners, and helps define the issues, techniques and solutions in everyday terms. In the interest of brevity, I will skim over the verbiage related to performance concepts, as I expect readers to be conversant with them. That said, I have tried to emphasize capturing trade-offs as much as possible, because all tuning exercises are contextual and there is almost always some cost involved. There is much more to performance engineering than what I have managed to capture below, and I intend to refresh this periodically. The software application referred to serves the business processes of application processing, enquiry and reporting.
 
NFR, defect classification and tuning strategies will be compared below.
Application Context
Restaurant Context

Business Objective
At the highest level, there is absolutely no difference in terms of business objective. The following business statement could apply to either the application or the restaurant: ‘Earn daily revenue of, say, $10,000 at 20% margin and maintain a customer satisfaction level over 8.5’.

NFR Formation
Define business expectation

Application: Peak throughput handling capability of 200 applications per minute with given hardware and defined reporting and compliance behavior.
Restaurant: Peak capability to serve 700 customers in 30 minutes with given space, material quantity and quality, staff quantity and quality, user behavior and environment constraints.

Application: Response time under 10 seconds under peak load of 700 applications. Total ‘application’ submission time not to exceed 2 minutes.
Restaurant: Maximum wait period at any counter of less than 30 seconds, and overall waiting period not to exceed 90 seconds.

Application: Additional latency of 2 seconds permitted per member added to an application.
Restaurant: Additional wait period of 10 seconds permitted for every item added to a single order.

Application: Latency reduction of 1 second for applications choosing a pre-defined template.
Restaurant: On average, 10 seconds less waiting period per counter for customers choosing the ‘menu of the day’.

Application: Platform capacity to sustain a peak load of 1500 and throughput of 350 applications per minute.
Restaurant: The restaurant should have the capability to expand later to cater to up to 2500 customers in the peak period of 7:30 to 8 AM, with a minimum of 70 orders delivered per minute.

Application: Availability SLA of 99.999%.
Restaurant: Restaurant’s unpublished serving downtime to be no more than 1 day per 730 days.

Application: Vertical scalability.
Restaurant: Each addition of serving staff and square footage to yield capability to handle an additional 5 customers per minute on average, with defined demand characteristics.

Application: Horizontal scalability.
Restaurant: Acquisition of additional floor space of equivalent size at an adjacent location to yield capability to deliver at least 90% more orders cumulatively and 75% more during peak hour.

Application: Acceptable error rate.
Restaurant: No more than 1 re-order per 100 orders per day.

Application: Trade-off: 2 seconds additional latency permissible for clients accessing the application through a rich internet application interface.
Restaurant: Trade-off: 50% additional wait period permissible for customers of the premium meals area.

Application: Cloud: surge computing requirement.
Restaurant: Restaurant super chain: ability to support 30% additional users over peak load with a maximum of 10% negative variance in performance metrics, based on a tie-up with a larger restaurant to get emergency supplies on a need basis.
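To make the availability and throughput figures above concrete, here is a minimal Python sketch of the arithmetic. The function names are my own, chosen for illustration. Note that the two availability figures are not numerically equivalent: 99.999% leaves only about five minutes of downtime a year, whereas 1 day in 730 works out to roughly 99.86%, so the restaurant figure is best read as an analogy rather than a conversion.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Downtime budget, in minutes, implied by an availability target over a period."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def availability_pct(downtime_days: float, period_days: float) -> float:
    """Availability percentage implied by a downtime allowance over a period."""
    return (1 - downtime_days / period_days) * 100


# Application side: 'five nines' over one year.
five_nines_per_year = allowed_downtime_minutes(99.999, 365)   # about 5.26 minutes

# Restaurant side: 1 day of unpublished downtime per 730 days.
restaurant_availability = availability_pct(1, 730)            # about 99.86 %

# Restaurant peak NFR: 700 customers in 30 minutes, expressed per minute.
peak_customers_per_minute = 700 / 30                          # about 23.3

print(f"{five_nines_per_year:.2f} min/year, "
      f"{restaurant_availability:.2f} %, "
      f"{peak_customers_per_minute:.1f} customers/min")
```

Writing NFRs down as numbers like this makes it easier to spot targets that are far stricter, or far looser, than the business actually needs.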
 
Defect Characterization and root cause analysis
Study and define the problem
Plenty of defects are sporadic and conditional. Defining conditions is necessary to reproduce and analyze the problem and provide the appropriate solution.
High response times or error rates under miscellaneous conditions
Issues faced under the following scenarios:
At restaurant opening
Around periodic activity such as staff change, material replenishment, sanitation, manager audit, adjacent gym closure, office bus departure, etc.
During peak
Around order placement of particular items
Around order placement of a particular quantity
Around order placement from a particular channel, such as a phone order from a group
After a sustained pace of orders, even before the peak period
At worst, even under the lightest load
Application freeze
Sporadically, no order is served for more than 60 seconds. ‘200 orders served uniformly over two minutes’ is not the same as ‘200 orders in one minute’ followed by one minute of freeze. The perception of performance is as important as the performance itself.
High resource utilization
Restaurant staff completely zapped
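The freeze scenario above can be illustrated with a small Python sketch. It assumes, purely for illustration, that we have counts of orders served in fixed 6-second buckets; the function name and the sample data are hypothetical. Both runs serve 200 orders in two minutes, so the average throughput is identical, yet only one of them contains a 60-second freeze.

```python
def longest_stall_seconds(served_per_bucket, bucket_seconds):
    """Length, in seconds, of the longest run of consecutive buckets with nothing served."""
    longest = current = 0
    for served in served_per_bucket:
        if served == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * bucket_seconds


BUCKET_SECONDS = 6

# 200 orders spread uniformly over two minutes (20 buckets of 10 orders).
uniform = [10] * 20

# 200 orders in the first minute, then one minute of freeze.
bursty = [20] * 10 + [0] * 10

assert sum(uniform) == sum(bursty) == 200          # same average throughput

print(longest_stall_seconds(uniform, BUCKET_SECONDS))  # 0
print(longest_stall_seconds(bursty, BUCKET_SECONDS))   # 60
```

An average over the whole test window hides the stall, which is why it helps to look at throughput over short intervals as well as in aggregate.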

 

 

Tuning strategies
Service management strategies
Round trip /IO and interrupts reduction

Catering trolley to contain all items required for a single trip to the tables, so that serving staff do not have to return to the kitchen. Serving staff equipped with knowledge, so that they do not have to return to the chef to answer queries raised by customers. In the kitchen, well-stocked groceries, so that the restaurant does not have to seek replenishment during the peak session. Any inputs needed from the chef are sought early in the morning and not during peak serving hours.

The trade-off is more upfront cost in analysis and training to minimize round trips. There will be greater space utilization in the kitchen and larger trolleys in use.

First things first
At restaurant opening, initialization of the kitchen area first. Separate area or customization of space for the most ordered menu items.

Less frequently ordered items will consequently move a little more slowly.
 
Batching
Take multiple orders from customers before sharing them with the kitchen, though this would work only for the premium dining area, where responsiveness can be traded for quality of service.

The strategy is also applicable at the start of the day during meal preparation: cleaning of raw materials in one batch, then spice mix preparation, etc.
Batching of orders might help during peak time, but early morning customers may perceive a loss of responsiveness as order-taking staff wait for a quorum before approaching the kitchen. Some chefs will rue having to wait until all the common preparations are completed.

Asynchronous design
 
Customers willing to pay now, do some shopping nearby, and come back a while later to pick up their to-go custom orders.
 
Customers may not know how soon to come back. This arrangement is not suitable for the breakfast crowd. It might work for the evening shopping crowd.
Fast path
Most used ingredients placed closest to, or in the center of, the cooking counters so that fetch time is minimized.
 
The effectiveness of this will depend on the ability to find common ingredients. Certain uncommon menu items might take a little longer.
Parallelization
Multiple payment counters, each with an associated serving table and kitchen delivery counter.
 
There might be more empty seats or more serving staff than required in our quest for parallel serving in a ‘shared nothing’ mode. More effort is needed in guiding customers to their line.
Deadlock reduction
Multiple serving trolleys, wash counters, kitchen counters, refrigerators with multiple doors, etc., so that the possibility of lines forming is minimized at all points. Logically combine access: for example, place the complaint and improvement suggestion logs in one file, so that only one customer can access it at any point in time.
 
There might be some apparent wait time where one party could have been allowed to start. For example, the chef could start without waiting for the overnight cleaning to conclude. But that would hold back the cleaning from completing, as more waste would be generated during cooking, and the cooking would also be delayed, as it cannot complete all its steps while some cleaning, such as that of cooking utensils, is still pending.
Pre-fetching
Glasses and tissues placed on tables at the start of the day. Common ingredient mixture prepared and placed at cooking counters ahead of the chef’s arrival.
 
Unless there is proper analysis, we might pre-fetch the wrong quantities, and then going back to get the right item or ingredient might be a little more expensive.
Coupling
Referring to design pattern of matching the interface of object with its most frequent use. Custom spatula to each cooking counter with its own dish type. Trolley dimensions matched to space in alley.
 
This approach would require more upfront analysis. Maintainability would be lost with each counter with its own spatula type.
Resource efficiency. A ‘lean design’ in other words.
This is the most important performance factor. Cannot afford to cook elaborate exotic food items for breakfast. Simple and quick cooking is required.
 
There is little design trade-off in keeping the menu ‘light’ from a performance perspective. But there is a trade-off with business needs. You need some visual or culinary appeal from a customer retention perspective. So sometimes, in spite of the performance impact, an RIA would be chosen.
Connection management optimization
The right number of kitchen windows for incoming and outgoing traffic. Too few will cause queuing. Too many will cause excess window management work.
Conflict reduction
No restaurant employee satisfaction surveys during peak hours. Customer interaction session with respect to food and service quality in off-peak hours.
 
The prerequisite is analysis and a disciplined approach whereby the business cannot get anything it wants at any time. There might be a perceived loss of service level.
Maximize resource utilization
Serving staff picks up plates on way back to kitchen, rather than taking separate trip to pick up left plates. Surround ingredient placement system for chef. Conveyor belt food delivery system, so that chef does not have to get up to deliver and is continuously engaged.
 
Apart from the initial analysis effort, there is the cost of scheduling, i.e. one person, such as the manager, to ensure that all staff are occupied.
Uniform resource utilization (as in load balancing) to ensure low standard deviation of metrics
Manager to monitor utilization and ensure uniform utilization of chef, serving staff and cashiers.
 
Additional monitoring effort required.
Efficient data archival strategy
Old utensils, spatulas, etc. that were retired when we bought new ones may be moved to the backyard store so that they do not clutter the shelf space.

Select cases where reference to old data is needed will experience much higher wait times, if service is provided for them at all. Take the example of a request for a dish that has recently been taken off the menu owing to low demand. If served at all, it will take much more time.
Task segregation amongst threads based on work profile (i.e. a separate rendering thread on the UI, as opposed to the one interacting with the server or user inputs)
In case there is a parking issue, a separate attendant dedicated to helping with parking and directions, as this is a slow-paced activity that cannot be meshed with fast-paced activities such as taking payments.
 
At lighter load levels, it might appear to be a redundant arrangement.
 
Garbage collection optimization for efficiency and frequency
The right table cleanup frequency, since cleanups are ‘stop the serving’ events. Not so delayed that customers look for room to place plates and move things around. Not so frequent that serving staff are cleaning up after every crumb. A protocol to determine the end of a meal, so that serving staff know quickly which plates are to be picked up and which are to be left.
 
The trade-off is between table space, condiment access time, serving staff effort and diners’ acceptable wait time.
Avoid full table scan and scattered disk read
Having serving staff ask all customers if they would like any take-aways may be wasteful. Only approach customers with a particular type and quantity of dish. Possibly seat customers by menu along the tables.

Trade-off will be determined by volume of take-aways happening. Also there will be upfront preceding analysis to determine right ‘hints’ for serving staff to approach customers.
Caching
Place on shelf- cleaning towels, forks, salt and sugar. Right amount, so that the common shelf does not have to be re-stocked every few minutes. The items that do not require regular re-stocking are moved back. Verify there is quick uptake from this counter.
 
There is effort involved in quickly stocking common supplies, ensuring it’s highly utilized and not wasted.
Indexing
Labelled cooking material. A lot of work when placing the raw material with everything labelled and placed in sequence as per usage. Helps in quick fetch during peak hours. Trade-off is that off-peak hours work would increase. More space consumption creating indexed boxes and compartments.

There is more effort involved in placing the ingredients and items with respect to analysis, labelling and re-sorting. There might be tendency to over-analyze or over-compartmentalize beyond the common access needs
Wait reduction
Theoretically, this is about I/O management. In the restaurant context, this is about setting the right service frequency, resourcing levels and raw material distribution. First, there is wait analysis to determine which activity, such as cooking, cleaning, serving or payment collection, experiences the longest wait periods. The solutions include choosing the right work unit size, for example how many chairs and tables to clean in one go, or how many plates to arrange in one go. With a small work unit size, contention for space or time will be reduced. With a large size there may be a productivity gain, but there will be overheads, such as everyone having to clear off the delivery area when a chef chooses to deliver large order quantities in one go. Multiple order serving counters, making commonly required ingredients such as butter available at serving counters, and a right-sized sorting area to separate utensils for different dishwashers are a few steps that can help with wait reduction.

There could be extra resource deployment to alleviate waits. This might also induce changes to the way restaurant works and would require wholesale changes as wait is an outcome of interaction of chefs with serving with cleaning and payment collection staff and is a function of load, productivity levels etc. Changes to menu, staff, and customer behavior would induce re-assessment of the strategy.
Lazy loading (in constrained environments)
Assume only some customers in the premium dining area want their meals heavily embellished and decorated. Get the ingredients, but embellish only when asked, and enquire after a few embellishments whether that would suffice. If all customers need all embellishments, then this would not be of any help.

Overall time might increase if this is applied to a scenario where most customers need most embellishments anyway. The trade-off is that customers who usually desire all embellishments will be on the losing side, but those with few embellishment requirements will gain.
Re-use data
Cannot reuse used plates in restaurant like we would do in application to expedite although it would have been a great help. That said, unused tissue, glasses need not be restocked on tables. Wash area towel would be reused.
 
Possible trade-off in restaurant context could be hygiene and in application context transaction integrity.
 
Optimized data set and structures
Spot resistant table covers. Right width and length of tables. Low friction counters where plates are sorted. High friction counters in order serving counters.
 
The structure is highly designed for usage and serving pattern. Would require re-alignment should there be change.
Reduce fetched data size
More fetched sandwiches will help reduce round trips to order placement counter. But carrying 50 sandwiches stacked up like a tower would not be helpful either.  Just right amount of ketchup sachets as would be required in single trip. Picking up more than required in single trip would waste effort and also space on serving table.
 
Unless well analyzed there could be scenarios whereby serving staff might have to undertake more trips back to order pickup counter.
Supplement Resource
Obviously more chefs, serving, cleaning and payment collection staff may help if analysis shows shortages of one of them causing delays.
 
There could be a higher cost associated with supplemental resources. A reasonable delay, say 1 minute in the context of a restaurant, might be OK, unless the restaurant’s USP is to be the fastest-serving restaurant.
Limit performance-inefficient operations: unions, outer join queries, code refactoring, etc.
Limit unique or uncommon cooking styles, such as depending on intermediate output from another counter, or mixing material late in the interest of convenience in a way that causes too much labor afterwards.
 
Trade-off could be convenience. Peak hours cooking practice may not be useful for off-peak hours and certain relaxation of standards may not hurt.
Specialized hardware
Having chefs who specialize only in particular dishes. Separate hiring and training of cleaning and serving staff.
 
There could be extra cost associated with the above, and more points of failure, i.e. in case of disruptions, performance would be heavily impacted.
In memory database
One large trolley which can carry a lot of material and can be efficiently managed by serving staff at same time. Almost like carrying entire kitchen on that trolley.
 
There will be cost element of this trolley and re-designing the operations to take advantage of this large sized trolley.
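The batching trade-off described in the table can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not a measurement: orders are sent to the kitchen only when a batch is full, arrival times are given in seconds, and the function name and sample arrival patterns are hypothetical.

```python
from statistics import mean


def batching_waits(arrival_times, batch_size):
    """Seconds each order waits before its batch is handed to the kitchen.

    Assumption: a batch is dispatched when its last order arrives; a trailing
    partial batch is dispatched when its own last order arrives.
    """
    waits = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        dispatch_time = batch[-1]
        waits.extend(dispatch_time - arrived for arrived in batch)
    return waits


peak_arrivals = [0, 5, 10, 15]          # one order every 5 seconds
off_peak_arrivals = [0, 120, 240, 360]  # one order every 2 minutes

peak_wait = mean(batching_waits(peak_arrivals, batch_size=4))
off_peak_wait = mean(batching_waits(off_peak_arrivals, batch_size=4))

print(peak_wait)      # 7.5 seconds on average, for one kitchen trip instead of four
print(off_peak_wait)  # 180 seconds on average, for the same saving in trips
```

The saving in round trips is the same in both cases, but the early morning customer pays far more for it in waiting time, which is the loss of responsiveness noted against batching above.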

 




All tuning is contextual. Some establishments might appear disorganized and yet maintain their speed of service, while others, in spite of apparent high compartmentalization and structure, might be tardy. There is a lot of performance tuning that goes on in real life. I watch with interest as I see many performance engineering concepts applied at photocopier, juice or laundry stations. When I started compiling this restaurant-application performance analogy, I began with software performance concepts and looked for their implementation in the restaurant’s context. By the end of the exercise, I was looking for the equivalent of restaurant performance tuning measures in the application context. As an example, a restaurant could resort to opening a ‘cash and carry’ counter when loaded beyond capacity. An application could similarly redirect traffic or throttle service when user count or resource utilization exceeds a predefined threshold. There is perhaps also a lot of tuning in nature, where thousands of leaves are placed over one another such that each receives adequate sunshine, or in ant hills, with efficient coordination among thousands of ants.
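The throttling idea mentioned above, turning away or redirecting work once a predefined threshold is exceeded, can be sketched as a simple admission controller. This is a minimal single-threaded illustration under my own assumptions; the class and method names are hypothetical, and a real application would also need thread safety and a policy for what happens to rejected requests.

```python
class AdmissionController:
    """Admits work only while the number of active users is below a threshold."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = 0

    def try_admit(self):
        """Return True and count the user in, or False if the threshold is reached."""
        if self.active >= self.max_active:
            return False  # caller may redirect, queue, or show a 'come back later' page
        self.active += 1
        return True

    def release(self):
        """Mark one admitted user as finished, freeing a slot."""
        if self.active > 0:
            self.active -= 1


counter = AdmissionController(max_active=2)
print(counter.try_admit())  # True
print(counter.try_admit())  # True
print(counter.try_admit())  # False, the third customer is sent to 'cash and carry'
counter.release()
print(counter.try_admit())  # True, a slot has opened up
```

As with the ‘cash and carry’ counter, the point is to protect the service level of customers already being served rather than degrade it for everyone.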

To end this discourse on a mirthful note, I would like to share a few interesting folksy comments that I have come across from various stakeholders sharing their ‘emotions’ about the state of performance. “It’s dead.” “We slapped the server hard.” “It’s backing up all the way.” “It’s so fast, results are going to burst out of the screen.” “We ratcheted up the load and hung the server dry.” “We threw a monster of a machine at it.” My favorite is the hyperbole that usually comes after many weeks of testing and profiling: ‘there is something seriously wrong with this system’.
 
 
 
 

Comments

Vaidya Sumedh said…
Nice explanation and perfect case considered. Cheers!
Anonymous said…
Hi Vikram, I enjoyed reading the whole model. A couple of observations for fine tuning:
1/ What is the minimum and maximum number of counters available in off-peak and peak time? (Here we are dealing with 2 variables: order placement area and order delivery.)
2/ Will transactions be in cash or via card? (Assuming that the order number will be generated once the bill transaction is accepted! Otherwise the latency will grow.)
3/ I assume there is a trolley packing area, and that trolleys will be packed asynchronously by order packing management with the customer order number. Will that be assigned to a deliverer? If yes, in that case the thread is occupied and the dependency is greater, in order to deliver and come back to the station!
cheers
waz
Vikram Chandna said…
Hi Waz, great thoughts, particularly those related to async tasks in the trolley packing area. I was struggling to think of async activities in a restaurant. But now I think most real-life activities are async. The counters would be the equivalent of threads, allocated as and when more are needed. Regarding cash versus credit, that is where I think business need overrules that of performance. Both options need to be given. However, there is an interesting choice: do we separate cash counters from credit counters? For sure there will be a productivity gain, but then it might lead to load imbalance and thus to a performance impact. Regards,
Vikram Chandna said…
Vaidya, Waz and many others have left kind words for the article. There was a reference to poor hiring practices from my good friend Dmitry, on how some professionals fail to know the basics.

My viewpoint on this is that I respect the effort put in by most performance testers in very constrained environments, regardless of skill level. Over a period of time, most do develop a bag of tricks that carries them through projects. So troubleshooting issues like application crashes during recording, managing encryption, correlating large numbers of dynamic values, etc. are all appreciated.

The part that I have focused on is analytical skills. Core skills like pattern recognition, inductive reasoning, hypothesis development and design of experiments are difficult to teach. After I established the analogy, I ran it past one of my experienced resources. I was gobsmacked by his scanty analysis. Now, he did reasonably OK in his projects, and I guess that kept the lights on, but it does create pockets of analytical vacuum that someone else has to fill.

I recall working, many years ago, with a 'crack' performance testing team as a short-term contractor. There were numbers that made no sense, as most were well below 1 second. Those were the days of heavy EJBs; I had never seen numbers like that, and I made a lot of noise about investigating them further. The performance testing team sure was very talented and had a good number of scenarios covered, with production-like data, environment, etc. Yet the basic inquisitive tendencies and analytical bent were missing. The article I wrote is just one more attempt to take a fresh look at performance concepts and to view and analyze them through the prism of an analogy.
Birat said…
Hi Vikram, I must say that you have explained the most difficult concepts of performance testing in the easiest possible way, by taking this example of a restaurant.
A few points to add from my side:
1) Fat client concept: making the client do some trivial jobs for the server, thus reducing server load. In the restaurant case, it would be placing a soda machine, sauce counters, vending machines, etc., like a self-service.
2) Using a different interface to reduce load on one interface: e.g. using mobile/IVR to pre-order and pay for the breakfast, so that the customer just needs to pick up the parcel. This may work in our case, as customers usually tend to order a similar menu, and the restaurant will also know how much food/inventory it needs to prepare by seeing the order quantity for the day.
3) Localization of the application to reduce latency: people from one location should not travel 5-10 miles just to get breakfast; instead an outlet can be opened in their area if the load is usually high from that area. We may investigate further why the peak hour is at 8 - 8:30.
So this will increase revenue, as one restaurant has a maximum limit on serving customers even if we keep on increasing the capacity.
4) Garbage collection: you already touched on this point, but just to add: minor GC should be given more priority and should be more frequent, and major GC should be avoided as much as possible during peak hour, e.g. cleaning of plates/tables vs cleaning the whole kitchen/floor.
5) Sequential jobs replaced by parallel jobs and resource sharing,
e.g. Subway is an example of a sequential job (select bread->meat->veggies etc.) and McD is an example of parallel, as orders are taken and made in parallel.
6) Batch jobs, which you already touched on: e.g. things like fries can be made in bulk and just kept heated for serving to many customers; as another example, items kept on counters can be pre-made.
7) Bumping up the resources which are more utilized than others: e.g. increasing the number of order/billing counters will yield less improvement (as it takes 15 sec to take an order but may take 2-3 min to prepare it) than increasing the work stations of chefs and the number of chefs. In application terms, most of the work and impact on response time comes from the app and DB servers rather than the web server (whose job is to route requests).

There are many more ideas/points in my mind but will save it for the day when we meet :)
Vikram Chandna said…
Birat, some very good points. Adding my analysis to the same, as I do want to consider the trade-offs for all solutions.

1. Fat client is intuitively helpful. It could help with wait minimization and resource efficiency. Where it could hurt is with deadlocks (the soda counter could become the new bottleneck), impeding parallelization, or impacting resource availability if a space crunch is one of the current issues the restaurant faces. Generally speaking, I am inclined towards a completely parallel or nothing-shared model, where I want all customers to come to their seats as quickly as possible and head straight to the exit from there. Customers roaming around would introduce new points of serialization (at these places), interrupts (spilled soda, or interference with staff and serving trolley movement), round trips (back and forth to the table), etc.

Pre-order, which is a little like pre-fetching, is a great idea and will work well if orders are placed in time. Ordering at peak time doesn't help. I would think adding a channel is not in itself the help, because in this case the channel is not the bottleneck. The trade-off is that we have spread the work from peak to off-peak hours. We have to know how much spread would be achieved so that we can resource up accordingly. This impacts the working model of the rest of the restaurant, as it would need a new kind of staff and new queue management for those coming to pick up, again, possibly in short bursts.

Localization, I would assume, refers to distributed computing. I would want to think about cost. Would we have full-service branches with all options at all places? It would depend on the usage pattern. Consider Anakami, providing edge caching. It works for corporates that mostly have static pages. Dynamic traffic would still have to be routed back to corporate servers. So if cookies and pretzels are all we want to sell at remote counters, then by all means. Though remember, reducing customer travel time wasn't a business objective for this exercise. We could probably achieve a similar result by acquiring adjacent space at the current location, and maybe do even better if there are any synergies.

The Subway sequential job is a case of business taking priority over performance. Performance does induce mass production and may be de-humanizing, with conveyor belts, tokens, etc. There is no reason why Subway could not build multiple parallel rows, but there is too much cost associated. Their business model is that customers can wait for service, and they do.

With batches I was pointing to the other extreme, where you would cause a decrease in responsiveness. Consider when french fries run out during off-peak and you want to have some orders before you cook again. Sure, it is efficient, but time-conscious customers who come early in off-peak hours hoping to get quick service would be disappointed.

More resources should help if they are indeed the bottleneck.

Again no one answer as it is always contextual.
Vikram Chandna said…
Sorry a typo above, I was referring to 'Akamai' edge caching