Performance engineering: the restaurant analogy
Can you help ‘tune’ this restaurant?
Recently, a performance testing COE I looked after went through a massive expansion and we had to find a lot of resources in a short time. Along with experienced resources, we also looked at bringing in some fresh engineers and training them in performance testing. The question arose as to what criteria we should use to screen the candidates. After some deliberation I came up with the following.
1. High energy and motivation
2. Passion for technology
3. Quest for knowledge
4. Analytical mindset
The following problem statement evolved as a way to gauge the analytical capabilities of the candidates. I later developed the problem statement and solution a little further for its pedagogical value.
A restaurant caters breakfast to customers who are primarily en route to the office. Though the restaurant is open from 7 AM to 9 AM, most of the customers tend to come in the 8:00 to 8:30 AM time frame. They are in a real rush and tend to be impatient. Sometimes they leave while waiting in line, sometimes they leave without picking up their order even after paying, and sometimes they leave meals unfinished. On the restaurant's side, it fails to keep up its service level during that period: the seating area is messy, wrong orders are served, the wash area runs out of soap, and so on. The restaurant is not happy about this and calls in the 'candidate' for consultation and help. The objective is not to lose any revenue. The candidate is to be paid for one month of work to study the problem and come up with possible solutions.
The described problem is very much akin to measuring and improving the performance of an application under peak load. I expected the candidates being interviewed to showcase their approach and possible solutions to the aforesaid problem. We later used this analogy to introduce performance testing and tuning concepts in detail, using easy-to-understand real-world issues such as those faced by a busy restaurant.
Approach
There is little difference between the ways performance issues are approached for a restaurant and for an application. In this case the inception point is the reporting of a perceived issue. We would first like to start with the business expectation and define the expected restaurant or application service level. I usually tempered the enthusiasm of the candidates I interviewed by stating that before we jumped into various solutions we had to define the problem and the approach. Before proceeding to solutions like adding space or serving staff, it is useful to understand why the problem happens in the first place. For all we know it might be the master chef who wants to taste and smell every plate that goes out, or the pesky manager who wants a report every few minutes from everyone, or the clunky utensils used for cooking. The possibilities are endless.
Solution
The comparison below illustrates the congruence between the issues faced by restaurant owners and software application owners, and helps define the issues, techniques and solutions in everyday terminology. In the interest of brevity I will skim the verbiage related to performance concepts, as I expect distinguished readers to be conversant with the illustrated concepts. That said, I have tried to capture trade-offs as much as possible, because all tuning exercises are contextual and there is almost always some cost. There is much more to performance engineering than what I have managed to capture below; I do intend to refresh this periodically. The referenced software application serves the business processes of application processing, enquiry and reporting. NFRs, defect classification and tuning strategies are compared below.
Application context vs. restaurant context

Business objective
At the highest level, there is absolutely no difference in terms of business objective. The following business statement could apply equally to the application or the restaurant: 'Earn daily revenue of, say, $10,000 at 20% margin and maintain a customer satisfaction level over 8.5.'
NFR formation (restaurant: define the business expectation)

Application: peak throughput capability of 200 applications per minute with the given hardware and defined reporting and compliance behavior.
Restaurant: peak capability to serve 700 customers in 30 minutes with the given space, material quantity and quality, working resource quantity and quality, user behavior and environment constraints.

Application: response time under 10 seconds under a peak load of 700 applications; total 'application' submission time not to exceed 2 minutes.
Restaurant: maximum wait at any counter of less than 30 seconds, and overall waiting period not to exceed 90 seconds.

Application: additional latency allowance of 2 seconds per member added to an application.
Restaurant: additional wait of 10 seconds permissible for every item added to a single order.

Application: latency reduction of 1 second for applications choosing a pre-defined template.
Restaurant: on average 10 seconds less waiting per counter for customers choosing the 'menu of the day'.

Application: platform capacity to sustain a peak load of 1,500 and a throughput of 350 applications per minute.
Restaurant: capability to expand later to cater for up to 2,500 customers in the 7:30 to 8:00 AM peak period, with a minimum of 70 orders delivered per minute.

Application: availability SLA of 99.999%.
Restaurant: unpublished serving downtime of no more than 1 day per 730 days.

Application: vertical scalability.
Restaurant: each serving staff member and square foot added yields the capability to handle on average 5 additional customers per minute with the defined demand characteristic.

Application: horizontal scalability.
Restaurant: acquisition of additional floor space of equivalent size at an adjacent location yields the capability to deliver at least 90% more orders cumulatively and 75% more during the peak hour.

Application: acceptable error rate.
Restaurant: no more than 1 re-order per 100 orders per day.

Application trade-off: 2 seconds of additional latency permissible for clients accessing the application through a rich internet application (RIA) interface.
Restaurant trade-off: 50% additional wait permissible for customers of the premium meals area.

Application (cloud): surge computing requirement.
Restaurant (super chain): ability to support 30% additional customers over peak load with at most a 10% negative variance in performance metrics, based on a tie-up with a larger restaurant to get emergency supplies on a need basis.
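On the application side, NFRs like these only bite if they are checked mechanically against load test results. A minimal, hypothetical sketch of such a check in Java; the class, record and thresholds simply mirror the illustrative figures above and do not come from any specific tool:

```java
import java.util.List;

// Hypothetical post-run check against the illustrative NFRs above.
public class NfrCheck {

    // Simplified result of one load test interval.
    record IntervalResult(double throughputPerMinute, double p95ResponseSeconds, double errorRate) {}

    static boolean meetsNfr(IntervalResult r) {
        return r.throughputPerMinute() >= 200      // peak throughput: 200 applications/minute
            && r.p95ResponseSeconds() <= 10.0      // response time under 10 seconds at peak
            && r.errorRate() <= 0.01;              // no more than 1 re-order per 100 orders
    }

    public static void main(String[] args) {
        List<IntervalResult> peakIntervals = List.of(
                new IntervalResult(210, 8.5, 0.004),
                new IntervalResult(195, 11.2, 0.006));
        peakIntervals.forEach(r ->
                System.out.println(r + " -> " + (meetsNfr(r) ? "PASS" : "FAIL")));
    }
}
```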
Defect characterization and root cause analysis (restaurant: study and define the problem)

Plenty of defects are sporadic and conditional. Defining the conditions is necessary to reproduce and analyze the problem and provide the appropriate solution.

Application symptom: high response times or error rates under miscellaneous conditions.
Restaurant: issues faced under scenarios such as:
- at restaurant opening;
- around periodic activity such as a staff change, material replenishment, sanitation, a manager audit, the adjacent gym closing, the office bus departing, etc.;
- during the peak;
- around orders for particular items;
- around orders of a particular quantity;
- around orders from a particular channel, such as a phone order from a group;
- after a sustained pace of orders, even before the peak period;
- at worst, even under the lightest load.

Application symptom: application freeze.
Restaurant: sporadically no order served for more than 60 seconds. 'Two minutes, 200 orders served uniformly' is not the same as 'one minute of 200 orders' followed by one minute of freeze. The perception of performance is as important as the performance itself.

Application symptom: high resource utilization.
Restaurant: restaurant staff completely zapped.
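On the application side, the first analytical step is usually to characterize the symptom from raw measurements rather than jump to causes, for example by computing a high response time percentile and looking for 'freeze' gaps where nothing completed for a while. A rough, self-contained sketch; the sample data and the 60-second freeze threshold are purely illustrative:

```java
import java.util.Arrays;

public class SymptomCharacterization {

    // 95th percentile of response times (nearest-rank method).
    static double p95(double[] responseTimes) {
        double[] sorted = responseTimes.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    // Flag gaps between completion timestamps longer than maxGapSeconds ("no order served").
    static void reportFreezes(long[] completionEpochSeconds, long maxGapSeconds) {
        for (int i = 1; i < completionEpochSeconds.length; i++) {
            long gap = completionEpochSeconds[i] - completionEpochSeconds[i - 1];
            if (gap > maxGapSeconds) {
                System.out.println("Possible freeze: " + gap + "s without a completed order");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("p95 = " + p95(new double[]{1.2, 0.8, 9.7, 2.4, 15.0}));
        reportFreezes(new long[]{0, 10, 20, 95, 100}, 60);
    }
}
```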
Tuning strategies (restaurant: service management strategies)
Round trip / I/O and interrupt reduction
Restaurant: a catering trolley that contains all the items required for a single trip to the tables, so that serving staff do not have to return to the kitchen. Serving staff equipped with enough knowledge that they do not have to go back to the chef to answer customer queries. In the kitchen, well-stocked groceries so that the restaurant does not have to seek reinforcements during the peak session. Any inputs needed from the chef are sought early in the morning, not during the peak serving period.
Trade-off: more upfront cost in analysis and training to minimize round trips, greater space utilization in the kitchen, and larger trolleys.
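In application terms, round trip reduction usually means collapsing many chatty calls into one bulk call. A hedged sketch of the idea; the in-memory catalog below merely stands in for a remote service or database:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: one bulk lookup instead of N chatty round trips.
public class BulkFetch {
    // Stand-in for a remote catalog; in reality each call would cross the network.
    static final Map<String, Double> CATALOG = Map.of("coffee", 2.5, "bagel", 3.0, "juice", 4.0);

    // Chatty version: one "trip to the kitchen" per item.
    static double totalOneByOne(List<String> itemIds) {
        double total = 0;
        for (String id : itemIds) total += CATALOG.get(id);   // imagine a network call here
        return total;
    }

    // Batched version: a single request carries the whole "trolley" of item ids.
    static double totalInOneTrip(List<String> itemIds) {
        Map<String, Double> prices = fetchPrices(itemIds);    // one round trip for everything
        return prices.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    static Map<String, Double> fetchPrices(List<String> ids) {
        Map<String, Double> out = new HashMap<>();
        ids.forEach(id -> out.put(id, CATALOG.get(id)));
        return out;
    }

    public static void main(String[] args) {
        List<String> order = List.of("coffee", "bagel", "juice");
        System.out.println(totalOneByOne(order) + " vs " + totalInOneTrip(order));
    }
}
```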
First things first
Restaurant: at restaurant opening, initialize the kitchen area first. A separate area, or customization of space, for the most-ordered menu items.
Trade-off: lesser-ordered items will consequently move a little more slowly.

Batching
Restaurant: see the batch cooking example in the comments below, such as cooking fries in bulk during a lull and keeping them warm.

Asynchronous design
Restaurant: customers willing to pay now, do some shopping nearby, and come back in a while to pick up their to-go custom orders.
Trade-off: customers may not know how soon to come back. This arrangement is not suitable for the breakfast crowd; it might have worked for an evening shopping crowd.
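The software counterpart is to accept the request, return immediately, and let the caller collect the result later. A minimal sketch using a thread pool and CompletableFuture, with the preparation delay simulated:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncOrder {
    public static void main(String[] args) throws Exception {
        ExecutorService kitchen = Executors.newFixedThreadPool(2);

        // Take payment now, prepare the order in the background.
        CompletableFuture<String> order = CompletableFuture.supplyAsync(() -> {
            sleep(500);                       // simulated preparation time
            return "to-go order #42 ready";
        }, kitchen);

        System.out.println("Customer leaves to shop nearby...");
        System.out.println(order.get(2, TimeUnit.SECONDS));   // customer returns and picks up
        kitchen.shutdown();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```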
Fast path
Restaurant: the most-used ingredients placed closest to, or at the center of, the cooking counters so that fetch time is minimized.
Trade-off: effectiveness depends on the ability to identify common ingredients; certain uncommon menu items might take a little longer.

Parallelization
Restaurant: multiple payment counters, each with an associated serving table and kitchen delivery counter.
Trade-off: there might be more empty seats, or more serving staff than strictly required, in the quest to serve in parallel in a 'shared nothing' mode. More effort goes into guiding customers to their line.
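In code, parallelization typically means several independent workers draining the same queue of requests with as little shared state as possible. A small sketch with a fixed thread pool; the counter and order counts are arbitrary:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class ParallelCounters {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService counters = Executors.newFixedThreadPool(4);  // four "payment counters"

        IntStream.rangeClosed(1, 20).forEach(orderId ->
                counters.submit(() ->
                        System.out.println(Thread.currentThread().getName()
                                + " processed order " + orderId)));

        counters.shutdown();
        counters.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```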
Deadlock reduction
Restaurant: multiple serving trolleys, wash counters, kitchen counters, a refrigerator with multiple doors, and so on, so that the possibility of line formation is minimized at all points. Where access must be exclusive, combine it logically: keep the complaint and improvement suggestion logs in one file, so that only one customer can access it at any point in time.
Trade-off: there might be some apparent wait where we could have allowed one party to start early. For example, the chef could start without waiting for the overnight cleaning to conclude, but that would hold back cleaning from completing, since more waste would be generated during cooking, and cooking would also be delayed because it cannot finish while some cleaning, such as of the cooking utensils, is still pending.
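A common software counterpart is to acquire shared resources in one global order, so two workers can never each hold one resource while waiting for the other's. A minimal sketch; the resource names are illustrative:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    // Always acquire in this order: trolley, then washCounter.
    static final ReentrantLock trolley = new ReentrantLock();
    static final ReentrantLock washCounter = new ReentrantLock();

    static void useBoth(String worker) {
        trolley.lock();
        try {
            washCounter.lock();
            try {
                System.out.println(worker + " using trolley and wash counter");
            } finally {
                washCounter.unlock();
            }
        } finally {
            trolley.unlock();
        }
    }

    public static void main(String[] args) {
        new Thread(() -> useBoth("staff-1")).start();
        new Thread(() -> useBoth("staff-2")).start();
    }
}
```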
Pre-fetching
Restaurant: glasses and tissues placed on the tables at the start of the day. A common ingredient mixture prepared and placed at the cooking counters ahead of the chef's arrival.
Trade-off: without proper analysis we might pre-fetch the wrong quantities, and going back to get the right item or ingredient might then be somewhat more expensive.
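The application analogue is warming caches or precomputing expensive data before the first request arrives instead of on demand. A rough sketch, where loadFromSlowStore is a placeholder for any expensive lookup:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrefetchCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Called at startup ("before the chef arrives"), not during the peak.
    void warmUp(List<String> popularKeys) {
        popularKeys.forEach(k -> cache.put(k, loadFromSlowStore(k)));
    }

    String get(String key) {
        // Falls back to the slow path only for keys we failed to anticipate.
        return cache.computeIfAbsent(key, PrefetchCache::loadFromSlowStore);
    }

    static String loadFromSlowStore(String key) {
        return "value-for-" + key;   // placeholder for an expensive fetch
    }

    public static void main(String[] args) {
        PrefetchCache c = new PrefetchCache();
        c.warmUp(List.of("coffee", "bagel"));
        System.out.println(c.get("coffee"));
    }
}
```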
Coupling
Restaurant: referring to the design pattern of matching the interface of an object to its most frequent use. A custom spatula at each cooking counter for its own dish type. Trolley dimensions matched to the space in the alley.
Trade-off: this approach requires more upfront analysis, and maintainability suffers with each counter having its own spatula type.

Resource efficiency, a 'lean design' in other words
Restaurant: this is the most important performance factor. The restaurant cannot afford to cook elaborate, exotic food items for breakfast; simple and quick cooking is required.
Trade-off: there is little design trade-off in keeping the menu 'light' from a performance perspective, but there is a trade-off with business needs. You need some visual or culinary appeal for customer retention, so sometimes, in spite of the performance impact, an RIA would be chosen.

Connection management optimization
Restaurant: the right number of kitchen windows for ingoing and outgoing traffic. Too few will cause queuing; too many will cause excess window-management work.
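In an application this usually takes the form of a bounded connection pool: enough connections that requests do not queue behind one another, not so many that the server spends its time managing them. A toy sketch built on a blocking queue; a real system would use an established pooling library, and the Conn class here is just a placeholder, not java.sql.Connection:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ToyConnectionPool {
    // Placeholder resource, standing in for a database or HTTP connection.
    static class Conn {
        final int id;
        Conn(int id) { this.id = id; }
    }

    private final BlockingQueue<Conn> idle;

    ToyConnectionPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(new Conn(i));   // fixed number of "kitchen windows"
    }

    Conn acquire() throws InterruptedException { return idle.take(); }  // blocks if all are busy
    void release(Conn c) { idle.offer(c); }

    public static void main(String[] args) throws InterruptedException {
        ToyConnectionPool pool = new ToyConnectionPool(3);
        Conn c = pool.acquire();
        System.out.println("using connection " + c.id);
        pool.release(c);
    }
}
```

Sizing the pool is the same exercise as deciding how many kitchen windows to cut: measured against demand, not guessed.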
Conflict reduction
Restaurant: no restaurant employee satisfaction surveys during peak hours. Customer interaction sessions about food and service quality held in off-peak hours.
Trade-off: the pre-requisite is analysis and a disciplined approach whereby the business cannot get anything anytime it wants; there might be a perceived loss of service level.

Maximize resource utilization
Restaurant: serving staff pick up plates on the way back to the kitchen rather than making a separate trip for leftover plates. A surround ingredient placement system for the chef. A conveyor belt food delivery system, so the chef does not have to get up to deliver and stays continuously engaged.
Trade-off: apart from the initial analysis effort, there is the cost of scheduling, i.e. one person, such as the manager, making sure all staff are occupied.

Uniform resource utilization (as in load balancing) to ensure a low standard deviation of metrics
Restaurant: the manager monitors utilization and ensures uniform utilization of the chef, serving staff and cashiers.
Trade-off: additional monitoring effort is required.
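One simple way to approximate uniform utilization in software is round-robin assignment of incoming work across equivalent workers. A sketch; the worker names are made up:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RoundRobinBalancer {
    private final List<String> workers;
    private final AtomicLong counter = new AtomicLong();

    RoundRobinBalancer(List<String> workers) { this.workers = workers; }

    // Each request goes to the next worker in turn, keeping utilization even.
    String next() {
        int idx = (int) (counter.getAndIncrement() % workers.size());
        return workers.get(idx);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of("chef-1", "chef-2", "chef-3"));
        for (int i = 0; i < 6; i++) System.out.println("order " + i + " -> " + lb.next());
    }
}
```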
Efficient data archival strategy

Task segregation among threads based on work profile (e.g. a separate rendering thread on the UI, as opposed to the one interacting with the server or user inputs)
Restaurant: in case there is a parking issue, a separate attendant dedicated to helping with parking and directions, since this is a slow-paced activity that cannot be meshed with fast-paced activities such as taking payments.
Trade-off: at lighter load levels it might appear to be a redundant arrangement.
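The usual implementation is separate thread pools sized for different work profiles, so the slow-paced work never starves the fast-paced work. A sketch with arbitrary pool sizes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SegregatedPools {
    // Small pool for slow, blocking work; larger pool for quick interactive work.
    static final ExecutorService slowLane = Executors.newFixedThreadPool(1);   // "parking attendant"
    static final ExecutorService fastLane = Executors.newFixedThreadPool(4);   // "payment counters"

    public static void main(String[] args) {
        slowLane.submit(() -> System.out.println("guiding a car into the parking lot..."));
        for (int i = 0; i < 4; i++) {
            int order = i;
            fastLane.submit(() -> System.out.println("payment taken for order " + order));
        }
        slowLane.shutdown();
        fastLane.shutdown();
    }
}
```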
Garbage collection optimization for efficiency and frequency
Restaurant: the right table cleanup frequency, since cleanups are 'stop the serving' events. Not so delayed that diners look for room to place plates and start moving things around; not so frequent that serving staff are cleaning after every particle drops. A protocol to determine the end of a meal, so that serving staff know quickly which plates are to be picked up and which are to be left.
Trade-off: between table space, access time to condiments, serving staff effort and the diners' acceptable wait time.
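On the JVM the corresponding knobs are collector choice, heap sizing and pause-time targets (for example flags such as -Xmx, -XX:+UseG1GC and -XX:MaxGCPauseMillis), and the first step is usually just measuring how often collections run and how long they pause serving. A small sketch using the standard management beans:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Allocate a little short-lived garbage so the counters are non-trivial.
        long junk = 0;
        for (int i = 0; i < 100_000; i++) junk += ("order-" + i).length();
        System.out.println("allocated strings, total length " + junk);

        // How often "table cleanup" ran and how long serving was paused for it.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```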
Avoid full table scans and scattered disk reads

Caching
Restaurant: place cleaning towels, forks, salt and sugar on a shelf, in the right amounts, so that the common shelf does not have to be re-stocked every few minutes. Items that do not require regular re-stocking are moved back. Verify that there is quick uptake from this counter.
Trade-off: there is effort involved in quickly stocking common supplies and in ensuring the shelf is highly utilized and nothing is wasted.
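A minimal in-process cache with an eviction policy can be sketched with LinkedHashMap in access order. The capacity below is arbitrary, and a production system would more likely reach for a dedicated caching library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache: the "common shelf" holds only the most recently used items.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);           // access-order, so reads refresh recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;         // evict the least recently used item
    }

    public static void main(String[] args) {
        LruCache<String, Integer> shelf = new LruCache<>(2);
        shelf.put("salt", 1);
        shelf.put("sugar", 1);
        shelf.get("salt");                // salt becomes most recently used
        shelf.put("ketchup", 1);          // evicts sugar, the least recently used
        System.out.println(shelf.keySet());
    }
}
```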
Indexing

Wait reduction

Lazy loading (in constrained environments)
Restaurant: assume some customers in the premium dining area do not want their meals too embellished and decorated. Get the ingredients, but embellish only when asked, and after a few embellishments enquire whether that would suffice. If all customers need all the embellishments, this is of no help.
Trade-off: overall time might increase if this is applied where most customers need most embellishments anyway. Customers who usually desire all the embellishments will be on the losing side, while those with few embellishment requirements will gain.
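In code, lazy loading defers building an expensive value until, and unless, it is actually requested. A minimal memoizing-supplier sketch; the 'garnish' computation is a placeholder:

```java
import java.util.function.Supplier;

public class Lazy<T> {
    private Supplier<T> supplier;
    private T value;

    public Lazy(Supplier<T> supplier) { this.supplier = supplier; }

    // Builds the value on first use only; later calls reuse it.
    public synchronized T get() {
        if (supplier != null) {
            value = supplier.get();
            supplier = null;
        }
        return value;
    }

    public static void main(String[] args) {
        Lazy<String> garnish = new Lazy<>(() -> {
            System.out.println("preparing elaborate garnish...");   // only runs if asked for
            return "garnish";
        });
        System.out.println("meal served plain");
        System.out.println("customer asked for: " + garnish.get());
    }
}
```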
Re-use data
Restaurant: we cannot reuse used plates in a restaurant the way we would reuse data in an application to expedite things, although it would have been a great help. That said, unused tissues and glasses need not be restocked on tables, and wash area towels are reused.
Trade-off: in the restaurant context the possible cost is hygiene; in the application context, transaction integrity.

Optimized data sets and structures
Restaurant: spot-resistant table covers. The right width and length of tables. Low-friction counters where plates are sorted; high-friction counters where orders are served.
Trade-off: the structure is highly tailored to the usage and serving pattern and would require re-alignment should that change.

Reduce fetched data size
Restaurant: fetching more sandwiches per trip helps reduce round trips to the order placement counter, but carrying 50 sandwiches stacked up like a tower would not be helpful either. Just the right number of ketchup sachets as would be required in a single trip; picking up more than required wastes effort and also space on the serving table.
Trade-off: unless well analyzed, there could be scenarios where serving staff have to make more trips back to the order pickup counter.
Supplement resources
Restaurant: obviously more chefs, serving, cleaning and payment collection staff may help, if analysis shows that a shortage of one of them is causing delays.
Trade-off: there is a higher cost associated with supplemental resources. A reasonable delay, say 1 minute in the context of a restaurant, might be acceptable, unless the restaurant's USP is to be the fastest-serving restaurant.

Limit performance-inefficient operations: unions, outer-join queries, code needing refactoring, etc.
Restaurant: limit unique or uncommon cooking styles, such as those depending on intermediate output from another counter, or mixing material late for convenience in a way that causes too much labor afterwards.
Trade-off: convenience. Peak-hours cooking practice may not be useful for off-peak hours, and a certain relaxation of standards then may not hurt.

Specialized hardware
Restaurant: having chefs who specialize only in particular dishes. Separate hiring and training of cleaning and serving staff.
Trade-off: there could be extra cost associated with this, and more points of failure; in case of disruption, performance would be heavily impacted.
In-memory database
Restaurant: one large trolley which can carry a lot of material and can be efficiently managed by the serving staff at the same time; almost like carrying the entire kitchen on that trolley.
Trade-off: the cost of the trolley itself, and of re-designing the operations to take advantage of it.
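The application-side analogue keeps the working data set entirely in memory so reads never wait on disk, at the cost of memory and of rethinking durability. A toy, map-backed sketch; a real deployment would use an in-memory database or data grid product:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key-value store: the whole "kitchen" rides on one large trolley (RAM).
public class InMemoryStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    void put(String key, String value) { data.put(key, value); }

    Optional<String> get(String key) { return Optional.ofNullable(data.get(key)); }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.put("order:42", "2 bagels, 1 coffee");
        System.out.println(store.get("order:42").orElse("not found"));
    }
}
```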
To end this discourse on a mirthful note, I would like to share a few interesting folksy comments I have come across from various stakeholders sharing their 'emotions' about the state of performance: 'It's dead.' 'We slapped the server hard.' 'It's backing up all the way.' 'It's so fast, results are going to burst out of the screen.' 'We ratcheted up the load and hung the server dry.' 'We threw a monster of a machine at it.' My favorite is the hyperbole, usually after many weeks of testing and profiling: 'there is something seriously wrong with this system.'
Comments
1/ What are the minimum and maximum numbers of counters available in off-peak and peak time? (Here we are dealing with two variables: the order placement area and order delivery.)
2/ Will transactions be in cash or via card? (I assume the order number is generated at the end of the billing transaction; otherwise the latency will grow.)
3/ I assume there is a trolley packing area, packed asynchronously by order packing management against the customer order number. Will that be assigned to a deliverer? If yes, in that case the thread is occupied and the dependency is greater in order to deliver and come back to the station.
cheers
waz
My viewpoint on this is that I respect the effort put in by most performance testers in very constrained environments, regardless of skill level. Over a period of time most do develop a bag of tricks that carries them through projects. So troubleshooting issues like an application crash during recording, managing encryption, correlating a large number of dynamic values and so on are all appreciated.
The part I have focused on is the analytical skills. Core skills like pattern recognition, inductive reasoning, hypothesis development and design of experiments are difficult to teach. After I established the analogy I ran it by one of my experienced resources and was gobsmacked by his shoddy analysis. He did reasonably OK on his projects, and I guess that kept the lights on, but it creates pockets of analytical vacuum that someone else has to fill.
I recall, from many years ago, working with a 'crack' performance testing team as a short-term contractor. There were numbers that made no sense, as most were well below 1 second. Those were the days of heavy EJBs; I had never seen numbers like that and I made a lot of noise about investigating further. The performance testing team was certainly very talented and had a good number of scenarios covered, with production-like data, environment and so on. Yet those basic inquisitive tendencies and that analytical bent were missing. The article I wrote is just one more attempt to take a fresh look at performance concepts and to view and analyze them through the prism of an analogy.
A few points to add from my side:
1) Fat client concept: making the client do some trivial job for the server, thus reducing server load. In the restaurant case, it would be placing a soda machine, sauce counters, vending machines and so on, like self-service.
2) Using a different interface to reduce load on one interface: e.g. using mobile/IVR to pre-order the breakfast and payment, so the customer just needs to pick up the parcel. This may work in our case, as customers tend to order a similar menu usually, and the restaurant will also know how much food/inventory to prepare by looking at the order quantity for the day.
3) Localization of the application to reduce latency: people from one location should not travel 5-10 miles just to get breakfast; instead an outlet can be opened in their area if the load from that area is usually high. We may also investigate further why the peak hour is 8 to 8:30. This will increase revenue, since one restaurant has a maximum limit on customers served even if we keep increasing its capacity.
4) Garbage collection: you already touched on this point, but just to add: minor GC should be given more priority and be more frequent, and major GC should be avoided as much as possible during peak hours, e.g. cleaning of plates/tables vs. cleaning the whole kitchen/floor.
5) Sequential jobs replaced by parallel jobs and resource sharing: Subway is an example of a sequential job (select bread, then meat, then veggies, etc.) and McD is an example of parallel, as orders are taken and made in parallel.
6) Batch jobs, which you already touched on: things like fries can be made in bulk and just kept heated for serving to many customers; another example is items kept on counters that can be pre-made.
7) Bumping up the resource that is more utilized than the others: increasing the number of order/billing counters will bring less improvement (as it takes 15 seconds to take an order but may take 2-3 minutes to prepare it) than increasing the chefs' work stations and the number of chefs. In application terms, most of the work and most of the impact on response time comes from the app and DB servers rather than the web server (whose job is to route requests).
There are many more ideas/points in my mind but I will save them for the day when we meet :)
1. Fat client is intuitively helpful. It could help with wait minimization and resource efficiency. Where it could hurt is with deadlocks (the soda counter could become the new bottleneck), impeding parallelization, or impacting resource availability if a space crunch is one of the current issues the restaurant faces. Generally speaking I am inclined towards a completely parallel, shared-nothing model where I want all customers to get to a seat as quickly as possible and head straight to the exit from there. Customers roaming around would introduce new points of serialization (at those places), interrupts (spilled soda, or interference with staff and serving trolley movement), round trips (back and forth to the table), etc.
2. Pre-ordering, which is a little like pre-fetching, is a great idea and will work well if placed in time; ordering at peak time does not help. I would think adding the channel itself is not the help here, because in this case the channel is not the bottleneck. The trade-off is that we have spread the work from peak to off-peak hours. We have to know how much spread would be achieved so that we can resource up accordingly. It also impacts the working model of the rest of the restaurant, as it would need a different kind of staff and new queue management for those coming to pick up, again possibly in short bursts.
3. Localization, I assume, refers to distributed computing. I would want to think about cost: would we have full-service branches with all options at all locations? It would depend on the usage pattern. Consider Akamai, providing edge caching; it works mainly for corporates with largely static pages, while dynamic traffic still has to be routed back to the corporate servers. So if cookies and pretzels are all we want to sell at remote counters, then by all means. Though remember that reducing customer travel time was not a business objective for this exercise; we could probably achieve a similar result by acquiring adjacent space at the current location, and maybe do even better if there are any synergies.
4. Subway's sequential job is a case of business taking priority over performance. Performance does induce mass production and may be de-humanizing, with conveyor belts, tokens and so on. There is no reason why Subway could not build multiple parallel rows, but there is too much cost associated with it. Their business model is that customers will wait for the service, and they do.
5. With batches I was pointing to the other extreme, where you cause a decrease in responsiveness. Consider when french fries run out during off-peak hours and you want to accumulate some orders before you cook again. Sure it is efficient, but time-conscious customers who come early in off-peak hours hoping for quick service would be disappointed.
6. More resources should help if they are indeed the bottleneck.
Again, there is no one answer, as it is always contextual.