Why does CORE fail? Part 2
… Continuation of my previous post on CORE deficiencies and how it could be improved upon.
What is CORE?
Let’s look at the originally defined CORE equation.
CORE = (S x R x V) / (C x tc)
where,
S = The capacity being reduced in TB. Dave in his post fixes the S value at 100TB to compare all solutions.
R = The percent reduction achieved. Dave shows R as a decimal for the different solutions, so we can assume that, although R is described as a percent reduction, the decimal value of R is used in calculating CORE.
V = The value of capacity being saved. Although Dave doesn’t list the V values used for the different solutions, it is not difficult to reverse-calculate them from the other parameters listed in his table.
C = The cost of solution doing the reducing.
tc = The elapsed time to compress the capacity. As covered in my last post, I consider this parameter to be stated incorrectly, incorporated inappropriately and irrelevant to CORE. A better parameter in its place would have been the elapsed time to write.
Three things stand out in this CORE equation:
- The CORE equation assumes a first-order relationship with its variables. It may seem that, for a specified value of S, a high CORE score can be achieved by increasing the data reduction (R) and the value of capacity being saved (V) and by reducing the cost of the solution (C) and the time to compress (tc).
- The CORE equation has variables (S, R, V) in the numerator that are normalized for solutions without data reduction, but no such adjustment is made for the variables (C, tc) in the denominator.
- The CORE equation is composed of dependent variables instead of independent variables.
Isn’t V dependent on S and R?
V = Sr x Ct = S x R x Ct, where Sr = S x R is the capacity saved and Ct is the cost per TB of capacity.
Substituting V in the original CORE equation,
CORE = (S^2) x (R^2) x Ct / (C x tc)
To a large extent, this modified CORE equation is composed of more independent variables than the original one. Obviously, it is no longer a first-order relationship with S and R.
What is interesting about the CORE equation is that the amount of data reduced is included twice: once as the amount of data reduced and again as part of the cost of the amount of data reduced.
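To make the second-order dependence concrete, here is a minimal Python sketch. The function names and every input value (Ct, C, tc) are made up for illustration and are not taken from Dave’s table; only the two formulas come from the text above.

```python
import math


def core_original(s_tb, r, v, c, tc):
    """Original CORE = (S x R x V) / (C x tc)."""
    return (s_tb * r * v) / (c * tc)


def core_substituted(s_tb, r, ct, c, tc):
    """CORE after substituting V = S x R x Ct."""
    return (s_tb ** 2) * (r ** 2) * ct / (c * tc)


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
S = 100.0      # capacity being reduced, in TB
CT = 1000.0    # assumed cost per TB of capacity
C = 50000.0    # assumed cost of the data reduction solution
TC = 2.0       # assumed time to compress

# Both forms agree once V is derived from S, R and Ct.
v_quarter = S * 0.25 * CT
same = math.isclose(core_original(S, 0.25, v_quarter, C, TC),
                    core_substituted(S, 0.25, CT, C, TC))

# Doubling R from 25% to 50% quadruples the score instead of doubling it.
low = core_substituted(S, 0.25, CT, C, TC)
high = core_substituted(S, 0.50, CT, C, TC)
ratio = high / low

print(same, low, high, ratio)
```

With these inputs the score goes from 6.25 to 25 when R doubles, which is the squared dependence on R described above.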
What is the CORE value for a solution with no data reduction technology?
For,
S = 100 TB,
R = 0% as there is no data reduction,
V = $0 as there is no capacity being saved,
C = 0 as there is no data reduction technology in play so there is no cost of data reduction solution, and
tc = 0 ms as there is no compression of data taking place,
CORE = (S x R x V) / (C x tc) = (100 x 0 x 0) / (0 x 0) = 0/0
CORE = 0/0 is indeterminate; the expression has no meaning.
You may agree that a relevant CORE value for a solution with no data reduction technology should be 0 or 1. It also makes sense, in calculating the value of a data reduction solution, to have a solution with no data reduction as the baseline.
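The baseline case can be reproduced with a few lines of Python. This is only a sketch: the helper name try_core is hypothetical, and returning None for an undefined score is my own convention, not part of CORE.

```python
def core_original(s_tb, r, v, c, tc):
    """Original CORE = (S x R x V) / (C x tc)."""
    return (s_tb * r * v) / (c * tc)


def try_core(s_tb, r, v, c, tc):
    """Return the CORE score, or None when the expression is undefined."""
    try:
        return core_original(s_tb, r, v, c, tc)
    except ZeroDivisionError:
        # C x tc is zero, so the score cannot be computed.
        return None


# Baseline with no data reduction: R = 0, V = 0, C = 0, tc = 0.
baseline = try_core(100.0, 0.0, 0.0, 0.0, 0.0)
print(baseline)  # None
```

Python refuses to evaluate 0/0, so a solution without data reduction cannot even be placed on the CORE scale as originally defined.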
How can we avoid division by zero?
1. Replace tc with tw or (tw + tc)
An equation that takes into account the time to write (tw) instead of, or in addition to, the time to compress (tc) could help avoid division by zero when no compression/deduplication is used, as even a baseline solution with no data reduction will have a non-zero time to write. Either tw or (tw + tc) would be a better choice than tc in the original CORE equation.
2. Redefine tc and tw
Of course, as originally defined in Dave’s post, tc is the time to compress the smallest unit compressed in the solution (e.g. a file, multiple files or blocks), which ignores the variation in tc caused by differences in the size of that smallest unit across solutions. I recommend redefining tw and tc, respectively, as the time to write and the time to compress S or a certain percentage of S, with the value of S remaining the same across all solutions. This removes the parameters’ dependency on the smallest unit compressed and normalizes them across the same amount of S.
3. Redefine C
As originally defined, C is the cost of the data reduction solution. As Dave’s post indicates (“NetApp doesn’t charge for ASIS – we took a percentage of the array’s cost”), we can safely assume that C is only the cost of the data reduction part of the solution, and not of the whole solution. In this scenario C = 0 for a solution with no data reduction, making the CORE value indeterminate again.
An equation that takes into account the total cost of the solution, i.e. the cost of the solution with no data reduction plus the cost of the data reduction solution, will help avoid division by zero. Of course, for a data reduction solution that uses existing storage, the total cost of the solution will be the net present value (NPV) of the existing storage plus the cost of the data reduction solution. Even better, subtracting the cost of capacity saved (V) from this total, instead of using V in the numerator, gives the net cost of the solution.
A better CORE equation, maybe?
CORE = (S x R) / (C x tw x tr)
where,
S, R and V are the same as originally defined.
C = Net Cost of Solution = Cost of data reduction solution + Cost of capacity used after reduction
Cost of capacity used after reduction = S (1 - R) x Ct = (S x Ct) - (S x R x Ct) = (S x Ct) - V
tw = time to write a pre-defined storage capacity or fraction of S
tr = time to read a pre-defined storage capacity or fraction of S
Of course, some may object that the read/write ratio is not included; there is no reason why it shouldn’t be.
In the end, a CORE equation that is a function of storage capacity (S), percent data reduction (R), net cost of solution (C), read/write ratio, time to write (tw) and time to read (tr) will be more valuable than the originally defined CORE equation. Of course, a lot more work is required to determine the interdependency of these variables.
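As a rough check on the proposed equation, the Python sketch below evaluates CORE = (S x R) / (C x tw x tr) with C as the net cost of the solution. The function names, the cost figures and the time values (with their units) are all hypothetical, and the read/write ratio is left out, as it is in the equation above.

```python
def net_cost(s_tb, r, ct, reduction_cost):
    """C = cost of data reduction solution + cost of capacity used after reduction."""
    return reduction_cost + s_tb * (1.0 - r) * ct


def core_revised(s_tb, r, ct, reduction_cost, tw, tr):
    """Proposed CORE = (S x R) / (C x tw x tr), with C as the net cost."""
    c = net_cost(s_tb, r, ct, reduction_cost)
    return (s_tb * r) / (c * tw * tr)


# Hypothetical inputs: 100 TB, an assumed 1000 per TB, and assumed
# times of 10 and 5 units to write and read the same fixed capacity.
S, CT, TW, TR = 100.0, 1000.0, 10.0, 5.0

# Baseline with no data reduction: R = 0 and no data reduction cost.
baseline = core_revised(S, 0.0, CT, 0.0, TW, TR)

# A solution with 50% reduction and an assumed 20000 for the data reduction part.
reduced = core_revised(S, 0.5, CT, 20000.0, TW, TR)

print(baseline)            # 0.0
print(reduced > baseline)  # True
```

Because the net cost and the read and write times are non-zero even without data reduction, the baseline now gets a well-defined score of 0 instead of 0/0.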
Why does CORE fail? Part 1 - Response
Steve Kenniston of Storwize made a detailed comment in response to my last post, Why does CORE fail? Part 1. I thought my response to his comment deserved a separate blog post. Frankly, I haven’t kept up with developments at Storwize since May 2007, when I last wrote a series of blog posts on Storewiz, so I don’t claim any knowledge of the current Storwize solution.
First, I am not so sure that time to ‘uncompress’ … is a valid parameter IF all solutions are being compared identically,….
The time to decompress/reconstitute is as important as, if not more important than, the time to compress/dedupe. Compression/deduplication can be managed ‘internally’ to keep up with the write expectations of applications and users, whether by delaying writes just enough to allow in-band data reduction, by reducing data after writes complete, or by some hybrid approach. But read expectations must be met in-band, so any decompression/reconstitution needs to take place correctly and completely in the expected time. A solution that requires less time to decompress should be rewarded in the same fashion as a solution with a lower time to compress is rewarded in CORE.
… First I think we can all agree that decompression or rehydration is faster than optimization (compression, deduplication). … the performance of time to ‘compress’ (I prefer optimize) and then cut the time in half and call this time to rehydrate. Now apply the formula. I would assume that the new CORE value would come out very close as they are now.
I am not so sure that decompression/reconstitution is faster than compression/dedupe, or that it takes 50% of the time to compress/dedupe, as I haven’t yet heard of a solution or seen data that supports such a claim. Actually, the relationship may be the reverse, especially for solutions with a large amount of compressed/deduped data and a high data reduction ratio. The only related published data I am aware of shows read speed to be a direct function of the smallest unit used for decompression/reconstitution: the larger the unit size, the higher the read speed.
As I asked in my last post, are the times to decompress and compress proxies for the times to read from and write to a data reduction solution? If so, CORE could be improved by including the actual time to read and write (instead of the time to decompress or compress), or by including the time to decompress/compress as a penalty over a normal read/write with a solution that has no data reduction technology; in essence, an additional cost in the form of lower read/write performance in exchange for higher storage efficiency.
Also, without understanding how the solution works it is very difficult to debate the merits of the value of performance on that solution. …
If CORE stays with parameters that can be judged externally for a solution, it will be more relevant and valuable than if it tries to incorporate parameters internal to a solution, like the time to compress (tc). A CORE based on externally measured parameters such as reduction ratio, read and write performance, and cost of solution over a range of storage capacities and time may produce a better value indicator. Any attempt to include internal mechanisms weakens CORE because of the lack of complete information about, and understanding of, every solution and the rapid changes in the technology and techniques incorporated in such solutions.
How can you possibly say that a post process solution that has users: 1) Buy full storage capacity (vs. less capacity with an inline solution) …… is a good solution? …
Please read my post again. I never claimed that any one solution is better than another. CORE includes the cost of the solution as a parameter, which supposedly should penalize a solution that includes more storage than other solutions require.
Step out of the vendor shoes for a moment and put yourself in the shoes of the customer. Which would you want?
As a customer, I want a solution that provides additional storage efficiency at a reasonable cost while meeting my expectations for read and write performance, safeguarding my data and requiring no additional management overhead. Anything beyond that is the vendor coloring the customer’s expectations to fit its solution.
Why does CORE fail? Part 1
Recently, David Vellante at Wikibon wrote in his blog post Dedupe Rates Matter … Just Not as Much as You Think about his Capacity Optimization Ratio Effectiveness (CORE) value for ranking dedupe/compression/capacity optimization solutions. He also applied CORE to a few dedupe solutions for primary storage.
As I commented on his blog, I noticed right away that the CORE formula was missing an important parameter: the time to uncompress/reconstitute (hereafter referred to as time to uncompress) deduped data. It is an important parameter that affects the rate at which applications/users read data from a dedupe solution. As uncompression needs to happen inline for both inline and post-processing solutions, logically there will be no major discrepancy in using the time to uncompress and the time to read data from a dedupe solution interchangeably.
Any Vendor Strategy, why not?

Initially, I was going to post a comment on Chris Evans’ recent post 2V or Not 2V (vendors this is). As the comment grew longer, I decided to turn it into a blog post of my own. Chris succinctly covered the operational aspects and challenges of a multi-vendor strategy.
The challenge is how deep you go in your environment with multiple vendors. Do you want to have multiple vendors for,
- only large items like storage subsystems?
- smaller stuff like HBAs and switches too?
- commodity type stuff that has little differentiation among vendors?
- specialized products?
Just because you have multiple vendors doesn’t necessarily give you $ bargaining power. Bargaining power comes with transaction volume, transaction size, transaction frequency and your value to the vendor.
At the smaller end, though you can achieve better operational efficiency by standardizing on a single vendor, you don’t have the volume and size for a single vendor to take you seriously. Unless consolidating all your purchases gives you the volume and size to be valuable to a vendor, why not just buy the best-of-breed solutions?
How much operational efficiency are you going to gain by buying three Clariions versus one Clariion, one 3Par and one Compellent?
At the high end, a single-vendor strategy hinders your ability to adopt innovation and new technologies, with minimal gains in operational efficiency (remember, large teams can be split among multiple vendors if needed), though you may be valuable to the vendor and get better pricing. How much operational efficiency are you going to lose by adding three 3Pars to the couple of dozen AMS you already have?
I have seen, heard and experienced enough horror stories to doubt that either a single-vendor or a multi-vendor strategy is the right strategy for any one organization. I favor an Any Vendor strategy, where your decisions are driven by the best solution that meets your needs and not by a solution from a pre-selected vendor that somewhat meets them.
Peril of Working in Cloud
Sorry! We are experiencing technical difficulties and cannot show all of your documents.

Do you have local backup copies of everything important you store in the Cloud?
Spreadsheet Miscalculation of 30!
Today, I encountered something amusing. I was trying to calculate Factorial 30 (also written as 30!). Don’t ask me why I was trying to calculate a factorial. :-)
As a refresher for those of us who may have forgotten factorials, Factorial 30 is the product of all positive integers between 1 and 30 inclusive.
30! = 30 x 29 x 28 x 27 x 26 x 25 x 24 x 23 x 22 x 21 x 20 x 19 … x 1.
With Microsoft Excel, using both the PRODUCT function and manual multiplication 30 x 29 x …,
30! = 265,252,859,812,191,000,000,000,000,000,000
I got the same result on my MacBook with the Numbers program, using both the PRODUCT function and manual multiplication.
30! = 265,252,859,812,191,000,000,000,000,000,000
I got a slightly different result with Google Docs Spreadsheet, using the PRODUCT function,
30! = 265,252,859,812,191,030,000,000,000,000,000
and yet another result with Google Docs Spreadsheet using manual multiplication.
30! = 265,252,859,812,191,100,000,000,000,000,000
But actually,
30! = 265,252,859,812,191,058,636,308,480,000,000
It appears the spreadsheets are rounding numbers after 15 or 16 significant digits.
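The same effect is easy to reproduce outside a spreadsheet. The Python sketch below assumes, as I suspect but have not verified for each product, that the spreadsheets store numbers as 64-bit IEEE 754 floating point values, which carry roughly 15 to 17 significant decimal digits. Python integers have arbitrary precision, so they give the exact value for comparison.

```python
import math

# Exact value: Python integers are arbitrary precision.
exact = math.factorial(30)
print(f"{exact:,}")  # 265,252,859,812,191,058,636,308,480,000,000

# The same value squeezed into a 64-bit float, as a spreadsheet
# cell would hold it under the assumption stated above.
approx = float(exact)
print(f"{int(approx):,}")      # trailing digits no longer match the exact value
print(format(approx, ".15g"))  # only 15 significant digits shown

# Manual multiplication 1 x 2 x ... x 30 carried out in floating point.
# Rounding can occur at several steps, so the trailing digits may
# differ from the single conversion above.
running = 1.0
for k in range(1, 31):
    running *= k
print(f"{int(running):,}")
```

The first 15 digits agree in every case; everything after that depends on how and when the rounding happened, which would explain why PRODUCT and manual multiplication can disagree in the last displayed digits.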
Adaptec Advisors are Back!
Adaptec’s PR firm sent a note mentioning that Adaptec Storage Advisor’s blog is back! Check it out.
I am also trying to get back to updating my blog after a long hiatus. Hopefully, with some small and quick blog posts on a regular basis, my writing habit will re-establish itself. In the meantime, enjoy the sights from my various trips.
How do you overcome a writing drought?
