
Monday, 3 May 2010

Hey EMC? Rhubarb! I say Rhubarb!

Now contrary to popular belief I don't actually like calling out specific things or people, but I'm afraid I feel compelled to call out something. I shall do this using a recent CloudCamp by-law convention established by @swardley on Feb 8th 2010...

For the publication of the "Savings from their IT Investments" press release I hereby shout a claim of "Rhubarb!" firmly in the face of EMC and their PR team.

Now to be clear my call of "Rhubarb!" refers mainly to the specific 3rd paragraph, namely :-
Optimizing performance and cost reduction for Oracle Database 11g deployments with EMC Symmetrix® V-Max™ or EMC CLARiiON® CX-4 networked storage systems and EMC FAST to automatically adjust storage tiering as Oracle workloads change. This results in up to 30 percent lower acquisition cost and up to 45 percent lower operating cost in hardware, power, cooling and management over a three year period.
Additionally @sakacc also makes reference to this in the 5th paragraph of his blog post here re :-
Oh – we also showed how using Fully Automated Storage Tiering, Solid State storage, and deduplication using Data Domain we could lower the acquisition cost by 30%, and the operating costs of Oracle 11g by 45% – while delivering equal or better performance.
I'd like to believe these papers were created by engineers with good intent and sound principles - I can certainly vouch for Chad having those qualities - and it is positive to see such materials being published, however...

I do notice that the two percentage claims above have subtle but key differences in their content - the first stating storage technology acquisition and (mainly) environmental costs, the second referencing 'operating costs of Oracle 11g'.

I've read the PDF linked at the start of the above paragraphs, and re-read it, and I'm afraid I can't find any reference to a number of things :-

  • The actual cost savings percentage values called out in the press release text
  • The baseline that has been used for the comparison savings statements above
  • Costs of components, technologies & software used - clearly showing the standard FoC and additional cost elements for each option (eg cost of FAST, cost of SSD etc)
  • Any form of ROI and TCO models used to underpin the two statements in the document making reference to 'improving TCO'
  • The list of assumptions and/or pre-requisites used in the models re savings (re facilities costs, FTE costs, frequency & duration of administration or change tasks, support & deployment operating model RACI etc)
  • Savings for other alternatives (eg using thin-provisioning & wide-striping on the vmax rather than FAST, using a CX instead of a VMax, using both a CX & VMax etc)
  • Impacts of other technologies (eg Oracle on NFS, Database compression, Flash Cache etc)
  • Impacts of the use of two independent arrays and ASM "NORMAL REDUNDANCY" for the replication of data between the arrays (ie rather than the cost of SRDF), and then how the independent operation of FAST on each array may impact performance predictability under disk failure situations. (with different read & write IO profiles potentially driving the independent policy engines into different decisions)
  • Some context definitions as to what capacity the paper regards as 'large' Oracle databases (eg 10TB? 50TB? 150TB? etc) (page 5 of pdf)
  • How this is impacted by the capacity of the databases being handled by the storage structure?
  • Impacts of the rate of database capacity growth on the proposed model (eg a rapidly expanding DB, when the DBs exceed the capacities of each tier etc)
  • The RPO & RTO requirements & assumptions for these databases, and impacts / sensitivity of changing them
  • The support operational SLA context for the database services (eg permissible time for response, resolution and return to normal operation, performance & risk after an incident)
  • Any specific performance related numbers (eg actual response ms times required for SLA, throughput/sec, IOPs etc), and how they change (given we don't know much about the transaction being measured)
  • No details to the NFR (performance, resiliency, capacity etc) impacts during the FAST migrations, or indeed how long the migrations took to complete each time
  • A slight puzzle for me is the use of Raid 5 3+1 in the vMax, given all the previous statements from EMC re default preference for Raid 6
  • Then on page 30 of the doc the SSD RAID type is stated as R5 7+1, versus the R5 3+1 of the other disk types (impacts on performance, risk & perf degradation during rebuild?), but at the bottom of figure 25 & the middle of figure 26 the screen-shots appear to show the SSD as being R5 3+1?
  • I'd be interested in seeing how including the pool allowed to use the SSD would change the performance - given a lot of Oracle DB perf issues come from redo log bottlenecks
  • The more likely use-case for me would be migration of storage within a single database's data structure (eg some of the database data-files active some inactive etc hence some on SSD, some on FC and some on SATA for the same DB)
  • As a minor side point it doesn't mention the version of ASM being used
  • Sadly as usual with EMC there are no actual details of the reference benchmarks being used for the workload simulation - really would be good if they published their suite of benchmark tools
  • The paper makes reference to "we used an internal EMC performance analysis tool that shows a 'heat map' of the drive utilisation of the array back end" - why isn't this confidence validation view & tool available to all customers as part of the standard software for all array types? (after all, at worst it would be a sales tool to help justify the purchase of the additional FAST software licences???)

Now I'm not saying the reports are wrong; rather I'm saying that I think they are incomplete, that they appear to have little or no direct linkage to the claims being made in their name by the PR teams, and that they certainly don't give a full context picture. This additional context & information is needed for enterprises to get sufficient comfort in the technologies to perform their own benefit opportunity & impact assessments.
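
To make the point concrete, below is a minimal sketch (in Python, with entirely hypothetical numbers of my own invention) of the sort of baseline-vs-option comparison I'd expect to sit behind such percentage claims - note that the baseline, the component costs (FAST licence, SSD etc) and the operating assumptions are all explicit, which is exactly the context the paper doesn't provide :-

```python
# Hypothetical baseline-vs-tiered comparison over a fixed period.
# Every number here is an invented assumption purely for illustration.

YEARS = 3

def period_costs(option):
    """Return (acquisition, operating) cost over the comparison period."""
    acquisition = sum(option["capex"].values())
    operating = YEARS * sum(option["opex_per_year"].values())
    return acquisition, operating

baseline = {  # eg an all-FC configuration without FAST
    "capex": {"array_hw": 900_000, "fc_drives": 500_000, "base_sw": 150_000},
    "opex_per_year": {"power_cooling": 60_000, "admin_fte": 90_000, "maintenance": 70_000},
}

tiered = {    # eg a FAST + SSD + SATA mix, including the extra licence costs
    "capex": {"array_hw": 900_000, "mixed_drives": 200_000, "base_sw": 150_000,
              "fast_licence": 60_000, "ssd": 120_000},
    "opex_per_year": {"power_cooling": 40_000, "admin_fte": 60_000, "maintenance": 65_000},
}

base_acq, base_op = period_costs(baseline)
opt_acq, opt_op = period_costs(tiered)
print(f"Acquisition cost reduction: {1 - opt_acq / base_acq:.0%}")
print(f"Operating cost reduction over {YEARS} years: {1 - opt_op / base_op:.0%}")
```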

So marks out of 10? So far 6/10 with a caution for "unjustified PR marketing abuse" - but very willing to review the score upon someone pointing out what I may have missed, a revised draft, justification of claims & clarification...

Sunday, 25 April 2010

Feature stacks and the abuse of language


So the new financial year heralds the season of vendor conferences, and - as night follows day - over the horizon, like the four horsemen of the apocalypse, approaches the associated marketeering storm that always comes with such conferences.

Sadly one trend I'm seeing more of from the (increasingly desperate?) IT infrastructure industry is aspirational future feature stacking, where endless features are announced haphazardly into the mix in an attempt to justify new revenue streams; naturally the delivery of these features is in a different year/decade to when they are announced, let alone when any actual benefit might be realised.

Of course the first challenge to this is trying to convince customers of the vital importance of features they haven't heard of before, often for problems they never knew they had. So some use fictional stories in order to try to paint a picture of utopia as a result of paying for their magic liquor, some just plaster the industry with noise, others abuse marketing terms in new ways, and some use all of the above.

A common area glossed over is the 'initial ingress disruption' required to achieve such utopia features - especially given the likely useful life of the 'nirvana function'© versus the duration of the benefits case and the lifetime of said feature.

The benefits case is an interesting point in its own right - remember these are the vendors that often still haven't a clue about the TCO or ROI for their products several years after they were announced. Naturally there is little or no mention of the financial costs involved, ingress & egress disruption, organisation & technology process changes, operating model changes, and increasingly, the business process changes needed to use this fictional future widget function.

Now you wouldn't expect otherwise, but of course there is little mention of either the existing abilities to solve this problem other ways, or that the effort & resources might be better invested elsewhere (ie higher up) in the technology solution stack, or that the symptom could be avoided entirely if the cause were addressed with better application design. My view has always firmly been that infrastructure can provide at best single digit % improvements, whereas changes in the application layer can provide double digit % improvements.

Always just snow-ploughing the data problem symptom around rather than addressing the cause - of course you can't fault the bottom-feeding tin vendors for offering this solution, there is always some legacy application that can benefit from any improvement; but frankly the infrastructure companies don't have many other options and there is always somebody that'll buy anything.

So there's plenty of noise, lots of definition and understanding confusion & plenty of widget functions. Indeed it's nothing new for companies to start abusing words and terms in a desperate hope to generate excitement and differentiation - yet normally this just further confuses the market (remember when a word typically had one clear, obvious and innocent meaning???).

Some recent history of definition & language abuse could be :-
  • 'Cloud' - the NIST definition has worked to a certain extent, but IT companies have abused the hell out of it.
  • 'Virtualisation' has some common understanding in the server world, but as usual the storage world is chaos.
  • Now along wanders 'Federation' as the latest word to be put through the hype & definition mangler.
I'd really encourage the use of the relevant standards bodies to help create common industry definitions for the terms used, always provide clear & transparent context and always detail the assumptions & pre-requisites with any form of benefits discussion. Rather than using hypothetical stories and definition abuse, I'd much rather companies explicitly provide the following: -
  • The specific customer requirements & problems this addresses & justify how
  • The use cases this feature / function applies to, and those that it doesn't
  • Why & how this feature is different to that vendor's own previous method for solving this problem
  • Provide clarity over the non-functional impacts of the feature before, during & after its use - ie impact on resilience, impact on performance, concurrency of usage etc (including up-front details of constraints)
  • Provide the before & after context of the benefit position, clearly explain the price of the benefit change and any assumptions or prerequisites needed to use the feature
  • Provide some form of baseline & target change objective for entire process steps impacted
  • Confirm the technology costs and cost metric model for this feature
  • Naturally you'll also expect me to require TCO & ROI of the feature, and any changes to the models as a result of this feature
To take an example, one key element being touted by 'federation' is 'non-disruptive migration' - something I'm very much in favour of. However a) for many this can already be done through the use of the de-facto volume manager & file-systems, but b) the real issues associated with migration are 'remediation' and CABs. With most CABs nowadays being based on risk, and commonly used as process validation gates, it's hard to understand how 'federation' helps change approval boards (especially when you consider that lots of CABs still require engagement for moving hypervisor guest images). For the 'remediation tech refresh' use case of federation there will need to be a lot of changes in the vendor support & interop processes, culture, responsibilities and agreements for this to be of use. If the host still requires any material remediation (eg HBA change, firmware changes, OS patches, server model change, VM/FS changes etc) then moving the bytes stored on the rust, whilst good, does little to address the majority of the problem. Let's not forget all the other associated OSS processes that have to be engaged - eg ICMS/CMDB updates, asset & license management registers, alert & monitoring tools, networking planning & bandwidth management etc. Yes in the world of the automated dynamic data-centre these related issues will be improved, but that's a future state after a lot more investment & disruption.

If this sounds overtly negative that isn't the intent. The issue for me is that any 'nirvana function'© is normally only of use if it makes a net positive change to the cost of BAU service or change. In order to prove that we need to understand how it impacts the steps, effort & duration for each item in the transition from 'desire to delivery' (eg from when somebody thinks they may need some capacity to when they are able to actually use it). From my experience this sequence involves a mix of commercial, technical, political, emotional & financial steps - and very few companies seem able to show the steps in this sequence and how their function changes them.
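
To illustrate, here's a toy model of that 'desire to delivery' sequence (the step names and durations are purely my own assumptions, not any vendor's real process) - the interesting output is how little the end-to-end lead time moves when a widget only improves one step :-

```python
# Toy 'desire to delivery' sequence: (step, days before, days after the new feature).
# All steps and durations are invented assumptions for illustration only.
desire_to_delivery = [
    ("business case & funding approval",        15, 15),  # commercial / financial
    ("architecture & capacity planning",         5,  4),
    ("procurement & delivery",                  20, 20),
    ("change approval board (CAB)",              7,  7),  # unchanged by most widgets
    ("implementation & migration",               5,  1),  # where the feature claims to help
    ("OSS updates (CMDB, monitoring, billing)",  3,  3),
]

before = sum(b for _, b, _ in desire_to_delivery)
after = sum(a for _, _, a in desire_to_delivery)
print(f"End-to-end lead time: {before} days before, {after} days after "
      f"({(before - after) / before:.0%} reduction)")
```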

Now I'm very much one for focusing on capabilities and architectures rather than point widget features, but the current trend of announcing aspirations as architectures and then products is a very dangerous and steep curve downhill. Like an iced wedding cake made from cards built on a sandy beach - this obsession with feature stacking promises everything, but the benefit delivery regularly lasts for only a few minutes before collapsing in an ugly mess.

Are suppliers hoping that by increasingly frequently hyping the shiny shiny baubles of the progressively distant future they will distract us from the factual reality of today? Remember today was the future of yesterday, and how many of the past's 'nirvana functions'© promised by these same charlatan vendors actually came halfway true?

If only these vendors spent time & resources making the existing features usable, simplifying the stack, resolving the interop issues, giving clear context and being able to actually justify their claims, rather than building their own independent leaning towers of Pisa from which they can throw mud at each other...

Wednesday, 14 April 2010

Large slices of pie do choke you!

So a new blogger called "Storage Gorilla" makes a few interesting and well reasoned points here about IBM's XIV (my views re XIV will be in a different blog post) - but a couple that jump out to me are the 'entry size' & 'upgrade size' points about half-way down the text.

Now anybody who's spent time working with me on my company's global storage BOMs will understand that this is a major issue for me, and not something that is getting any easier. The issue is a complex one :-
  • The € per GB ratio becomes more attractive the larger the capacity within an array (as the chassis, interfaces, controllers & software overheads get amortised over a larger capacity) - however of course the actual capex & opex costs continue to be very sizeable and tricky to explain (ie "why are we buying 32TB of disk for this 2TB database??")
  • As the GB/drive ratio increases, the IOPS per individual drive stays relatively consistent - thus the IOPS/GB ratio is on a slow decline, and thus performance management is an ever more complex & visible topic (a toy sketch of both ratios follows this list)
  • IT mngt have been (incorrectly) conditioned by various consultants & manufacturers that 'capacity utilisation' is the key KPI (as opposed to the correct measure of "TCO per GB utilised")
  • DC efficiency & floor-space density are driving greater spindles per disk shelf = more GB per shelf
  • Arrays are designed to be changed physically in certain unit sizes, often 2 or 4 shelves at a time
  • As spindle sizes wend their merry way up in capacity the minimum quantity of spindles doesn't get any less, thus the capacity steps get bigger
  • Software licences are often either managed / controlled by the physical capacity installed in the array, or by some arbitrary combination of capacity licence keys - and these do not change with spindle sizes
  • Naturally this additional capacity isn't 'equally usable' within the array - thus a classic approach has been to either 'short stroke' the spindles or to use the surplus for low IO activity. However in order to achieve this you either have to have good archiving and ILM, or need to invest in other (relatively sub-optimal compared to application-level ILM) technology licences such as FAST v2.
  • Of course these sizes & capacities differ by vendor so trying to normalise BOM sizes between vendors becomes an art rather than science
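
Here's the toy sketch referred to above - the fixed overheads and per-spindle figures are invented, order-of-magnitude numbers, purely to show the shape of the two ratios :-

```python
# EUR/GB falls as fixed array overheads amortise over more capacity,
# while IOPS/GB falls as spindles get bigger (IOPS per drive roughly constant).
# All values below are invented assumptions for illustration.

FIXED_COSTS = 250_000   # chassis, controllers, interfaces, base software (EUR)
COST_PER_DRIVE = 800    # EUR per spindle
IOPS_PER_DRIVE = 180    # roughly constant per spindle regardless of its capacity

for drive_gb in (300, 600, 1000):      # spindle sizes
    for drives in (32, 128, 512):      # array fill level
        capacity_gb = drives * drive_gb
        eur_per_gb = (FIXED_COSTS + drives * COST_PER_DRIVE) / capacity_gb
        iops_per_gb = (drives * IOPS_PER_DRIVE) / capacity_gb
        print(f"{drives:4d} x {drive_gb:4d}GB: "
              f"{eur_per_gb:6.2f} EUR/GB, {iops_per_gb:5.2f} IOPS/GB")
```
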
So what does this all mean?
  • Inevitably it means that the entry level capacity of arrays is going up, and that the sensible upgrade steps are similarly going up in capacity.
  • We are going to have to spend more time re-educating management that "TCO per GB utilised" is the correct measure
  • Vendors are going to have to get much better at sizing software & functionality licensing so that it much more closely matches the unit of granularity required by the customer
  • All elements of array deployment, configuration, management, performance and usage must be moved from physical (ie spindle size related) to logical constructs (ie independent of disk size)
  • Of course SNIA could also do something actually useful for the customer (for a change), and set a standard for measuring and discussing storage capacities - not as hard as it might appear, as most enterprises will already have some form of waterfall chart or layer model to navigate from 'marketing GB' through at least 5 layers to 'application data GB' (a simple example is sketched after this list)
  • Naturally the strong drive to shared infrastructure and enterprise procurement models (as opposed to 'per project based accounting') combined with internal service opex recharging within the enterprise estate will also help to make the costs appear linear to the business internal customer (but not the company as a whole)
  • The real prize though will be a vendor that combines a technical s/ware & h/ware architecture with a commercial licence & cost model that actually scales from small to large - and no I don't mean leasing or other financial jiggery pokery
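
As an example of the capacity waterfall mentioned above, here's a minimal sketch - the layer names and loss factors are assumptions for illustration only; every enterprise's chart will differ, which is rather the point :-

```python
# Walking a 'marketing GB' figure down to usable 'application data GB'.
# Layer names and loss factors are illustrative assumptions only.

marketing_gb = 100_000  # what the datasheet says (decimal GB)

waterfall = [
    ("binary GiB conversion",          1000**3 / 1024**3),  # ~0.93
    ("RAID / parity overhead",         0.75),               # eg R5 3+1
    ("hot spares & system reserve",    0.95),
    ("snapshot / replication reserve", 0.85),
    ("filesystem & volume overhead",   0.90),
]

capacity = marketing_gb
print(f"{'marketing GB':35s} {capacity:10,.0f}")
for layer, factor in waterfall:
    capacity *= factor
    print(f"{'after ' + layer:35s} {capacity:10,.0f}")
print(f"{'application data GB':35s} {capacity:10,.0f}")
```
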
So I wonder which vendor will be the first one to actually sit their licensing, commercial & technical teams all together at the start of a product's development, then talk with & listen to customers, and actually deliver a solution that works in the real enterprise to enable scaling from small to large in sensible units? I'm waiting...

Monday, 8 February 2010

TCO - Why is it so hard for some?

Now as my previous blog entries show-me-money-information & tco-time-for-opensource-framework discuss, I have to make architecture & standards changes & decisions based upon TCO & ROI calculations. Accordingly I require vendors to be able to demonstrate that they understand the TCO/ROI of their products & architectures (I have my views but need to understand theirs and validate / align forecasts), and of course to provide me with copies of their models & values.

Okay, so 4 months ago I renewed my simple request to EMC for a TCO model comparing DMX+CX with VMax (in essence to compare 'between box' vs 'within box' tiers). A simple enough request I thought - and one I made initially several years ago (at that time comparing DMX to DMX+CX), but never got anywhere. This time a specific project planning to purchase PB+ of capacity drove me to renew this request.

Now four months on and, despite me chasing, I still haven't received anything from EMC, nor have I even been given an estimate as to when / if I might see something. So I'm forced to conclude that Uncle Joe and the Elusive Mathematical Calculators are either: -
  • Ignoring the request
  • Not understanding (or caring) about TCO & ROI, preferring to focus on leasing or 'regular technology refresh purchase justification business cases'
  • Unable to explain the customer value of their different products and architectures
  • Hiding something
  • Preferring slick & vocal marketing to facts
  • Trying to hire somebody to work on the topic
To me nowadays EMC are a company of conflicts; some of the things they do (& have) are the best bar none, other things sadly are the worst. As @storagebod pointed out here in social-climbing, EMC are indeed changing for the positive, but I suspect not as fast or thoroughly as they/we would like. Which means ultimately, despite changes, they are still a monthly/quarterly financial engineering orientated engine with a sales & target structure that 'reverts to past form' when deals are being discussed.

Now EMC aren't alone in this; to compare, here's how some other companies have reacted to similar requests :-
  • Netapp are sadly still trying to understand the question from a couple of years ago.
  • Cisco are still searching for unicorns to breed, and admitted at NetWorker2010 that it will be a couple more months before anything surfaces. I've been requesting the ROI & TCO of the California/UCS platform for over a year (yes well before it went public), so I'm mystified that nothing yet exists as a model.
  • However on the positive front, HDS immediately answered, providing David Merrill and his team, who arrived with a variety of information, models and reviews. Lots of dialogue and transparency, and a variety of TCO & ROI models provided. So the request is possible and some do understand.
What some companies (or parts of companies) still appear to fail to grasp is that the 1990s tactics of poor marketing, shouting loudly, 'special for you today only' sales negotiations, and 'influencing' ISVs or mngt simply won't work any more. Customers need more data & benefit forecast models nowadays in order to justify usage or purchase decisions, and I fail to believe that these models aren't used when a dev team is seeking approval to create the product in the first place! I'd like to be less cynical or disappointed, but without the information to support the claims from some vendors it's hard not to be...

Monday, 21 September 2009

TCO - Time for an opensource framework?

So in my job I regularly see what each vendor claims to be a 'TCO model' - now funnily enough these normally show that the vendor's widget is much better than the competitor's other widget. Naturally each model has some elements in it that the others don't or places a certain weighting / emphasis on particular attributes that others don't.

Now this is a topic that is very close to my heart, as all standards and strategy changes I make in my company are supposed to be TCO based - with us not making any changes unless they improve our own actual TCO. Naturally this breaks when vendors EOL products or TCO isn't the driver - but the principle is valid (although sadly a surprise to many people).

Now on @om_nick's blog here http://www.matrixstore.net/2009/09/17/defining-an-up-to-date-tco-model/ he reminded me that I had a draft blog on this, and that 'crowd-sourcing' such models can work quite well. There's plenty of good attributes listed so far on Nick's blog and I'm sure we'll all add many more as time goes on (I know I must have a good dozen or so TCO Excel models knocking around somewhere).

Of course I know that TCO isn't always the right measure, and that ROI or IRR can often be just as valid, but for lots of elements of infrastructure the first port of call is a TCO or CBA - and making those consistent would be a great starting point!
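
For what it's worth, here's a toy illustration (made-up investment and savings figures) of how ROI and IRR fall out of the same savings forecast that a TCO or CBA would be fed from :-

```python
# Made-up cash flows: an up-front investment followed by annual savings.
investment = 300_000
annual_saving = 120_000
years = 4
cash_flows = [-investment] + [annual_saving] * years

def npv(rate, flows):
    """Net present value of a series of annual cash flows at a given discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows, lo=0.0, hi=1.0, tol=1e-6):
    """Discount rate where NPV crosses zero, found by simple bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

roi = (annual_saving * years - investment) / investment
print(f"Simple ROI over {years} years: {roi:.0%}")
print(f"IRR: {irr(cash_flows):.1%}")
```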

One thing I'm very sure about is that for each technology category there is more than one 'level' to measure a TCO at for different purposes, for example :-
  • Industry average TCO - ie what does a GB of data cost to store for x hours on average in the industry? (the analyst KPI - and product / vendor agnostic)
  • Estate average TCO - ie what does a GB of data cost to store for x hours in my company on average? (the CTO level KPI - and product / vendor agnostic)
  • Architecture average TCO - ie for this type of reference design (inc Functional & Non-Functional Requirements) what does a GB of data cost to store for x hours in my company on average? (the architect level KPI) This is product / vendor agnostic and used for ROM costing and selection of an infrastructure architecture.
  • Category average TCO - ie for this class of product (eg modular storage, enterprise storage, small x86, medium unix etc) what does a GB of data cost to store for x hours in my company on average? (the catalogue level KPI) This is now technology 'class' specific, but still product / vendor agnostic, and is used for building up the ROM & architecture costs above.
  • Product TCO - ie for this specific vendor product & version what does a GB of data cost to store for x hours in my company on average? (the product level KPI) This is now product and vendor specific, and is used for selecting product within a category (ie direct product bake-offs).
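
As a tiny illustration of how the same KPI (say € per GB per month) rolls up through these levels by capacity-weighting each layer - the products, capacities and rates below are invented :-

```python
# Capacity-weighted roll-up of a per-product KPI to category and estate level.
# Products, deployed capacities and rates are invented for illustration.
products = [
    # (product, category, deployed GB, EUR per GB per month)
    ("vendor A enterprise array", "enterprise storage", 400_000, 0.22),
    ("vendor B enterprise array", "enterprise storage", 250_000, 0.19),
    ("vendor A modular array",    "modular storage",    600_000, 0.09),
    ("vendor C modular array",    "modular storage",    300_000, 0.11),
]

def weighted_rate(rows):
    total_gb = sum(gb for _, _, gb, _ in rows)
    return sum(gb * rate for _, _, gb, rate in rows) / total_gb

for category in ("enterprise storage", "modular storage"):
    rows = [r for r in products if r[1] == category]
    print(f"Category average TCO ({category}): {weighted_rate(rows):.3f} EUR/GB/month")
print(f"Estate average TCO: {weighted_rate(products):.3f} EUR/GB/month")
```
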
There are many tricky parts in a TCO model, including :-
  • What to measure? (both what is desired, and what is actually possible over time)
  • How to measure?
  • Where to measure?
  • How frequently to measure?
  • What relative weighting to give?
  • What TCO output KPIs to give? (eg € per GB, GB per kW, € per IOP etc)
  • How to communicate such KPIs without creating dangerous context-less sound-bites for people to abuse (ie my absolute hatred of the phrase 'utilisation' - it's utterly meaningless without context!)
  • How to ensure transparency & clarity over assumptions and driving inputs?
  • How to value / compare functionality when there are no direct equivalents?
  • How to handle 'currently familiar' or 'keep same' (ie low cost of introduction) Vs 'new vendor & widget' (ie disruption & short term duplication of costs / disruption etc)?
  • How to handle 'usefulness'? (eg performance is an NFR that has value - does 'IOPS per GB per €' work?)
  • How to build a feedback loop and refinement model to periodically measure and validate TCO predictions Vs actuals, and take action accordingly?
  • How to protect confidential or sensitive values?
Getting a common list of assumptions, factors and attributes, relative weightings and of course values for all of these is the absolute key and a very valuable exercise for all - customer and vendor alike.
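
As a conversation starter, here's a minimal sketch of what the skeleton of such an 'open source' model might look like - named attributes carrying their value, unit, source and weight, plus a couple of the output KPIs mentioned above. The attribute names and numbers are placeholders, not a proposed standard :-

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str
    value: float
    unit: str
    source: str        # where the value came from - key for transparency
    weight: float = 1.0

# Placeholder attribute list - a real model would carry far more.
model = [
    Attribute("acquisition cost",      400_000, "EUR",  "vendor quote"),
    Attribute("maintenance (3yr)",     120_000, "EUR",  "vendor quote"),
    Attribute("power & cooling (3yr)",  45_000, "EUR",  "facilities rate card"),
    Attribute("admin effort (3yr)",    180_000, "EUR",  "FTE model assumption"),
    Attribute("usable capacity",        80_000, "GB",   "capacity waterfall"),
    Attribute("sustained IOPS",         40_000, "IOPS", "benchmark (disclosed!)"),
    Attribute("power draw",                 12, "kW",   "measured"),
]

costs = sum(a.value * a.weight for a in model if a.unit == "EUR")
capacity = next(a.value for a in model if a.unit == "GB")
iops = next(a.value for a in model if a.unit == "IOPS")
kw = next(a.value for a in model if a.unit == "kW")

print(f"EUR per GB (3yr):  {costs / capacity:.2f}")
print(f"EUR per IOP (3yr): {costs / iops:.2f}")
print(f"GB per kW:         {capacity / kw:,.0f}")
```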

Lastly - one company with a very interesting approach to TCO mngt is www.Apptio.com, who provide a SaaS model for building & maintaining an automated TCO measurement and reporting platform. It's certainly sparked interest in my mind, and I would love to hear more about people's thoughts about or experiences with them.

Now I for one am more than up for spending time on creating an 'open source' TCO model that has many people's input and thoughts into it, that we can refine and revise over time and use to evaluate many vendor technologies - so what do other people think?