Disclaimer: Though it's based on a real customer, this post intends no disrespect to any employees or anyone else working for that customer (you know who you are). If anyone is offended by this, I apologise, but not for expressing my opinion!!
Automation is a good thing, but there are certain scenarios where automation to the Nth degree isn't the answer. This post isn't about any competitor or partner. It's about a customer who wanted to automate the shutdown and deletion of idle development VMs. Not “I want to click a button so that someone oversees this during the initial phase”. They wanted it completely automated. Yep, you heard right. Automation is generally used to, well, automate menial tasks that you do day in, day out, not only to save time but also to ensure that there is no manual error in the process. But when I was told this I immediately thought of
Before we jump to conclusions about what, why and who the hell would do this, a little background on the issue. I have been working with a customer over the last few weeks who has a sizeable development environment. They have had issues with the development team being, well, the dev team: not utilising the resources they have access to, VM sprawl, non-descriptive snapshots, general tardiness, etc. The business wanted to change that by moving to a model that makes better use of resources: VMs are shut down by an automatic process, and the developers are given access to power them back on. The thought is a good first step. But automating the shutdown of VMs is surely not the answer to this ‘business’ problem.
– the first thing that the IT team probably thought when they heard about this. Short answer – VERY!
To reiterate, what is required is to ‘automate the process of reclamation of resources by shutting down idle VMs without any user involvement’. Nope, not even a button will need to be clicked. Here is where the trouble starts. How can you automate something so intricate and so disruptive (not in a good way) to the users? The following questions popped into my head immediately:
- What do you categorise as being ‘idle’? VMkernel does a pretty good job of not giving idle VMs any resources, so why would you want to shut down ‘idle’ VMs?
- How will you stop the developers from scripting something into the VM to trick the process into thinking it is not “idle”?
- Why not educate the developers on a proper approval process for requesting and using VMs?
- How can a baseline be attached to multiple systems? How will it be ensured that the VMs which are shut down won’t affect the VMs which are still running?
- What are the application dependencies, and who is monitoring all the systems to ensure that there is no ripple effect?
- Whose responsibility is it to ensure the consistency of this automated process and to ensure that it will not target production VMs?
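To make the first and last of those questions concrete, here is a minimal sketch of what an ‘idle’ rule might look like. Everything in it is an assumption for illustration: the thresholds, the `VmSample` and `Vm` shapes and the `production` tag are hypothetical, not values taken from the customer or from vROps.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds: pick your own, and expect arguments about them.
CPU_IDLE_PCT = 2.0     # daily average CPU usage below this counts as quiet
NET_IDLE_KBPS = 5.0    # daily average network throughput below this counts as quiet
MIN_IDLE_DAYS = 14     # how many consecutive quiet days make a VM 'idle'


@dataclass
class VmSample:
    """One day's averaged metrics for a VM (hypothetical shape)."""
    cpu_pct: float
    net_kbps: float


@dataclass
class Vm:
    name: str
    tags: List[str]
    daily_samples: List[VmSample]  # oldest first


def is_idle(vm: Vm) -> bool:
    """A VM is 'idle' only if each of the last MIN_IDLE_DAYS samples is quiet."""
    recent = vm.daily_samples[-MIN_IDLE_DAYS:]
    if len(recent) < MIN_IDLE_DAYS:
        return False  # not enough history to judge
    return all(
        s.cpu_pct < CPU_IDLE_PCT and s.net_kbps < NET_IDLE_KBPS for s in recent
    )


def is_shutdown_candidate(vm: Vm) -> bool:
    """Never target production, whatever the metrics say."""
    if "production" in vm.tags:
        return False
    return is_idle(vm)
```

Note how little it takes to defeat a rule like this: a scheduled job inside the VM that keeps average CPU just above the threshold is enough, which is exactly the second question above. And nothing in it knows about application dependencies.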
As you can see, it’s not easy automating something. Nutanix do it easily (although there must be hours and hours of QA and testing behind the 1 click process) because they know the dependencies within their systems and the systems they support. Automating something completely is definitely not a good first step, especially when it comes to shutting down or powering down VMs. After all, the main reason for designing any VMware environment is to ensure that the VMs remain powered on. There needs to be a capability to audit who shut down the system. If it’s automated by a script or workflow, then there had better be a damn good reason for it.
So let's look at the solution then. The solution we came up with (EMC + VMware team, you know who you are guys 🙂 ) was to call a vCO workflow from vROps that automatically shuts down the VMs. It is triggered by an alert created within vROps based on utilisation and idle time, grouped by application or by any other criterion. And oh yeah, the workflow includes emailing the user that their VM has been shut down.
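For readers who think better in code, the shape of that flow can be sketched roughly as follows. This is not the actual vCO workflow (which is not being published) and it calls no real vROps or vCO API: the `IdleAlert` payload and the `shut_down` and `send_email` callables are hypothetical stand-ins that you would wire to your own tooling.

```python
import logging
from dataclasses import dataclass
from typing import Callable

# Audit trail: an automated shutdown should still answer "who shut this down, and why?"
audit_log = logging.getLogger("idle_vm_reclaim")


@dataclass
class IdleAlert:
    """Hypothetical payload of an 'idle VM' alert."""
    vm_name: str
    owner_email: str
    idle_days: int


def handle_idle_alert(
    alert: IdleAlert,
    shut_down: Callable[[str], None],
    send_email: Callable[[str, str, str], None],
) -> None:
    """Shut the VM down, record the reason, then tell the owner."""
    shut_down(alert.vm_name)
    audit_log.info(
        "automated shutdown of %s after %d idle days",
        alert.vm_name,
        alert.idle_days,
    )
    send_email(
        alert.owner_email,
        f"Your VM {alert.vm_name} has been shut down",
        f"{alert.vm_name} was idle for {alert.idle_days} days and was powered "
        "off automatically. You can power it back on yourself.",
    )
```

Passing the shutdown and email steps in as functions keeps the sketch testable with fakes, which matters when the real action is powering off somebody's VM.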
In my personal opinion, this whole thing was a complete waste of time, as I doubt anyone would want to shut down ‘idle’ VMs unless the VMs are licensed by the number of instances powered on. Hence the title of the post: Automation is a good thing... generally speaking. I may be jumping the gun here in saying that “this is bullshit”. But then again, I’ve never been a fan of having technology solve a problem that needs to be addressed by a business process or general user behaviour.
Thanks to Nathan Wheat (@wheatcloud), vCO GURU, for this awesome workflow that enables the customer to use the automatic shutdown process. This is just a pictorial representation of the workflow; the actual code will not be made available. But it will give you an idea of how to use vROps and vRA to maintain the lifecycle of the VM in vRA.
Word of caution: If you plan on using this, ensure that you fully understand the repercussions of automating shutdown for idle VMs.