Category Archives: VCDX
Before I start: it’s been a while since my last post, mainly because I have been really busy with work and family. Hopefully I will now make it a habit to post something useful once every few weeks.
Disclaimer: This is not the **official** recommendation from Nutanix on Cisco ACI. This is just something that I worked on for a client of ours and thought would be useful for anyone who might end up deploying Nutanix + Cisco ACI + a new vCenter on Nutanix NDFS :)
Problem Statement: Cisco ACI requires OOB access to vCenter to deploy the Cisco ACI networks as port groups in vCenter. vCenter is to be built on NDFS, and NDFS needs a 10 Gb fabric (ideally) from each node. All the uplinks in the leaf switch are controlled by Cisco ACI, but ACI needs vCenter to push out the Management and VM Network VLANs.
Some of you might see this and go uh oh, but let me assure you, this is also a problem in non-Nutanix environments, especially for anyone using IP-based storage with only 2 x 10Gb adapters.
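The chicken-and-egg nature of the problem statement can be sketched as a small dependency graph. This is only an illustration of my reading of the problem: the component names (`aci_port_groups`, `vcenter`, `ndfs_storage`, `node_uplinks`, `mgmt_cluster`) are made-up labels, not anything taken from ACI or Nutanix tooling.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    visiting, done = set(), set()
    path = []

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We came back to a node that is still on the current path: a cycle.
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for dep in deps.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in deps:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Hypothetical labels: "X": ["Y"] means "X cannot come up until Y is available".
deps = {
    "aci_port_groups": ["vcenter"],       # ACI pushes port groups through vCenter
    "vcenter": ["ndfs_storage"],          # vCenter VM lives on NDFS
    "ndfs_storage": ["node_uplinks"],     # NDFS needs the 10 Gb fabric
    "node_uplinks": ["aci_port_groups"],  # uplinks are controlled by ACI
}
cycle = find_cycle(deps)
print(" -> ".join(cycle))

# Moving vCenter to a separate management cluster (assumed to have no
# dependency on ACI) breaks the loop.
deps_mgmt = dict(deps, vcenter=["mgmt_cluster"])
print(find_cycle(deps_mgmt))
```

Any workable design has to remove at least one edge from that loop, which is what both options below try to do.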
There are 2 ways in which we can take care of this.
Option 1: Deploy vCenter in a management-only cluster, which doesn’t depend on Cisco ACI for networking. (Needs separate physical infrastructure for networking and for the management cluster.)
Option 2: Add another dual 10Gb NIC to each of the nodes. (Becomes a lot more expensive when you think of tens of nodes x 4 10Gb adapters.)
Both the above options are quite costly, be it from a networking physical infrastructure point of view or a management only cluster point of view.
So how do we go about solving this?
Simon Long (@SimonLong_) wrote a post here detailing the problem with assuming certain things. I see where he is coming from and I agree with his point. However, I also think there needs to be further clarification as to how to deal with assumptions. Most of the time I see assumptions being put in to ensure that the client provides certain basic infrastructure services, such as AD, DNS, NTP etc. These should ideally be called Pre-Requisites, since they are critical for any VMware or non-VMware production environment to operate effectively. Don’t call them assumptions. Assumptions are used when you have little or no control over certain aspects of a project or an application you are dealing with.
It’s time again for some hopefuls to go through to defending their design. First of all, congrats. Give yourself a pat on the back and a vH5 (virtual High 5). You’ve just reached base camp. The hike gets harder from here. From here on it’s all about how well you’re prepared to weather the storm. It can be a great day with sunshine and chirping birds, or all hell might break loose. It all depends on you.
Having been there a year or so ago myself, I feel the need to address how one should prepare for the VCDX defence. There are plenty of blogs which tell you what to do and what not to do, and plenty of videos on YouTube / vBrownbag etc. This list is something that I had with me originally. It’s not a lot, but it is something that got me started.
There was this awesome session abstract that was submitted for this year’s VMworld by Will Huber (@huberw, VCDX #81) and Tim Gleed (@timgleed, my manager 🙂 ) titled “How to lose a cloud in 10 days”. I would’ve loved to hear from these 2 guys what they thought were critical mistakes for a cloud environment. Unfortunately, the session was not selected for VMworld. So here is my take on that session:
It’s been a while since my last post. Truth be told, I have mostly been busy with work during the last few weeks. But my little one has also been having some medical issues, owing to which I haven’t been able to socialise as much or spend time on blogging, even though I have a backlog of articles in my drafts 🙂
Now let’s look at this. We all know of companies who start with a POC for a product or a technology and then, as mysteriously as can be, it turns into production at the snap of a finger. It is never OK for a POC to turn into production. As long as there is an architect who is worth their salt, they won’t let it happen. Now let’s look at reasons why a cloud POC, or specifically a Hybrid Cloud POC, should never be ‘productionised’.
I have been seriously thinking and prepping for #VCDX-Cloud. It couldn’t have been more different to think about CMA than when I was starting with prep for VCDX-DCV.
Having said that, I came across an interesting discussion on Twitter yesterday where a few guys I know in person and a few I know only through Twitter (you all know who you are) were discussing which VCDX stream one should be focusing on right now.
Disclaimer: I intend no disrespect to any VCDX or CCAr or any high level Vendor Certificate holders !! Rene, no disrespect to you either my friend.
I just recently saw a blog post by Rene (VCDX 133) about PhD in infrastructure design and discussions about it on Twitter by Mike ‘Webscale’ Webster and Frank Denneman. This is what prompted this blog post.
Here is a link for the original post : wp.me/p4znjP-U6
I recently read a blog post by Josh Odgers about the requirement for hardware support, specifically the availability of storage controllers or lack thereof (link: http://www.joshodgers.com/2014/10/31/hardware-support-contracts-why-24x7-4-hour-onsite-should-no-longer-be-required/). So I wanted to share my experience with storage controller availability and how modern storage systems provide availability as well as performance. I have used examples from a system I have worked on extensively (XtremIO) and also from other vendor technology that I have read up on (EMC VMAX and SolidFire).
Nutanix have a very good “Shared Nothing” architecture. But not everyone uses Nutanix (no judgement here). I have also been told that companies mix and match vendors (whoever that is :P). Josh raises a couple of very good points with regard to the legacy storage architecture of having 2 storage controllers processing various workloads in today’s high-performance, low-latency world.
There are a few exceptions to the rule above: EMC VMAX, EMC XtremIO and SolidFire. All these storage systems have more than 2 storage controllers. They all provide scale-up and scale-out architectures. (Note: If I have missed any other vendors with more than 2 storage controllers, let me know and I will include them in the post.) I think the new age storage systems should not be called SANs, because they are not similar to the age-old architecture of SANs providing just shared storage. These days storage systems do so much more than just provide shared storage.
VMware acknowledge this and hence they are introducing vVols, which provide a software definition of the capabilities provided by the storage systems. Hyperconverged is easily the latest technology, and in some cases it is definitely superior to legacy SANs, but it’s not going to replace everything just yet.
Let’s delve deeper into the process. Let’s take the legacy storage architecture and see how it behaves with scale out/up and failure scenarios.
Let’s say a new SAN has been provisioned for a project which has a definite performance requirement, and so it has been commissioned with a limited number of disk arrays. This is an active-active array, so both controllers are equally used.
As you can see, the IOPS requirement is being met 100%, CPU and memory have an average utilisation of 20-25%. So, this carries on for a few months, when another project starts with more IOPS requirements and so more disk is added (traditional arrays: more IOPS = more disk).
As you can see, the average utilisation of the storage controllers in the SAN has spiked to about 45-50% on both controllers. A few months later, another project kicks off or the current project scope is expanded to include more workloads; you can see where this is going. Let’s say that the controllers are not yet under stress and are happily pedalling along at an average of 70% utilisation.
BANG! One of the storage controllers goes down because someone or something went wrong.
So what has happened here is that, until the faulty part is replaced, the IOPS requirement can’t be met by the single surviving controller. It has to absorb the load of both controllers, which spikes its CPU and memory utilisation so high that processing any more becomes impossible.
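The arithmetic behind that failure can be sketched in a few lines. This is a rough model rather than a sizing tool: it assumes load is spread evenly, that utilisation scales linearly with load, and that survivors pick up the failed controller’s share in full. The function name and the four-controller comparison are mine, added for illustration.

```python
def utilisation_after_failure(total_load, controllers, failed=1):
    """Per-controller utilisation once `failed` controllers have dropped out.

    total_load is expressed in "controller-equivalents": two controllers
    each running at 70% carry a total load of 1.4.
    """
    survivors = controllers - failed
    if survivors <= 0:
        raise ValueError("no surviving controllers")
    return total_load / survivors


# Two controllers, each at 70% before the failure (figure from the text above).
total_load = 2 * 0.70

legacy = utilisation_after_failure(total_load, controllers=2)
# Same workload spread over four controllers (assumed, for comparison only).
four_way = utilisation_after_failure(total_load, controllers=4)

print(f"2 controllers, 1 failed: {legacy:.0%} on the survivor")
print(f"4 controllers, 1 failed: {four_way:.0%} on each survivor")
```

Anything above 100% means the surviving controller simply cannot service the workload, which is the situation described above.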
This is where the new age Storage Systems (see I am not calling them SANs anymore) have the upper hand. Let me explain how.
Let’s take EMC XtremIO as an example. Each XtremIO node consists of the following components (you can also read about XtremIO in my previous blog posts).
XtremIO is made up of 4 components: the xBrick, Battery Backup Units, Storage Controllers and an Infiniband Switch. The Infiniband Switch is only used when there is more than 1 xBrick. Each xBrick node consists of the Disk Array unit (25 eMLC SSDs) with a total of either 10 or 20 TB of usable storage. That is before all the deduplication and compression algorithms kick in and make the usable space close to 70TB on the 10 TB cluster and 48 TB on the 20 TB xBrick.
You can’t add more disk to the node; if you want more disk you HAVE to buy another XtremIO node and add it to the XMS Cluster. When you add more than 1 node to the cluster, you also get an Infiniband switch through which all the storage controllers in the storage system communicate.
The picture above shows the multiple controllers in the 2 node XtremIO cluster. (Picture from Jason Nash’s blog). This can be scaled out to 6 node clusters, and there is no limit on how many clusters you can deploy.
Each Storage Controller has dual 8 core CPUs and 256 GB of RAM. This is by any means a beast of a controller. All the metadata of the system is stored in memory, so there is never a requirement to spill the metadata onto the SSDs. In the traditional approach, when the storage is expanded with multiple disk trays, the metadata is also written onto the spinning disks. This not only makes reads and writes of the metadata slow, it also consumes an additional backend IO. When there is a requirement for thousands of IOs, the system just sinks deeper into consuming more IOPS to read and write metadata.
So let’s take the example from above, where more IOPS were required during the second stage of the project lifecycle. If space was also a constraint, the additional XtremIO node doubles the amount of IOPS available and provides an additional 70TB of logical capacity.
Even though there is still an effect on the surviving Storage Controllers, the IOPS requirement is always met by the new age storage systems. This is partly due to specific improvements made to the way metadata is accessed in these new age systems. Let’s look at the way metadata is accessed in traditional systems.
As you can see, metadata is not just in the controller memory but also dispersed across the spinning disks. Regardless of how fast spinning disk is, it’s always going to be slower than getting metadata from RAM.
Now let’s look at how metadata is distributed in XtremIO.
As storage requirements expand, more controllers are added, and the metadata is stored in their memory.
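To put a rough number on the cost of metadata living on disk, here is a minimal sketch. It assumes that every metadata lookup which misses controller RAM costs exactly one extra backend IO, as described earlier; the miss ratios and the 10,000 IOPS figure are invented for illustration and are not measurements from any array.

```python
def backend_ios(host_ios, metadata_miss_ratio):
    """Backend IOs needed to serve host_ios when some metadata lookups hit disk.

    metadata_miss_ratio is the fraction of host IOs whose metadata is not in
    controller RAM (0.0 means all metadata is held in memory).
    """
    if not 0.0 <= metadata_miss_ratio <= 1.0:
        raise ValueError("metadata_miss_ratio must be between 0 and 1")
    return host_ios + host_ios * metadata_miss_ratio


host_ios = 10_000  # hypothetical workload

for miss_ratio in (0.0, 0.1, 0.3):
    total = backend_ios(host_ios, miss_ratio)
    print(f"miss ratio {miss_ratio:.0%}: {total:,.0f} backend IOs")
```

With all metadata in memory the backend only does the work the hosts asked for; every percentage point of misses is paid for in extra disk IO.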
Other Storage Systems
If we move away from XtremIO and take the EMC VMAX as an example, each VMAX 40k can be scaled out up to 8 engines. Each of these engines has 24 cores of processing power and 256 GB of RAM, so it can be scaled up to about 2 TB of RAM and 192 cores of processing across all 8 engines. There are a maximum of 124 FC front end ports across the 8 engines.
Another example of a very good storage system is SolidFire. SolidFire has a scale-out architecture across multiple nodes and scale-up options for specific workloads. The nodes start from about 64 GB of RAM and go all the way up to 256 GB of RAM.
So here we go: traditional SANs are few and far between today. There are various kinds of companies who, for various reasons, use all kinds of vendors. While #Webscale is taking off quite well, storage systems still have a place in the datacenter. And as long as storage companies keep re-inventing storage systems, they will remain in the datacenter alongside #Webscale.
PS: Before anyone says zoning is not mentioned, I will tackle how to zone in the next blog post, or maybe after I work out how to explain zoning. I am not usually involved in zoning but will find out and blog about it as well.
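As a quick sanity check on those totals, the per-engine figures multiply out as follows. The numbers are the ones quoted in this post; the helper function and its name are only there for illustration.

```python
CORES_PER_ENGINE = 24    # per-engine figures as quoted above
RAM_GB_PER_ENGINE = 256
MAX_ENGINES = 8


def vmax_totals(engines):
    """Return (total cores, total RAM in TB) for a given engine count."""
    if not 1 <= engines <= MAX_ENGINES:
        raise ValueError(f"engines must be between 1 and {MAX_ENGINES}")
    cores = engines * CORES_PER_ENGINE
    ram_tb = engines * RAM_GB_PER_ENGINE / 1024
    return cores, ram_tb


total_cores, total_ram_tb = vmax_totals(MAX_ENGINES)
print(f"{MAX_ENGINES} engines: {total_cores} cores, {total_ram_tb:g} TB RAM")
```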
For actual performance white papers, please visit the appropriate vendor websites.
Disclaimer: These are my personal thoughts and not the representation of any of the vendors mentioned in the blog below. If I have misquoted a technological aspect or misrepresented any vendor, please let me know and I will either remove it or modify it. All information provided here is as per my understanding and are not statements or facts from the vendor, for accurate support statements please consult your respective vendor.
I have been told multiple times before I submitted my VCDX Design and also during VCDX Boot camps that using pre-built Converged Infrastructure or Hyper Converged Infrastructure might be harder to defend. This becomes a constraint in my mind, but definitely not something that’s a show stopper.
To truly architect a solution, one has to know how to architect a solution not just with one technology, but be able to swap the technology or product on the go and still achieve the same outcome. This is especially true in case of a VCDX Design submission, as you will be asked for alternatives to the technology/hardware stack that you have used. If you don’t know what the alternatives are for your solution, you had better read up.
I defended my VCDX design that was based on VCE vBlock, so I speak with experience when it comes to this point. It was hard, but I am not sure it would have been different if it were any other hardware provider. At the end of the day it was only one of the constraints, my solution design met all the requirements of the client and that’s what matters the most. I am sure the same requirements would’ve been met if it was HP, IBM, Hitachi, NetApp or any other solution out there.
This post is not meant to deter people who are working on their VCDX designs based on vBlock or Nutanix or FlexPod or Simplivity or any other vendor. If you don’t know the reasons why your company chose a particular vendor, either speak to the decision maker or read up on proposals put together by the vendor.
You also need to know the pros and cons and possible substitutes for each solution, no matter what platform it is on. Every platform has its own challenges, from the business aspect to the technological aspect. One of the clients where I worked previously was so much against HP and EMC that they never even entertained RFQs from them. The same can be said about all the other vendors.
Let’s get back to the point 🙂
Here are a few points which will help you in defending your design based on CI or HCI:
- Know the pros and cons of each aspect of the solution. Even if you haven’t had the final word on any aspect, know why a particular decision has been made. If you don’t agree with it, raise it and get more clarification.
- Know the alternatives provided by other vendors for the same solution. This will definitely help you broaden your decision making abilities.
- Know the limits of what the CI or HCI can and can’t do (what is supported and what’s not supported).
- A Solution Design should be repeatable and reusable infinitely (within reason). So get the scale up or scale out or up and out decision firm.
- Make changes, the whole premise about having a CI or HCI is a template to start off with. So make the necessary changes (within reason) wherever required.
- Know the operational aspects of the technology you are designing; an architect’s job is not finished after design. If it can’t be implemented successfully, it’s a failed design. (Similar to when a house crumbles, it’s not just the builder who messed up, it’s the architect too.)
- Test all the facets of the solution and document where the outcomes were not as expected. Re-engineer and re-test until you get the desired outcome.
I think it’s true that defending a ‘real world’ design with Converged Infrastructure might be harder. Here are the reasons why:
- There are a lot of moving parts in the new age infrastructure. You will have to design all the components individually and together, so each component has to adhere to the availability requirements both on its own and in combination with the other components of the design.
- You are limited to the hardware options provided by that (Hyper) Converged Infrastructure provider. You are also limited by each vendor’s self-imposed limits. For example, VMware say that you can have 10,000 VMs powered on concurrently; the vendor, having tested the theoretical maximum, might cut it down to, let’s say, 8000 VMs per vCenter.
- You can’t really swap, let’s say, a VNX 5600 for a VMAX in a lower model vBlock, or go from a 2 SSD 4 HDD node to all SSD on the same node on Nutanix (this was more recently announced, so it ‘may’ be possible).
- Scaling out or up or up and out is easy with both CI and HCI but becomes very expensive very fast if there is no control. Adding more hardware is never the solution for application related problems.
- If you didn’t do your capacity planning properly, going back to the project board for more money after the procurement is done is usually not something a project manager wants to do, regardless of what technology you use.
- In addition to this, there are multiple facets of the design that have already been decided, like the Recoverability aspect for example. Nutanix (I think) recommend using Veeam, whereas VCE use EMC RecoverPoint/Avamar/DataDomain products. You might not be aware of the operational processes of either if you have been using NetBackup or TSM. So one should know what facets of the design are being influenced by using these products.
- When something like a vBlock or FlexPod or Nutanix or Simplivity has already been purchased, the business requirements that were given to the sales / pre-sales team don’t necessarily make sense from a technology point of view for a solution design. So the architect always has to go back to the Sales/Pre-Sales guy for confirmation on requirements as opposed to going directly to the customer. This sometimes can be a good thing, but mostly is not.
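The point in the list above about vendor-imposed limits is easy to underestimate when sizing, so here is a small sketch of how it plays out. The 10,000 and 8000 figures are the example numbers from the list; the 20,000 VM estate and the function names are hypothetical.

```python
import math


def effective_limit(platform_max, vendor_tested_max=None):
    """The limit you actually design to: the lower of the two, if both exist."""
    if vendor_tested_max is None:
        return platform_max
    return min(platform_max, vendor_tested_max)


def vcenters_needed(vm_count, limit):
    """Number of vCenter instances required to host vm_count powered-on VMs."""
    return math.ceil(vm_count / limit)


platform_max = 10_000   # powered-on VMs per vCenter, per the example above
vendor_max = 8_000      # the vendor's tested limit, per the example above
vm_count = 20_000       # hypothetical estate size

limit = effective_limit(platform_max, vendor_max)
print(f"Design limit: {limit} VMs per vCenter")
print(f"vCenters by platform maximum: {vcenters_needed(vm_count, platform_max)}")
print(f"vCenters by vendor limit:     {vcenters_needed(vm_count, limit)}")
```

An extra vCenter is not a small change: it has knock-on effects on licensing, management and the availability design, which is why you need to know the vendor’s limits before the design is signed off.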
The advantage of using the prebuilt (hyper) converged infrastructure is that you architect the complete solution before the infrastructure is on the DC floor and everything comes in pre-built, pre-configured and pre-tested. So you are good to go in a few hours and start putting your workloads on the shiny new toy.
There are a lot of things that one needs to consider when architecting a virtualisation or cloud platform. To truly architect a solution, you should be able to achieve not just what’s in the VCDX Blueprint (although the blueprint gets you 70% there) but also ensure that the customer understands the nuances of using the technology day in, day out. If the support staff are not trained in the particular technology, they will have to be trained properly. User Experience is paramount when it comes to Virtualisation or Cloud Solutions; if it’s too hard, you are doing it wrong.
I know this is going to start some very heated debates about platforms and which one has the better technology, but it’s not a post about who’s better; these are my observations on defending a VCDX design when you are not the absolute decision maker on every single facet of the solution.