Category Archives: Storage
I recently read a blog post by Josh Odgers about hardware support requirements, specifically the availability of storage controllers or the lack thereof (link: http://www.joshodgers.com/2014/10/31/hardware-support-contracts-why-24x7-4-hour-onsite-should-no-longer-be-required/). So I wanted to share my experience with storage controller availability and how modern storage systems provide availability as well as performance. I have used examples from a system I have worked on extensively (XtremIO) and from other vendor technology that I have read up on (EMC VMAX and SolidFire).
Nutanix have a very good “Shared Nothing” architecture. But not everyone uses Nutanix (no judgement here). I have also been told that companies mix and match vendors (whoever that is :P). Josh raises a couple of very good points about the legacy storage architecture of having 2 storage controllers processing various workloads in today’s world of high performance, low latency requirements.
There are a few exceptions to the rule above: EMC VMAX, EMC XtremIO and SolidFire. All these storage systems have more than 2 storage controllers, and they all provide scale up and scale out architectures. (Note: If I have missed any other vendors with more than 2 storage controllers, let me know and I will include them in the post.) I think the new age storage systems should not be called SANs, because they are not similar to the age old architecture of SANs providing just shared storage. These days storage systems do so much more than just provide shared storage.
VMware acknowledge this, hence they are introducing vVols, which provide a software definition of the capabilities provided by the storage systems. Hyperconverged is easily the latest technology, and in some cases it is definitely superior to legacy SANs, but it’s not going to replace everything just yet.
Let’s delve deeper into the process. Let’s take the legacy storage architecture and see how it behaves with scale out/up and failure scenarios.
Let’s say a new SAN has been provisioned for a project which has a definite performance requirement, and so it has been commissioned with a limited number of disk arrays. This is an active-active array, so both the controllers are equally used.
As you can see, the IOPS requirement is being met 100%, CPU and memory have an average utilisation of 20-25%. So, this carries on for a few months, when another project starts with more IOPS requirements and so more disk is added (traditional arrays: more IOPS = more disk).
As you can see, the average utilisation of the storage controllers in the SAN has climbed to about 45-50% on both controllers. A few months later, another project kicks off or the current project scope is expanded to include more workloads; you can see where this is going. Let’s say that the controllers are not yet under stress and are happily pedalling along at an average 70% utilisation.
BANG! One of the storage controllers goes down because of someone or something going wrong.
So what’s happened here is that, until the faulty part is replaced, the single surviving controller has to carry the load of both. Two controllers at 70% each add up to 140% of one controller, so the IOPS requirement can’t be met, and CPU and memory utilisation spike so high that processing any more IO becomes impossible.
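To put rough numbers on that failure scenario, here is a minimal sketch of the arithmetic. It assumes the load is spread evenly across controllers and is redistributed evenly when one fails, which is a simplification; the function name and the four-controller case are mine, purely for illustration.

```shell
#!/bin/sh
# Utilisation (%) that each surviving controller has to carry after a failure.
# Assumption: load is evenly spread before the failure and evenly
# redistributed afterwards. Real arrays will not behave this neatly.
# Args: <total controllers> <average utilisation %> <failed controllers>
survivor_util() {
    total=$1; util=$2; failed=$3
    survivors=$((total - failed))
    if [ "$survivors" -le 0 ]; then
        echo "no surviving controllers" >&2
        return 1
    fi
    # Integer ceiling so a fractional overload is never rounded away.
    echo $(( (total * util + survivors - 1) / survivors ))
}

survivor_util 2 70 1   # dual-controller array from the example: prints 140
survivor_util 4 70 1   # hypothetical four-controller cluster: prints 94
```

Anything above 100 means the survivors cannot keep up. With more controllers sharing the load, the same failure costs each survivor far less.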
This is where the new age Storage Systems (see I am not calling them SANs anymore) have the upper hand. Let me explain how.
Let’s take EMC XtremIO as an example. Each XtremIO node consists of the following components (you can also read about XtremIO in my previous blog posts).
XtremIO is made up of 4 components: xBrick, Battery Backup Units, Storage Controllers and an Infiniband switch. The Infiniband switch is only used when there is more than 1 xBrick. Each xBrick node consists of the disk array unit (25 eMLC SSDs) with a total of either 10 or 20 TB of usable storage. That is before all the deduplication and compression algorithms kick in and make the usable space close to 70TB on the 10 TB cluster and 48 TB on the 20 TB xBrick.
You can’t add more disk to the node, if you want to add more disk you HAVE to buy another XtremIO node and add it to the XMS Cluster. When you add more than 1 node to the cluster, you also get an Infiniband switch through which all the storage controllers in the Storage System communicate.
The picture above shows the multiple controllers in the 2 node XtremIO cluster (picture from Jason Nash’s blog). This can be scaled out to 6 node clusters, with no limit on how many clusters you can deploy.
Each Storage Controller has dual 8 core CPUs and 256 GB of RAM. By any measure this is a beast of a controller. All the metadata of the system is stored in memory, so there is never a requirement to spill the metadata onto the SSDs. In the traditional approach, when the storage is expanded with multiple disk trays, the metadata is also written to the spinning disks. This not only makes reads and writes of the metadata slow, it also consumes additional backend IO. When there is a requirement for thousands of IOs, the system just sinks deeper into consuming more IOPS to read and write metadata.
So let’s take the example from above, where more IOPS were required during the second stage of the project lifecycle. If space was also a constraint, the additional XtremIO node doubles the amount of IOPS available and also provides an additional 70TB of logical capacity.
Even though there is still an effect on the surviving Storage Controllers, the IOPS requirement is always met by the new age Storage Systems. This is partly due to specific improvements made to the way metadata is accessed in these new age systems. Let’s look at the way metadata is accessed in traditional systems.
As you can see, metadata is not just in the controller memory but also dispersed across the spinning disks. Regardless of how fast spinning disk is, it’s always going to be slower than getting metadata from RAM.
Now let’s look at how metadata is distributed in XtremIO.
As storage requirements expand, more controllers are added in whose memory metadata is stored.
Other Storage Systems
If we move away from XtremIO and take the EMC VMAX as an example, each VMAX 40k can be scaled out up to 8 engines. Each of these engines has 24 cores of processing power and 256 GB of RAM, so all 8 engines together scale up to about 2 TB of RAM and 192 cores. There are a maximum of 124 FC front end ports across the 8 engines.
Another example of a very good storage system is SolidFire. SolidFire has a scale out architecture across multiple nodes and scale up options for specific workloads. The nodes start from about 64 GB of RAM and go all the way up to 256 GB of RAM.
So here we go: traditional SANs are few and far between today. There are various kinds of companies who, for various reasons, use all kinds of vendors. While #Webscale is taking off quite well, storage systems still have a place in the datacenter. And as long as storage companies keep re-inventing storage systems, they will remain in the datacenter alongside #Webscale.
PS: Before anyone says zoning is not mentioned, I will tackle how to zone in the next blog post, or maybe after I work out how to explain zoning. I am not usually involved in zoning but will find out and blog about it as well.
For actual performance white papers, please visit the appropriate vendor websites.
Traditionally, when storage design was done for VMware environments, a lot of criteria had to be considered. This included:
- Number of Drives
- Speed of Drives
- Number of IOPS per each drive
- RAID penalty
- Write Penalty
- Read Penalty
- Scalability of the Array
But with the advent of all flash arrays (XIO, Pure, Nimble, Violin etc.) a lot of these parameters no longer constrain the storage design for VMware environments. Each of the AFA offerings has its own RAID-like technology, which pretty much guarantees very high resiliency to failure and data loss. Also, with the new kind of flash drives introduced (eMLC, from memory), consumer level SSDs are no longer used in AFAs. So now that the physical limitations of the drives have been eradicated, let’s look at the next steps.
Queue depth is a very misleading constraint, because there are queue depths at each level: LUN, processor, array. So each physical entity (or not so physical, for CNAs and LUNs) has an individual queue depth. How do we address this shortcoming? If a lot of IO is being thrown at the array and it’s not able to process it, the queue is going to fill up.
If the host parameters are not set properly, it will start to fill up the HBA queue depth across the multiple LUNs that it has access to. Some of these parameters can be changed to ensure that the ESXi vmkernel process does things differently when using AFAs.
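One way to reason about how deep those queues need to be is Little’s law: the number of IOs in flight equals IOPS multiplied by latency. The sketch below is only that arithmetic. The function name is made up, and the 250,000 IOPS and 1 ms figures are illustrative inputs, not measurements.

```shell
#!/bin/sh
# Little's law: outstanding IOs = IOPS x latency (in seconds).
# Latency is passed in microseconds to stay in integer arithmetic.
# Args: <target IOPS> <latency in microseconds>
outstanding_ios() {
    iops=$1; latency_us=$2
    echo $(( iops * latency_us / 1000000 ))
}

outstanding_ios 250000 1000   # 250,000 IOPS at 1 ms: prints 250
outstanding_ios 250000 500    # same IOPS at 0.5 ms: prints 125
```

If the queue depths along the path add up to less than that number, the host-side queue becomes the bottleneck before the array does.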
I’ve previously mentioned some parameters that need to be changed for XtremIO. I guess the same would apply for all the AFAs out there. Using ESXi with an AFA and not changing the advanced parameters to take advantage of it is like buying a Ferrari to drive in Melbourne CBD. It only proves that you are an idiot, restricted by the ‘speed limit’.
OK But what about LUNs:
Now to the original question: one big LUN vs multiple smaller LUNs. Each decision has its own advantages and disadvantages. For example, choosing one big LUN can give you the cumulative IOPS available across multiple storage nodes in the AFA. So if one node provides 250,000 IOPS (random workload, 50% read), then adding another node will take that to 500,000 IOPS. That single node provides more IOPS than a fully scaled and filled VNX 7500. That’s a lot of horsepower if you ask me.
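The scaling claim above can be turned around into a sizing question: how many nodes does a given IOPS target need? This is a rough sketch that assumes IOPS scale linearly per node, as described above, and uses the 250,000 per node figure. The function name is mine, and real sizing should come from vendor tooling.

```shell
#!/bin/sh
# Nodes needed to reach a target IOPS figure, assuming linear scaling.
# Args: <target IOPS> <IOPS per node>
nodes_for_iops() {
    target=$1; per_node=$2
    # Integer ceiling: a partial node still means buying a whole one.
    echo $(( (target + per_node - 1) / per_node ))
}

nodes_for_iops 500000 250000   # prints 2
nodes_for_iops 600000 250000   # prints 3
```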
The same can be said for multiple smaller LUNs: each LUN created is spanned (at least in XtremIO, AFAIK) across all the available nodes in the cluster. So you would still get the benefit of an insane amount of IOPS with either decision.
There are other considerations that you will need to take into account when designing storage for VMware. To start with, workload consideration is a good one. Depending on the workload that’s consuming all of these resources, you might want to provide a single big LUN, or the application architecture might force you to use multiple smaller LUNs. One of my customers’ SQL teams is convinced that even on an AFA, the data and the log LUNs have to be separated on ‘spindles’. I explained the lack of spindles and the redundancy/resiliency/availability aspects of an AFA. After a long discussion, it was agreed that there would still be multiple LUNs created, but all of them on the same 2 node XIO array and not across the other 2 x 2 node XIO arrays that are available.
What about DR/SRM:
The DR/SRM strategy doesn’t need to change significantly with AFAs. I have always believed in providing the optimal number of LUNs for SRM for a mixed workload. Some applications might require a separate LUN (for a vApp, for example), while some are happy to co-exist. It also comes down to the application owners: some are adamant that their workloads should be kept separate, while others are happy to co-exist on the same LUN as long as their RTO/RPO requirements are met.
So in short, the answer is “IT DEPENDS”. But my vote goes to multiple medium sized LUNs (10-12TB) :). This will provide the advantages of both big and small LUNs.
What’s your say?
I’d appreciate the comments about this in the blog rather than on twitter, but then again both are social media so doesn’t matter.
Xtreme IO is the newest and fastest (well, EMC say so) All Flash Array in the market. I have been running this in my “lab” for a POC which is quickly turning into a major VDI refresh for one of the clients. Having run through the basics of creating storage, monitoring, alerting etc. in my previous posts, I am going to concentrate on which parameters we need to change in the vSphere world to ensure we get the best performance from Xtreme IO.
The parameters also depend on what version of ESXi you’re using, as Xtreme IO supports ESXi 4.1+.
Without further delay, let’s start.
Adjusting the HBA Queue Depth
We are going to be sending a lot more IO through to the Xtreme IO array than we would to a traditional hybrid array. So we need to ensure that the HBA queue depth allows a lot more IO requests through.
You can find out which module is in use with the command in Step 1.
Step 1: esxcli system module list | grep ql (or grep lpfc for Emulex)
Once you know which module is being used, the commands below can be used to change the HBA queue depth on the server.
QLogic – esxcli system module parameters set -p ql2xmaxqdepth=256 -m qla2xxx (or whatever module the command in Step 1 returned)
Emulex – esxcli system module parameters set -p lpfc0_lun_queue_depth=256 -m lpfc820 (or whatever module the command in Step 1 returned)
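If you have more than a handful of hosts, it helps to generate the right command per module rather than type it by hand. The sketch below only prints the command, it does not run it. The helper name is hypothetical, and the QLogic parameter is spelled `ql2xmaxqdepth` as far as I know from the driver documentation, so check it against your driver version. Module parameter changes normally need a host reboot to take effect.

```shell
#!/bin/sh
# Print (do not execute) the esxcli command that sets the HBA LUN queue depth.
# Assumption: parameter names are as I understand them for the qla2xxx and
# lpfc820 drivers; verify with `esxcli system module parameters list -m <module>`.
# Args: <module name from Step 1> <queue depth>
hba_qdepth_cmd() {
    module=$1; depth=$2
    case "$module" in
        qla*)  param="ql2xmaxqdepth" ;;
        lpfc*) param="lpfc0_lun_queue_depth" ;;
        *)     echo "unknown HBA module: $module" >&2; return 1 ;;
    esac
    echo "esxcli system module parameters set -p ${param}=${depth} -m ${module}"
}

hba_qdepth_cmd qla2xxx 256
hba_qdepth_cmd lpfc820 256
```

Review the printed line, then paste it into the ESXi shell yourself. Keeping it as a dry run avoids changing a module parameter on the wrong host.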
If you are not going to use PowerPath and are using NMP instead, use Round Robin, since this is an active-active array with X number of controllers (yeah, I know it’s got 2 controllers per disk shelf, and as of today you can scale up to 6 disk shelves per cluster, so 12 controllers).
The engineers who work with Xtreme IO recommend that the default number of iops be changed from 1000 to 1, yes “ONE”. So essentially you are sending an IO request to each controller in the cluster. I haven’t really seen any improvement in the performance by doing so but it is only a recommendation at the end of the day. If you see that you are not going to achieve any significant performance by doing so, the onus is on you to make that decision.
First, let’s get all the volumes that have been configured on Xtreme IO.
esxcli storage nmp path list | grep XtremIO
This will give you the naa.id of all the volumes that are running on XtremeIO. (The grep pattern has to match the vendor string the array reports, which is XtremIO.)
Now let’s set the policy to RR for those volumes.
esxcli storage nmp device set --device <naa.id> --psp VMW_PSP_RR (5.x)
esxcli nmp device setpolicy --device <naa.id> --psp VMW_PSP_RR (4.1)
You can also set the default path selection policy for any storage in 5.x by identifying the SATP and modifying it with the command
esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=<your_SATP_name>
To set the number of IOs to 1 in RR,
esxcli storage nmp psp roundrobin deviceconfig set -d <naa.id> --iops 1 --type iops (5.x)
esxcli nmp roundrobin setconfig --device=<naa.id> --iops=1 --type iops (4.1)
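With more than a few volumes, doing this one naa.id at a time gets tedious. Below is a minimal sketch that reads device IDs from stdin and prints the two 5.x commands for each. It is a dry run that executes nothing, the function name is mine, and the naa IDs in the example are made up.

```shell
#!/bin/sh
# Emit the ESXi 5.x commands that switch each device to Round Robin with
# an IO operation limit of 1. Reads one device ID per line from stdin and
# skips anything that does not look like a naa.* identifier.
rr_cmds() {
    while read -r dev; do
        case "$dev" in
            naa.*) ;;
            *) continue ;;
        esac
        echo "esxcli storage nmp device set --device $dev --psp VMW_PSP_RR"
        echo "esxcli storage nmp psp roundrobin deviceconfig set -d $dev --iops 1 --type iops"
    done
}

# Hypothetical device IDs, for illustration only.
printf '%s\n' naa.0000000000000001 naa.0000000000000002 | rr_cmds
```

On a real host you would feed it the IDs you collected earlier, review the output, and only then run the printed commands.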
Of course, if you don’t want to go and change all of this, you can still use PowerPath.
Host Parameters to Change
For best performance we also need to set a couple of disk parameters. You can do this via GUI or the easier way via CLI (preferred).
Using GUI, set the following parameters Disk.SchedNumReqOutstanding to 256 & Disk.SchedQuantum to 64
Note: If you have non Xtreme IO volumes on these hosts, these settings may overstress the controllers of those other arrays and cause performance degradation when communicating with them.
Using the command line in 4.1, set the parameters using
esxcfg-advcfg -s 64 /Disk/SchedQuantum
esxcfg-advcfg -s 256 /Disk/SchedNumReqOutstanding
To check that they have been set correctly, use
esxcfg-advcfg -g /Disk/SchedQuantum
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding
You should also change Disk.DiskMaxIOSize from the default of 32767 to 4096. This is because XtremeIO reads and writes in 4k chunks by default, and that’s how it gets the awesome deduplication ratio.
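If you want to check the values across several hosts, a tiny helper can compare the query output against what you expect. This sketch assumes the query prints a line ending in the value (something like "Value of SchedQuantum is 64"). That output format is my assumption, so confirm it on your own host first. The helper name is hypothetical.

```shell
#!/bin/sh
# Compare the value reported by an advanced-setting query with the expected one.
# Assumption: the query output ends with the value as its last field,
# e.g. "Value of SchedQuantum is 64". Check this on your ESXi build.
# Args: <expected value> <query output>
check_advcfg() {
    expected=$1; output=$2
    actual=$(printf '%s\n' "$output" | awk '{print $NF}')
    if [ "$actual" = "$expected" ]; then
        echo "ok"
    else
        echo "mismatch: expected $expected, got $actual"
        return 1
    fi
}

# Sample output string used here so the sketch runs anywhere.
check_advcfg 64 "Value of SchedQuantum is 64"
```

On a host, the second argument would be the live output of the query command rather than a sample string.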
In ESXi 5.5 you can set SchedNumReqOutstanding per device by using
esxcli storage core device set -d <naa.id> -O 256
From vSphere 5.5 onwards this parameter is set on each volume individually instead of per host; on 5.0/5.1 it is still the host-wide Disk.SchedNumReqOutstanding setting.
vCenter Server Parameters
Depending on the number of xBricks that are configured per cluster, the vCenter server parameter
config.vpxd.ResourceManager.maxCostPerHost needs to be changed. This adjusts the maximum number of full cloning operations.
One xBrick Cluster – 8 (default)
Two xBrick Cluster – 16
Four xBrick Cluster – 32
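The mapping above is small enough to capture in a lookup, which is handy if you script your vCenter configuration. The sketch deliberately returns only the values listed here and refuses to guess for other cluster sizes. The function name is mine.

```shell
#!/bin/sh
# Look up config.vpxd.ResourceManager.maxCostPerHost for a cluster size.
# Only the sizes listed in this post are covered; anything else is an error
# rather than an extrapolation.
# Args: <number of xBricks in the cluster>
max_cost_per_host() {
    case "$1" in
        1) echo 8 ;;
        2) echo 16 ;;
        4) echo 32 ;;
        *) echo "no value listed for $1 xBricks" >&2; return 1 ;;
    esac
}

max_cost_per_host 2   # prints 16
```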
That’s the end of this post. Please feel free to correct me if I’ve got any commands wrong.
Recommendation (as per AusFestivus’ comment): EMC recommend that PP be used for best performance. But it always comes down to the cost constraints and how much the client wants to spend. In my opinion, PP is more like “nice to have for best performance without tinkering”. But if you can keep tinkering and changing things to get the best performance out, you can do without PP.
This is part 2 of the 2 part Xtreme IO Blog post. You can find the first one here.
We will cover the basics of Monitoring and Security in Xtreme IO in this post. Please remember this is not a deep dive of the newest AFA. You should still consult the Official EMC product Documentation for up to date information.
XMS can be locked down to use either local accounts or ldap authenticated accounts. There are default accounts that are pre-configured on the XIO. However, it is possible to change the default passwords of the root, IPMI, tech and admin accounts.
There are 4 user roles that are available on the XMS.
- Technician – Only EMC Technician should use this
- Administrator – All access
- Configuration – Can’t edit, delete or add users
- Read-only – Can view but not change anything
The XtremIO Storage Array supports LDAP user authentication. Once configured for LDAP authentication, the XMS redirects user authentication to the configured LDAP or Active Directory (AD) servers and allows access to authenticated users only. Users’ XMS permissions are defined based on a mapping between the users’ LDAP/AD groups and XMS roles.
The XMS Server LDAP Configuration feature allows using a single or multiple servers for the external users’ authentication for their login to the XMS server.
The LDAP operation is performed once when logging with external user credentials to an XMS server. The XMS server operates as an LDAP client and connects to an LDAP service, running on an external server. The LDAP Search is performed, using the pre-configured LDAP Configuration profile and the external user login credentials.
If the authentication is successful, the external user logs in to the XMS server and accesses the full or limited XMS server functionality (according to the XMS Role that was assigned to the AD user’s Group). The external user’s credentials are saved in the XMS Cache and a new user profile is created in the XMS User Administration configuration. From that point, the external user authentication is performed internally by the XMS server, without connecting to an external server. The XMS server will re-perform the LDAP Search only after the LDAP Configuration cache expires (cache expiration default value is 24 hours) or at the next successful external user login if the external user credentials were removed from the XMS Server User Administration manually.
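The caching behaviour described above boils down to a simple decision. This is my own sketch of that logic, not EMC’s implementation. The function name, the argument layout and the exact boundary at 24 hours are assumptions.

```shell
#!/bin/sh
# Decide whether a login would be answered from the XMS cache or trigger a
# fresh LDAP search. Sketch of the described behaviour only.
# Args: <time cached, epoch seconds> <now, epoch seconds> <removed manually: yes|no>
needs_ldap_search() {
    cached_at=$1; now=$2; removed=$3
    ttl=$((24 * 60 * 60))   # default cache expiration of 24 hours
    if [ "$removed" = "yes" ] || [ $((now - cached_at)) -ge "$ttl" ]; then
        echo "ldap"
    else
        echo "cache"
    fi
}

needs_ldap_search 0 3600 no     # 1 hour after caching: prints cache
needs_ldap_search 0 86400 no    # 24 hours after caching: prints ldap
needs_ldap_search 0 3600 yes    # user removed manually: prints ldap
```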
LDAP user authentication can be configured and managed via either GUI or CLI.
Monitoring can be done at both a physical level and a logical level using the new Xtreme IO (XIO) Management Server (called XMS hereafter). In the current environment, I only have one xBrick for testing, so my XMS is only managing a cluster of 1 xBrick. At this point in time, a single XMS can only manage one cluster (although this might change in the next few code revisions) with a maximum of 8 xBricks. The unofficial word from my colleague in EMC is that this will be updated to support up to 16 xBricks. I have deployed the XMS as a VM. After all, why would anyone want a physical server these days except to run ESXi, right?
Monitoring the physical devices in the XIO cluster is very easy. Click on the “Hardware” link in the application and it will show all the physical components of the cluster (including the Infiniband switches), but since I only have one xBrick, that is all that’s shown.
Hover the mouse over a component and the health status of that component will be shown. This goes down to the level of each disk in the 25 SSD DAE and also the disks in the controllers. So all aspects can be seen either holistically or individually.
We can also check the back side of the unit including the cabling between various components. If we have an Infiniband switch, we can also check the cabling between the controllers and the infiniband switches.
That takes care of the physical monitoring of the components.
Alerts & Events
To look at the alerts and events on the XIO, click on the Alerts & Events link. This will show us all the alerts that are currently unacknowledged on the XIO and also the various events that happened. We can clear the logs if required when diagnosing any problem if it does get filled up.
It is possible to use SMTP, SNMP or syslog to provide alerting and log management. We can do this in the Administration tab, under Notification.
To configure SMTP, we need to enter the following details (Select SMTP) and click Apply
To configure SNMP, enter the Community name and server details and click Apply.
To configure Syslog, enter the syslog server details and click Apply.
This concludes my 2 part Introduction to Xtreme IO. Thank you for reading.
I’ve been running a new POC for XtremeIO AFA in the last couple of weeks. To say that it’s easy to provision storage and manage is an understatement. This hardware (and software) makes every other storage system I have ever used look ancient and sluggish.
The beauty of this All Flash Array is that everything is redundant, starting with the number of controllers, through the fabric or iSCSI connections, to the number of batteries for each xBrick.
We can have up to 8 (going to 16 in the future) xBricks in a cluster and have multiple clusters. But each cluster must be managed by a single XMS, aka Management Server. There is a very good blog post by Jason Nash about the architecture of the XtremeIO AFA.
Now let’s move on to the real stuff.
Adding Initiators (Host HBA)
Assuming that you’ve zoned everything correctly and properly (single initiator zoning), you should be able to add the host initiators in less than 2 steps for each initiator.
1. Logon to the XMS using the admin credentials
2. Click on the Configuration Link. Click on the Add Initiator Group link.
3. In the window that pops open, enter a name of the Group. I’ve called it ESX_VDI.
4. Enter the details of the Initiator, give it a name and select the appropriate Port Address.
5. Follow the same steps to add the other initiators in the group.
6. Click Next and then Finish.
You’ve added the hosts. Now let’s give them some storage to play with.
There is no RAID to configure and no spindle counts to calculate; no calculation at all is required for creating a volume in XtremeIO. Each xBrick gives us a usable capacity of 7TB of physical storage. Logically, thanks to the de-duplication capability of the XtremeIO software, we can get at least 7:1 logical to physical storage, so you can create volumes of around 50 TB or more (7TB x 7 = 49TB). I have successfully run VMs at 10:1 logical to physical before I saw a performance hit on the VMs. Want to know how much that hit was? 1ms. Yes, you read that right. On an overcommit of 10:1 I got a latency of 1ms on my VMs.
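The capacity maths is just physical capacity multiplied by the data reduction ratio. Here is a quick sketch using the 7TB per xBrick figure from above. The function name is mine, and the ratio you actually get depends entirely on your data.

```shell
#!/bin/sh
# Logical capacity (TB) for a given physical capacity and reduction ratio.
# Args: <physical TB per xBrick> <logical:physical ratio> [number of xBricks]
logical_tb() {
    physical=$1; ratio=$2; bricks=${3:-1}
    echo $(( physical * ratio * bricks ))
}

logical_tb 7 7      # one xBrick at 7:1: prints 49
logical_tb 7 10     # one xBrick at 10:1: prints 70
```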
OK, enough rambling. Now let’s see some pictures.
1. On the LHS of the Configuration pane, Click on Add.
2. Give the volume a name and how big you want the volume to be.
3. Click Next. Choose a folder name and Click Finish.
4. That’s it. In my example below, I created a 4TB volume.
5. You can also add multiple volumes with same size and naming convention. Click on Add multiple and follow the same steps.
Now let’s give these volumes to the hosts.
Masking Volumes to the hosts
1. Select the Initiator Group you want to add storage to.
2. Select the volume you want to mask.
3. Click on Map All Button.
4. Give it a LUN ID.
5. Click Apply. You’re done.
There it is. Now logon to the vCenter Server and rescan the cluster(s) to see the new storage. You can then format it as VMFS or whatever you want to.
Next part will cover a little bit of security and monitoring within XtremeIO.
Provisioning VPLEX datastores has always been a feature that was missing in UIM/P. All the provisioning and host “discover” was a manual task. In the latest version of UIM/P, 4.0, VPLEX datastore provisioning is available. This should make UIM/P a lot more usable for enterprise customers.