After a week of geeking out at VMworld, it’s time for a quick recap of key learnings from the show, as well as a detailed look at what Nimble showed off.
Let’s start with new stuff + useful links:
Software Defined ‘X’
Starting with ‘Networking’: Carl and Kit put on a very good show; they made SDN quite simple to understand. Just as the ESXi hypervisor creates virtual entities such as vCPU, vNIC, vSCSI emulation, and VMDK, NSX creates an abstraction layer to provide L2-L7 services for anything networking. Decoupling network identity, and the services tied to that identity, from physical hardware means policies follow the virtual machine, within the same datacenter or across datacenters. More importantly, this abstraction allows for policy-based provisioning and QoS enforcement, all with automation. I can already imagine how much easier it becomes to deploy a multi-tier application spanning multiple VLANs, plus the need for firewall, load balancing, and NAT’ing (of course, that is after you have made the right level of investment in hardware and software, and a thorough design of the SDN infrastructure). Simply put, I can’t wait to try this out in the datacenter!
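To make the “policies follow the VM” idea concrete, here is a minimal sketch in Python (purely illustrative; the class and field names are my own invention, not the NSX API): the network services are attached to the VM’s identity as data, so a migration changes only the location.

```python
# Hypothetical sketch, NOT the NSX API: network services defined as data tied
# to the VM's identity, so they travel with the VM instead of living in
# physical switch configuration.
from dataclasses import dataclass, field

@dataclass
class NetworkPolicy:
    """L2-L7 services bound to a VM's identity rather than a physical port."""
    vlan: int
    firewall_rules: list = field(default_factory=list)
    load_balanced: bool = False
    nat: bool = False

@dataclass
class VirtualMachine:
    name: str
    datacenter: str
    policy: NetworkPolicy

def migrate(vm: VirtualMachine, target_dc: str) -> VirtualMachine:
    """Moving the VM changes its location; its network policy is untouched."""
    vm.datacenter = target_dc
    return vm

web = VirtualMachine(
    name="web01",
    datacenter="dc-east",
    policy=NetworkPolicy(vlan=100, firewall_rules=["allow tcp/443"],
                         load_balanced=True),
)
migrate(web, "dc-west")
print(web.datacenter, web.policy.vlan)  # dc-west 100: policy followed the VM
```

The point of the sketch is the decoupling: nothing in `migrate` has to touch VLANs, firewall rules, or load-balancer state.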
Now let’s talk more about storage, Software Defined Storage (SDS) – this is an area where you need to put in more thought. There are so many storage vendors touting their solution as ‘software defined’. I can summarize the legacy vendors’ pitch with the following picture:
Don’t get me wrong – I love raisins, and they are great for your body. When evaluating whether a storage solution is truly ‘software defined’, try asking these qualifying questions (remember, nothing matters if the underlying foundation of the architecture is crap – when the foundation is crap, any ‘software’ built on top of it is CRAP):
- does the software provide proven high availability (not just a claim on the vendor’s website that says 5 9’s; it has to be based on ACTUAL customer deployments)
- can you upgrade non-disruptively for capacity AND performance? (without this, it doesn’t matter how much automation you build on top – when it’s time to upgrade, you suffer an outage, PERIOD)
- does software upgrade boost storage performance? (if performance improvement only comes with better hardware, then what good is the software?)
- is the software capable of multi-threading? (you might be thinking WTF? You’ll be surprised on this one: most legacy storage vendors’ OSes were optimized for single-threaded platforms, as they were written years ago)
- does the software provide basic data services with efficiency? (data reduction such as dedupe/compression; what is the performance impact after these features are ENABLED?) Remember, dedupe doesn’t come for free; the best way to dedupe is not to duplicate the data in the first place (i.e., redirect-on-write snapshots should be a ‘must-have’)
- does the software provide a platform for supportability automation? (meaning auto support case creation during hardware component failures; ability to look at health, risk and efficiency of the solution through software)
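On the first question above, it helps to translate an availability claim into actual downtime. A quick sanity-math sketch:

```python
# What an availability claim actually buys you: convert "N nines" into
# minutes of downtime per year. Useful when pressing a vendor on a
# "five nines" marketing claim.
def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * 365 * 24 * 60

print(round(downtime_minutes_per_year(3), 1))  # 99.9%   -> ~525.6 min/year
print(round(downtime_minutes_per_year(5), 1))  # 99.999% -> ~5.3 min/year
```

The gap between three nines and five nines is nearly nine hours a year, which is why the claim should be backed by actual customer deployments, not a website banner.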
VMware has built a very solid foundation with vSphere (high availability, great performance, extensibility) so I am confident that vSAN (VMware’s future SDS solution) will have its place for certain workloads (big data, dev/test environments that have spare compute capacity).
vSphere 5.5 (what’s new)
Here’s a quick summary of the enhancements in the base vSphere hypervisor platform (remember, the solutions layered on top are NOTHING if the underlying foundation isn’t solid)
Good VMware paper with key feature enhancements
My favorite enhancements are:
- placement of ESXi services on more reliable regions of memory (supporting Reliable Memory Technology). Perhaps in the future we could specify a policy for business-critical application VMs to be pinned to such memory regions?
- vCenter Server Linux Appliance can now support up to 500 hosts & 5,000 VMs (wow, time to save some money and get rid of the Windows server running SQL :) )
- vSphere App HA – ability to monitor and take action on business-critical apps running inside the VM (finally, no 3rd-party integration needed to monitor the health of specific apps such as Exchange/SQL; you can also set specific thresholds for how long App HA waits for a service restart, and the number of times to restart the application service; last but not least, vCenter Server itself can be monitored as well!)
- Big Data Extensions – pretty neat plugin for the vSphere web client to instantiate a Hadoop platform (even supports HBase deployments); this plus vSAN could be an interesting combination to try out
- Exciting storage enhancements include: 1) greater than 2TB VMDKs – 62TB is the new maximum; yet another reason to stick with VMDK and not use in-guest mounted iSCSI storage! 2) auto removal of devices in PDL (permanent device loss) state – nicely done; if the device has been removed from the array side, there’s no reason to keep it around 3) MSCS now finally supports iSCSI!! Another reason to ditch Fibre Channel for MSCS; better yet, MSCS now supports PSP_RoundRobin (which is a superior PSP policy) 4) esxcli storage vmfs unmap – you can now specify the space to reclaim as a number of blocks (rather than a % value like before); you can also reclaim space in iterations rather than an all-or-nothing approach
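The iterative-unmap behavior in item 4 is easy to picture with a small sketch (my own illustration of the idea, not VMware code): rather than reclaiming all free space in one long-running operation, the reclaim is issued in fixed-size chunks of blocks.

```python
# Sketch of the "reclaim in iterations" idea behind specifying a block count
# for esxcli storage vmfs unmap: free space is reclaimed in fixed-size
# passes instead of one all-or-nothing operation.
def unmap_iterations(free_blocks: int, blocks_per_pass: int):
    """Yield the number of blocks reclaimed on each pass."""
    remaining = free_blocks
    while remaining > 0:
        chunk = min(blocks_per_pass, remaining)
        yield chunk
        remaining -= chunk

# e.g. 2500 free blocks reclaimed 1000 at a time -> three passes
print(list(unmap_iterations(2500, 1000)))  # [1000, 1000, 500]
```

Smaller passes mean each UNMAP operation holds the array’s attention for less time, which is the practical win over the old all-at-once reclaim.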
Here are some cool demos we showcased in our booth:
New vCenter Plugin
Key enhancements are:
- Support for CHAP
- Creates initiator groups for the ESX server (instead of granting access to ALL)
- Automatically determines the VMFS version based on ESX versions (for ESXi5, option to select block size for VMFS volume is greyed out)
- Automatically applies the correct performance policy (the ‘ESX’ policy)
- Ensures the proper PSP settings (PSP_RR)
- Configures iSCSI as needed (enables software iSCSI)
- Point & Click Snapshot schedule during volume creation
- Automatically rescans each ESXi host and expands the VMFS volume after growing the volume
- Built-in RBAC
vVOL (Virtual Volume)
Phase 2 of the virtualization journey (virtualizing business-critical apps such as Exchange/SQL/SharePoint)
CLICK ME to find the reference architecture
This reference architecture document highlights best practices jointly written by Alex Fontana from VMware (Exchange/SQL/SharePoint guru) and myself. Irrespective of your deployment’s size, those fundamental best practices still apply. Even if you don’t use Nimble Storage, take a look, because the vSphere/UCS/networking practices are universal!
esxtop vBrownBag session
I also hosted a DR customer panel with four of Nimble’s own customers (Jerry Yang from Foster Pepper, Bryan Bond from Siemens e-meter, Kevin Duesterdick from VirtaCore, Jeff Winters from City of Hot Springs). We shared the following nuggets, jointly composed with Ken Werneburg (@vmken) from VMware:
- Do a business impact analysis first to determine what needs to fail over together, then do your storage layout on those principles rather than the traditional “fill it until it’s full then fill another one” approach.
- Test often. Frequent testing reduces configuration drift between test cycles leading to more reliable DR.
- Don’t fail over required infrastructure like AD/DNS/AV/etc. Those things should be part of any datacenter, even a warm standby. Many services don’t handle failing over to new IPs well, and if they don’t run during the failover then everything else is impacted (configure AD servers at each site as global catalog servers so they are always in sync)
- Avoid re-IP if you can. Stretched VLANs, OTV, or GTM/LTM are a good investment. Often there are hard-coded IP addresses we just don’t know about buried in lines of code.
- don’t waste $$ & bandwidth replicating .vswp files – place them on a non-replicated datastore
- Make sure scripts don’t block recovery indefinitely
- don’t overlook the importance of the vCPUs assigned to the vCenter + SRM servers; adjust according to the number of VMs you protect (under 100 VMs, 2 vCPUs are sufficient, but look to expand that to 4 if there are hundreds of VMs); click here for more details on VMware SRM performance best practices
- make sure VMware Tools is installed and up to date, as the VM OS heartbeat depends on it
- failback is not a single operation: after reprotect, run a test failover first, followed by a planned migration back to the original site, THEN perform a test failover again from the original site to the DR site
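The vCPU-sizing nugget above boils down to a simple rule of thumb. A tiny helper makes the thresholds explicit (my interpretation of the guidance, not an official VMware formula):

```python
# Rough sizing helper for vCenter + SRM vCPUs based on the number of
# protected VMs. My interpretation of the rule of thumb above, not an
# official VMware formula: 2 vCPUs under 100 VMs, 4 for hundreds.
def recommended_vcpus(protected_vms: int) -> int:
    """Return a suggested vCPU count for the vCenter/SRM server."""
    return 2 if protected_vms < 100 else 4

print(recommended_vcpus(80))   # 2
print(recommended_vcpus(400))  # 4
```

For edge cases near the threshold, check the SRM performance best-practices paper linked above rather than trusting a two-line function.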
I personally found the following to be the most useful (and less obvious):
1) You don’t have to remove a site to rename it! Just go to Advanced Settings in SRM and enter a new name (did I mention I removed my sites just to rename them? That was stupid of me)
2) Don’t you get annoyed when you see a ‘snap-#@%@#%’ label on datastores you have performed planned migration & failover on? Well, that can be fixed through ‘Advanced Settings’ within SRM
In summary, new technology is exciting and it’s fun to try out. Always keep the following principles in mind: What problem/pain point does it address? Is it built on top of a solid foundation? Don’t let “software defined” storage vendors off easy on their claims – press them with the basic questions I listed above.