Broken Security

I’ve been using the Windows Optimization Guide for View Desktops on the VMware website for a long time. Hidden inside the PDF are some text file attachments that, when converted to .bat, run through and disable most of the functions that bloat virtual desktop linked clones or are totally unnecessary when accessed from a thin client or mobile device. However, around October of last year during a customer engagement I noticed the PDF had been updated with a revised version. That version has caused me a lot of headaches.

After running the revised scripts, I was basically left with broken templates. Internet Explorer would no longer load. Breaking Internet Explorer sort of makes me look like an idiot after I deploy entire pools of desktops and companies can’t use them to run their corporate webapps.

I’d never gotten around to figuring out exactly what caused this issue, and because of it I’d been using a modified version of an older script during my engagements. However, during a View implementation this week I was unable to find that older copy, so I decided to figure out what made this new script such a pain.

ASLR

Address space layout randomization (ASLR) is a computer security technique involved in protection from buffer overflow attacks. In order to prevent an attacker from reliably jumping to a particular exploited function in memory (for example), ASLR involves randomly arranging the positions of key data areas of a program, including the base of the executable and the positions of the stack, heap, and libraries, in a process’s address space. (Wikipedia)

ASLR was a feature added to Windows starting with Vista. It’s present in Linux and Mac OS X as well. For reasons unknown, the VMware scripts disable ASLR. Specifically, it’s done with this registry command:

reg ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v MoveImages /t REG_DWORD /d 0x0 /f

Internet Explorer will not run with ASLR turned off. After further testing, neither will Adobe Reader. Two programs that are major targets for security exploits refuse to run with ASLR turned off.
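If you’ve already run the script against a template, the fix should be as simple as removing the override, since with the value absent Windows reverts to its default of ASLR enabled. That’s my reading of the value’s documented default behavior, so test it on a clone first, and note that a reboot is required for the change to take effect:

reg DELETE "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v MoveImages /f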

The “problem” with ASLR in a virtual environment is that it makes transparent memory page sharing less efficient. How much less? That’s debatable and dependent on workload. It might gain you a handful of extra virtual machines on a host, at the expense of a valuable security feature of the operating system.

For some reason, whoever created the script at VMware decided that disabling it is best practice.

Or do they?

I actually can’t find anywhere else in the document that says ASLR should be disabled. It’s not even listed in the table of all the changes the script makes, yet the command referenced above appears under the “changes since last version” section. I also can’t find anything else on VMware’s site that says it should be disabled. Actually, I found information to the contrary.

Back in 2011, a VMware blog entry by Eric Horschman specifically called out this issue and clarified that it is not recommended to disable ASLR in a general sense.

The same is true of André Leibovici (previously an Architect in the Office of the CTO, End User Computing at VMware, now with Nutanix, and someone I consider to be a virtual desktop expert), who on his site myvirtualcloud.net back in 2011 had this to say about ASLR, specifically in VDI:

Is it a good practice to disable ASLR? The short answer is No. Unless you are pushing very high levels of memory overcommit in a 32-bit desktop VDI environment, you have a lot more to lose than to gain from disabling ASLR. On 64-bit platforms the loss of opportunities to share pages is much less due to the large memory page nature.

So how did this get added to the standard optimization script? Given VMware’s public position to the contrary, I assume it’s there by mistake. I actually notified VMware back in October that the script was breaking Internet Explorer, but the cause apparently was never isolated, or possibly never investigated.

(The revised scripts also previously contained a bunch of curly ‘ and “ characters that caused most of the commands to fail. This has since been corrected.)

Sadly, the reason Eric and André even brought this up in 2011 was Microsoft. In a couple of Microsoft blog entries (1/2), they started spreading some FUD by suggesting that VMware was telling customers to disable ASLR.

The reality was that the topic was addressed to say that yes, you can increase consolidation ratios by turning off ASLR, but at the expense of security. There was a bit of back and forth from some of the VMware folks suggesting that Microsoft’s implementation of ASLR isn’t even all that effective at mitigating malware infections. I won’t get into that.

Regardless, it’s a security feature of the operating system, and in the case of the applications referenced above, disabling it totally breaks functionality. Hopefully, VMware will correct this soon. In the meantime, I’ll be commenting out this line on all future engagements.

Great Success

There is a bug in the Windows Server 2012 R2 volume license activation wizard: if you don’t change the Key Management Service port setting from “0” to whatever you want it to be (such as the default of 1688) when applying the configuration, you get this absolutely unhelpful success/error message.

The following error has occurred. Please resolve the error and try again. Description: STATUS_SUCCESS

NIC Life

Life is rough for an ESXi network card these days, both pNIC and vNIC. It’s especially bad if you’re using E1000/E1000e adapters in your VMs, or Broadcom network cards, or a combination of both.

And considering Broadcom cards are the built-in pNICs on nearly every piece of server hardware, and the E1000 is the default Windows Server vNIC adapter in VMware, these are two incredibly common configurations. What environment isn’t using a combination of both?

On the physical NIC side, VMware has identified an issue with the tg3 drivers in use since ESX 3.5 that can cause data corruption.

The options for resolution there are to upgrade the Broadcom driver on your hosts, or disable TCP Segmentation Offload on your cards.
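If you’re not sure which driver version your hosts are running, you can check from the ESXi shell. A quick sketch on ESXi 5.x (vmnic0 is a placeholder; check each of your Broadcom uplinks):

# List all physical NICs along with the driver each is using
esxcli network nic list
# Show driver and firmware details for a specific uplink
esxcli network nic get -n vmnic0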

On the virtual NIC side, VMware has identified an issue with the E1000 adapter that causes the purple screen of death on hosts with virtual machines using this adapter on anything running ESXi 5.0, 5.1 or 5.5.

Options for resolution are to convert virtual machines to another driver such as VMXNET3 or disable Receive Side Scaling inside the guest operating system.
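Disabling RSS varies by guest OS, but on recent Windows Server versions it’s a one-liner from an elevated prompt. Test the impact first, since RSS exists for a performance reason:

netsh int tcp set global rss=disabled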

For ESXi 5.1 hosts, Update 2 has been identified as having a fix for this issue, but doing so may introduce its own set of issues.

Again, the workaround is to use something like the VMXNET3 adapter in your virtual machines. You can also install patch ESXi510-201402001 after installing Update 2 to fix the memory leak that causes the second issue.

Unless you can’t do so for a compatibility reason, I would suggest using VMXNET3 as your default vNIC adapter as a best practice. If you have the ability to isolate E1000 virtual machines to a host or subset of hosts within your cluster to prevent a crash from affecting other systems, I would do that as well.
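To figure out how big your E1000 footprint actually is, a PowerCLI sketch along these lines will inventory the adapters across your VMs:

# List every VM network adapter of type e1000/e1000e
Get-VM | Get-NetworkAdapter | Where-Object { $_.Type -like "e1000*" } | Select-Object Parent, Name, Type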

FQDN’ed Up

I ran across this interesting little tidbit in an EMC Support article that I wasn’t aware of previously. Using the fully qualified domain name of the EMC Isilon SMB server for file sharing is necessary for proper load balancing and access:

Always use the fully qualified domain name (FQDN) of a SmartConnect zone when accessing the cluster. If you attempt to use the short name, Windows hosts will attempt to use the NetBios name service (NBNS) to resolve the connection. Because NBNS uses broadcast pings on the network to determine what IP a host is located at, the Windows client will connect to the first node to respond, which might result in client connections not being evenly distributed across the cluster. Additionally, by using the NBNS services, you do not utilize Kerberos for authentication and authorization, and are required to use NTLM (NT LAN Manager) based services, which can lead to permission denied errors.

For the non-Isilon initiated, a SmartConnect zone is how Isilon does load balancing across the various nodes in the cluster. It’s configured as a delegation zone in your DNS that replies with a different IP address corresponding to a physical NIC on an Isilon node. Depending on licensing, it can be configured to reply based on basic round robin, or on connection count, CPU, or network utilization metrics. It’s important that it functions correctly so as not to potentially overload an individual network port, and therefore an individual node, as an entry point into the cluster when accessing data.
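You can watch this behavior from any client: repeated lookups against the zone name (a hypothetical name below) should return different node IPs as the cluster rotates through them.

nslookup isilon.example.com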

The EMC Support article where it’s referenced (emc14003900) is centered around integrating SMB on Isilon with DFS, but I would think the principles are the same for normal user/server UNC addressing.

Even if it’s not, I’d still consider it best practice to use the FQDN.

Round Robin

So you want to set your datastores to Round Robin, but you’ve got multiple hosts, dozens of datastores, and very little time? Just fire up PowerCLI and run this script. Replace “VMCluster” with the name of your cluster. This will change the multipathing policy on each datastore, on each host in the cluster.

Get-Cluster "VMCluster" | Get-VMHost | Get-ScsiLun -LunType disk | Where-Object {$_.MultipathPolicy -ne "RoundRobin"} | Set-ScsiLun -MultipathPolicy "RoundRobin"
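To confirm everything took, a quick follow-up (same placeholder cluster name) can group the LUNs by their current policy:

# Count LUNs per multipathing policy; everything should land under RoundRobin
Get-Cluster "VMCluster" | Get-VMHost | Get-ScsiLun -LunType disk | Group-Object MultipathPolicy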

A great overview of Round Robin vs Fixed multipathing, specifically on vSphere 5.1 and EMC storage, and why you should be using it, can be found over at vElemental.

Using CDP

The other day I was tasked with adding a new VLAN to a customer’s vSphere cluster. The existing network configuration had just the default VM Network, with no trunks or tagged port groups configured. In this case the customer is in the process of adding a few virtual desktops (Citrix, blah) and wanted a separate DHCP scope for those machines.

In order to set up this VLAN I would need to put each host in maintenance mode, reconfigure the physical switch ports providing connectivity to that host from access ports to trunk ports, add tags to the existing VM Network and Service Console, and then provide connectivity to the new VLAN by adding a new port group tagged with that VLAN number.
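On the switch side, the port change itself is straightforward. A hypothetical IOS sketch (interface and VLAN numbers are placeholders; some platforms don’t need the encapsulation line):

! Convert an ESXi uplink from access mode to an 802.1Q trunk
interface GigabitEthernet0/1
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 10,20,30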

(Note: if you need to trunk the connection the Service Console/Management Network uses, change the VLAN tag before you adjust the physical switch port settings. You’ll lose connectivity to the host temporarily until you change the switch port settings.)

I set about trying to determine where each of the physical NIC ports on the hosts was plugged into the core switch. There are a few options to do this:

  1. Hope that the customer has proper documentation of their environment, from the initial setup and any changes that were made, indicating the switch ports. In this case, the customer did not.
  2. Hope that the switch has comments that indicate what is physically connected to it. In this case, there were no comments.
  3. Physically trace out each connection back to the switches. In this case, we were in the middle of a major winter storm in Kansas City, so I was working remotely for the customer.
  4. Use networking commands on the switch to attempt to identify what is plugged into each port.

You might expect that the MAC addresses of the vSwitch’s individual NICs would be listed in the results of a “show mac address-table dynamic” on the switch — except they aren’t. You can see the vNICs this way, but not the pNICs.

If you open the vCenter GUI and go to the Configuration > Networking section, next to each of the physical adapters configured in a vSwitch, you’ll see a blue box. Click on it, and if you’re using Cisco switches (and why wouldn’t you) you’ll see all the data about the switch, port, and configuration of the network port.

You’ll also get these results if you’re running on a UCS chassis against a Nexus switch, but in a slightly different format. With the UCS and other blade chassis type systems you can actually find other ways to determine the switch port you’re connected to, but that’s a topic for another blog post (and once I get more experience on the UCS.)
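If you’d rather pull the same details from the command line, the host can dump its network hints over SSH; the output should include the switch name, port, and VLAN heard on each vmnic:

vim-cmd hostsvc/net/query_networkhint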

What if none of this works?

If all this doesn’t work for you, make sure you’re using Cisco switches. CDP is a proprietary protocol, so your Dell, HP, Juniper, 3Com, Netgear, Trendnet, SuperCheapNet switches probably aren’t going to give you any of this data.

However, as of ESXi 5.0, VMware does support Link Layer Discovery Protocol (LLDP), which is the IEEE standardized version of CDP. The problem is they only support it with Distributed vSwitches, which requires Enterprise Plus licensing. A lot of the environments I work in either don’t have that licensing and/or have not adopted Distributed vSwitches. For reasons unknown, VMware does not support LLDP on regular vSwitches. (For more information on how to use LLDP check out Ivo Beerens’ post.)

If you’ve got Cisco equipment, but it’s still not working, make sure CDP is enabled on your hosts. As of ESX 3.5, it should be by default but it may have been disabled. For more information on how to troubleshoot this check out VMware KB1003885.
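Checking and changing the CDP mode on a standard vSwitch can be done from the host shell, per that same KB (vSwitch0 as the example):

# Show the current CDP setting for the vSwitch
esxcfg-vswitch -b vSwitch0
# Set CDP to both listen and advertise
esxcfg-vswitch -B both vSwitch0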

Host Memory

Memory utilization is important in VMware; most of the time it’s the most limiting factor in the virtual-to-physical consolidation ratio. Oftentimes I’m tasked with assessing how upgradable a physical host’s current memory configuration is. It’s easy to see from the vSphere Client how much memory is installed in a host, but when you’re upgrading you need to know exactly how that memory is laid out on your motherboard so you can get the most bang for your buck.

There are basically three ways to do this:

  1. Open up the case and look. This is going to require downtime (because you wouldn’t open the case while you’re running production systems, right?). That’s all well and good, because you can just vMotion your virtual machines to another host and shut it down. The problem is, if you’re having memory utilization issues, chances are you’re overcommitting on your hosts, so you’re going to need to shut down virtual machines to do this.
  2. Use an out-of-band management utility like DRAC or iLO. Great if your server has them configured, but a lot of people either don’t realize they have these or don’t bother to set them up until someone points out how useful they are. Configuring them usually requires a reboot of the host, which means downtime, and I just explained why that’s probably not great in this situation.
  3. SSH into your hosts and run a couple of commands. This is what I’m going to explain how to do.

Everything I’m going to show you is documented from the VMware KB. If you’d rather refer to those go here for ESXi 4.x/5.x or go here for ESX 3.x/4.x. Make sure you know what version you’re checking, so you can use the right commands.

ESXi 4.x/5.x

The first thing you’ll need to do is enable SSH on your hosts. Best practice is to leave SSH off and only turn it on when you need it. You can enable it by opening up the vSphere Client, selecting the Hosts and Clusters view, and then selecting the host you want to enable SSH on in the left hand window. Select the Configuration tab, and then Security Profile from the options on the left. Under Services you’ll see SSH. Click on Properties, select SSH from the list of services, and then press Options. In the window, press Start to enable the SSH service. Leave the settings that ask about starting this service automatically set to manual.

For security, you don’t want SSH turned on all the time, and you’ll get warnings from each host it’s enabled on if you leave it running. When we’re done, you’ll want to come back here and disable SSH on your host. (Note: If you’ve previously closed port 22 on your ESXi firewall, you’ll need to open that back up. By default the port is open but the service is not running.)
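If you’d rather not click through the GUI for every host, a PowerCLI sketch (the host name is a placeholder) can toggle the service:

# Start the SSH service (key TSM-SSH) on a host
Get-VMHost "esx01.example.com" | Get-VMHostService | Where-Object { $_.Key -eq "TSM-SSH" } | Start-VMHostService
# Stop it again when you're finished
Get-VMHost "esx01.example.com" | Get-VMHostService | Where-Object { $_.Key -eq "TSM-SSH" } | Stop-VMHostService -Confirm:$false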

At this point you need to SSH into your host as root. Keep in mind unless you joined your ESXi box to your Active Directory domain, you probably can’t just use your normal network account to get into the host this way. It’s going to be root or another local account you’ve created.

If you’re on Windows, I suggest using Putty. If you’re on a Mac or Linux box, no need to download anything extra as it’s all built in. Just open up Terminal and away you go.

(I’m normally a Mac user, but I access my work demo lab through a Windows 7 virtual machine running on VMware View. So here are the results from Putty.)

The first thing you’ll want to do is navigate to a location you can easily access through the vSphere datastore browser. The reason is that we’re going to run a command and output the results to a text file so we can easily get the information we want. I suggest using a local disk on the host, an ISO/template datastore, or maybe a shared datastore that you use for things like dumping host logs. The output file is going to be just a few MBs, so it’s not really critical as long as it’s easily accessible. When we’re done we’re going to delete it from the host.

cd /vmfs/volumes/YOUR_DATASTORE

You’ll notice that the result of this command is that your current directory changes to something like this: /vmfs/volumes/4ea066d9-d9f09a90-c026-0025b5aa002c. This is normal. Do not be alarmed.

At this point we’re going to run the command that will query the system for all the physical hardware, and export it to a text file.

cim-diagnostic.sh > YOUR_SERVER_NAME.txt

You can call the file after the > whatever you want. Most of the time I keep it unique because I’m going to be doing this command on multiple systems and want to easily identify which one it came from.

At this point you can go back to the vSphere Client and open up the Datastore Browser on the datastore you ran the command on. You can get to this easily by clicking on the host in Host and Clusters and then under the Summary page, right clicking on the datastore listing and then Browse Datastore.

Use the Datastore Browser to download the file to your desktop. (Right click file > Download)

Now the problem with this file is that Notepad doesn’t know how to handle the way ESXi outputs the file, so when you open it up it looks a little something like this:

I would suggest opening the file in something like Notepad++, which is far more useful and reads the file correctly. It’s also helpful for other VMware logs that don’t save whitespace in a way Notepad likes. (Note: Mac users can open the file in TextEdit just fine.)

Run a search within the document and find the section that starts with Dumping instances of CIM_PhysicalMemory. You’ll see the first entry as Tag = 32.0, and if you scroll all the way down through the section it’ll continue until it runs out of memory slots. For instance, the server I ran my export on is a Cisco UCS B250 with 46 memory slots, so the last entry will be 32.45.

The key bits of information here are things like MaxMemorySpeed and Capacity if you’re trying to figure out what to buy. Capacity is listed in bytes, so 4294967296 is going to be a 4GB DIMM. There is also lots of other good information in the export, such as the position of the DIMM on the motherboard, the node and channel the memory is utilized by, whether the slot is even in use, and things like serial numbers and part numbers.
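If you don’t want to do the bytes-to-gigabytes conversion in your head, PowerShell’s built-in size constants make it a one-liner:

# 4294967296 bytes divided by 1GB (1073741824 bytes) returns 4
4294967296 / 1GB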

At this point you can delete the file from the host, if you choose, either by utilizing the Datastore Browser or at the SSH session you may still have open.

rm YOUR_SERVER_NAME.txt

Now you can close your SSH session, and turn SSH back off on your host in the same section where you previously turned it on.

ESX 3.x/4.x

The method for obtaining this information on ESX is similar to the ESXi method explained above; the only real difference is that the command is different and the output file isn’t as detailed (although it’s much easier to read).

The first thing we’re going to need to do is enable SSH on the host. On ESX 3.x/4.x, the SSH service is running but does not allow root logins by default. Non-root users are able to log in with SSH and can then elevate to the root user. As an alternative to enabling SSH on your host, you can physically log in to the console of the host and run the commands there.

From VMware KB 8375637:

If you do not have any other users on the ESX host, you can create a new user by connecting directly to the ESX host with VMware Infrastructure (VI) or vSphere Client. Go to the Users & Groups tab, right-click on the Users list and select Add to open the Add New User dialog. Ensure that the Grant shell access to this user option is selected. These options are only available when connecting to the ESX host directly. They are not available if connecting to vCenter Server.

If you’re on Windows, I suggest using Putty. If you’re on a Mac or Linux box, no need to download anything extra as it’s all built in. Just open up Terminal and away you go.

(I’m normally a Mac user, but I access my work demo lab through a Windows 7 virtual machine running on VMware View. So here are the results from Putty.)

After logging in to your host with your regular user account, we need to elevate to the root user:

su -

You’ll be prompted for your root password. Enter it now.

The first thing you’ll want to do is navigate to a location you can easily access through the vSphere datastore browser. The reason is that we’re going to run a command and output the results to a text file so we can easily get the information we want. I suggest using a local disk on the host, an ISO/template datastore, or maybe a shared datastore that you use for things like dumping host logs. The output file is going to be just a few MBs, so it’s not really critical as long as it’s easily accessible. When we’re done we’re going to delete it from the host.

cd /vmfs/volumes/YOUR_DATASTORE

You’ll notice that the result of this command is that your current directory changes to something like this: /vmfs/volumes/4ea066d9-d9f09a90-c026-0025b5aa002c. This is normal. Do not be alarmed.

At this point we’re going to run the command that will query the system for all the physical hardware, and export it to a text file.

smbiosDump > YOUR_SERVER_NAME.txt

You can call the file after the > whatever you want. Most of the time I keep it unique because I’m going to be doing this command on multiple systems and want to easily identify which one it came from.

At this point you can go back to the vSphere Client and open up the Datastore Browser on the datastore you ran the command on. You can get to this easily by clicking on the host in Host and Clusters and then under the Summary page, right clicking on the datastore listing and then Browse Datastore.

Use the Datastore Browser to download the file to your desktop. (Right click file > Download)

Run a search within the document and find the section that starts with Physical Memory Array. You should see a summary that lists how many slots the system has, as well as the maximum memory size. Then there will be an entry listed for each memory slot. For instance, on the Dell R710 I ran an export on, there were 18 slots for a maximum of 192GB. If there is memory installed in a slot you’ll see the size of the DIMM; otherwise you’ll see No Module Installed under size.

At this point you can delete the file from the host, if you choose, either by utilizing the Datastore Browser or at the SSH session you may still have open.

rm YOUR_SERVER_NAME.txt

Now you can close your SSH session.

View Borked

I’ve been on a View 5.1 deployment with a customer all week, and part of the project involved deploying VMware vCenter Operations Manager (vCOPS) for View, version 1.01. I’ve done this a couple of times before, and had no issues getting the Linux OVA-based vApp configured. Then when I went to install the View adapter into a Windows VM, I got a strange message about how the installer was a 32-bit application and not able to run on a 64-bit system.

Two things wrong with this:

  1. Normally 32-bit apps run on 64-bit operating systems, unless they’re specifically configured not to.
  2. vCOPS for View is a 64-bit application, with a 64-bit installer. The system requirements state it can only run on Windows 2008 R2 or Windows 2003 R2 64-bit.

After playing around with the 1.01 installer, and successfully downloading and starting the 1.0 installer on the same system, I noticed that the published file size on VMware.com is 22MB, but the 1.01 installer I was downloading was only 16MB. I ran an MD5 checksum on the file and it didn’t match the published checksum on the website either. The file creation date shows sometime in late December, while the published file date is somewhere in early October.
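For reference, verifying a download against a published checksum is a one-liner on Windows; the file name here is a placeholder:

certutil -hashfile vcops-view-installer.exe MD5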

Eventually I was able to find a copy of a previously used 1.01 installer on another system, ran a checksum on it, and it matched the published checksum. I installed the adapter using this file and it worked just fine. The customer’s vCOPS environment is up and running.

I have a support case open with VMware right now letting them know about this issue; hopefully they get it corrected soon. I realize it’s not a particularly popular product compared to something like vSphere or even a View Connection Broker, but it’s hard to see how this could have gone on for nearly a month without someone else noticing.

TL;DR vCOPS for View 1.01 installer on vmware.com is screwed up, I’m working with VMware to get it fixed.

Fake SSD

Notes on tricking VMware into thinking a datastore is actually an SSD. Very useful if you’re in a lab environment and want to test some of the vSphere 5 features that center around flash storage (but don’t have the funds to dedicate to actually having it).

However, it’s also useful if you actually have flash storage in a production environment but for some reason vSphere isn’t recognizing that fact.
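The usual approach, per the vSphere 5 documentation, is an SATP claim rule that tags the device with the enable_ssd option. A sketch from the ESXi shell (the device identifier is a placeholder; pull yours from esxcli storage core device list):

# Add a claim rule tagging the device as SSD
esxcli storage nmp satp rule add --satp VMW_SATP_LOCAL --device mpx.vmhba1:C0:T0:L0 --option=enable_ssd
# Reclaim the device so the rule takes effect
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T0:L0
# Verify: the device should now report Is SSD: true
esxcli storage core device list -d mpx.vmhba1:C0:T0:L0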