L1TF

VMware has published new security advisories, knowledge base articles, updates, and tools in response to newly disclosed speculative-execution vulnerabilities — known collectively as “L1 Terminal Fault” — that affect Intel processors made from roughly 2009 to 2018.

I’m going to outline our response to this issue and attempt to summarize this complex event as best I can. I would highly suggest reading through the linked articles, as they’ll be more extensive and kept up to date.

Because this issue is complex and evolving, consider KB55636 the centralized source of truth from VMware for responding to it.

Like the previously known Meltdown, Rogue System Register Read, and “Lazy FP state restore” vulnerabilities, the “L1 Terminal Fault” vulnerability can be exploited when affected Intel microprocessors speculate beyond an unpermitted data access.

L1TF – VMM (CVE-2018-3646, VMSA-2018-0020)

This is the specific L1TF issue that affects the vSphere/ESXi hypervisor. It has two known attack vectors, both of which need to be mitigated. The first attack vector is mitigated through patches for both vCenter and ESXi.

The second attack vector is mitigated by enabling a new advanced configuration option (hyperthreadingMitigation) included in the updates. However, this option may have a performance impact, so it has not been enabled by default. This limits operational risk by giving you time to analyze the effects prior to enabling it; see the sketch below.
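
If you want to see where a patched host stands, the option can be checked and flipped from the ESXi shell. A minimal sketch based on the esxcli syntax in VMware’s L1TF guidance; verify the exact commands against KB55636 and the KBs it links before relying on this:

esxcli system settings kernel list -o hyperthreadingMitigation   # shows current and default values
esxcli system settings kernel set -s hyperthreadingMitigation -v TRUE   # takes effect after a host reboot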

There are new updates to both vCenter and ESXi that deliver the mitigation to L1TF:

  • vCenter 6.7.0d, 6.5u2c, 6.0u3h, and 5.5u3j
  • ESXi670-20180840x, ESXi650-20180840x, ESXi600-20180840x, and ESXi550-20180840x

There are also new versions of VMware Workstation (14.1.3) and Fusion (10.1.3) which address this issue.

L1TF – OS (CVE-2018-3620)

This is a local privilege escalation which requires base operating system updates for mitigation. Patches are pending for affected VMware appliances. Make sure you contact your operating system vendor(s) (Microsoft, Oracle, Red Hat, etc) for mitigation instructions in guest virtual machines as well.

L1TF – SGX (CVE-2018-3615)

This does not affect VMware products.

Just enough Windows

I’ve not been a true “Windows user” on a daily basis since the glorious afternoon my first MacBook Pro arrived in 2011. That didn’t exactly mean I quit using Windows on that day, but over time I’ve continued to slim down my actual needs of the Windows desktop operating system to the point where now I keep a Windows VM around for “just enough” of the things I need from it.

Windows 10 is a huge advancement over Windows 7, which is where I left off as a PC user, and over these last six years Microsoft has learned a lot from Windows 8.x being such a mess. But Windows 10 is an OS intended for use on everything from 4” smartphones to watercooled gaming rigs with multiple 27” 4K displays.

In this guide I’ve focused on simple methods of stripping out a lot of the things that don’t apply to virtual machine usage, and some of the cruft that’s really only useful on a daily-driver machine. Typically I can reduce the idle memory and disk footprint by about 25% without losing necessary functionality.

These instructions are not all specific to VMware Fusion, but some are. This also isn’t designed to be the “ultimate guide” to Windows 10 performance, space savings, or anything else; it’s a quick and clean way to do most of those things, but it’s not all-encompassing. I think it’s easy for these types of optimization guides to strip Windows down to the point where it’s so lacking it’s almost unusable, or core functions start breaking.

This is a “light” optimization for my usage. It could fit yours as well, if you have similar needs, like running a small collection of utility-type applications, such as a couple of EMC product deployment tools or the old VMware client.


Migrate to VCSA

Last night I did my first customer migration from a Windows based vCenter to the VMware vCenter Server Appliance (VCSA) using the new 6.0 U2M utility.

The customer was previously running vCenter 5.1 GA on a physical HP host with Windows Server 2008 R2. In order to migrate to the VCSA, we first had to do two in-place upgrades of vCenter: from 5.1 GA to 5.1 U3, then from 5.1 U3 to 5.5 U3d. After that, on to the VCSA migration.

Given the length of time the system was running on 5.1 GA code (ouch) and the amount of step upgrades required just to get things cleaned up, there was some cause for nervousness.

I admit, even though I’d read up on it, tested it in a lab, and heard other success stories … I still expected my first try to be kind of a mess.

But, it was not. The entire migration process took around 30 minutes, and was nearly flawless.

I had more issues with the upgrade from 5.1 to 5.5 than anything else during this process. Somewhere during that 5.5 upgrade, the main vCenter component quit communicating with the SSO and inventory services. There were no errors presented during the upgrade, but it resulted in not being able to log in at all through the C# client, and numerous errors after eventually logging in to the Web Client as administrator@vsphere.local.

I tried to run through the KB2093876 workarounds, but was not successful. I ended up needing to uninstall the vCenter Server component, remove the Microsoft ADAM feature from the server, and then reinstall vCenter connected to the previous SQL database. Success.

Given those issues, I was nervous about the migration running into further issues, mostly from the old vCenter.

But again, it worked as advertised.

After the migration I did notice the customer’s domain authentication wasn’t working using the integrated Active Directory computer account. After adjusting the identity provider to use LDAP, it worked fine. I’ve had this happen randomly enough on fresh VCSA installs to think it’s something to do with the customer environment, but I was under the wire to get things back up and felt there was no shame in LDAP.

I’ve done too many new deployments of the VCSA since 5.x to count, and at this point was already pretty well convinced there was no reason for most of my customers to deploy new Windows based vCenters. I’d also done a fair bit of forklift upgrades with old vCenters where we ditch everything to deploy a new VCSA, which isn’t elegant, but it works for my smaller customers that don’t yet have anything like View, vRA, SRM, integrated backups/replication, etc.

Now I’m confident that any existing vCenter can be successfully migrated.

Windows vCenters, physical and virtual: I’m coming for you.

Crashing ESXi with Cisco RAID controllers

Recently I had two VMware Horizon View proof of concept setups for work, where we designed an all-in-one Cisco UCS C240 M4 box, full of local SSD and spindles, in various RAID sets. This lets the customer kick the tires on View in a small setup to see if it’s a good fit for their environment, but on something more substantial than cribbing resources from the production environment.

  • 5x 300GB 10K SAS RAID 5 for Infrastructure VMs (vCenter, View Broker/Composer, etc)
  • 10x 300GB 10K SAS RAID 10 for VM View Linked Clones
  • 6x 240GB SSD RAID 5 for View Replicas
  • 1x hot spare for each drive type
  • VMware ESXi 6.0 U2 is installed on a FlexFlash SD pair

After getting all the basics configured, we had a single View connection broker, plus a View Composer VM with a local SQL Express 2012 instance for the database. Both were version 7.0.2. At the first site the VM base image we attempted to deploy was an optimized Windows 7 x64 instance.

But under any sort of load during a deployment of more than a handful of desktops, the entire box would come to a total stop. In some cases the only way to restore any functionality was to pull the power and restart the infrastructure VMs, one by one. Of course, once the broker and composer instances were connected, they’d attempt to create more desktops and the cycle would continue. In an attempt to isolate the issue, we tried various versions of the VMware Tools, a new Windows 7 x86 image, and I even duplicated the behavior by building a nearly identical View 6.2.3 environment, within the same box.

After digging through the esxtop data as clones were being created, I could see KAVG latency across all RAID sets jump as high as 6000ms right before all disk activity on the system eventually stopped.

It didn’t matter what configuration I tried: it was present with a fresh install of ESXi 6.0 U2 and after applying the latest host patches; it was present on the out-of-box UCS firmware of 2.0(10) and with the stock RAID drivers from the Cisco ISO; it was present after updating the firmware and the drivers. It also happened regardless of whether the RAID controller write-back cache was enabled or disabled for the various groups.

Cisco is very particular about making ESXi drivers for their components match their UCS compatibility matrix, so before I decided to give TAC a call, I made sure (again) that everything matched exactly. TAC ended up reviewing the same logs, to determine if this was a hardware issue, and while they made a couple of suggestions for adjustments, they were not successful in diagnosing a root cause. Yet, they insisted based on what they were seeing that it was not a hardware issue.

With this particular customer, we were also impacted by a variety of issues relating to the health of the DNS and Active Directory environment. With that in mind, we decided to focus on fixing the other environmental issues and in the meantime, not overload the UCS box until a deeper analysis could be done.

Try Try Again

A day or so into the second setup at another customer, I encountered the exact same issue, this time with a Windows 10 x64 image and View 7.0.2. The same crazy latency numbers under any significant load, until the entire box stopped responding.

The physical configuration differed slightly in that we were integrating the C-Series UCS into the customer’s fabric interconnects, so the firmware and driver versions differed even more from the first host, which was a standalone configuration connected to the customer’s network. After digging into it again with a fresh brain, and more perspective, I found the cause.

I started looking through the RAID controller driver details again. In both cases, VMware uses the LSI_MR3 driver as the default for the Cisco 12G RAID (Avago) controller in ESXi 6.0 U2, and in both environments I verified that we were running the suggested driver versions from the Cisco UCS compatibility matrix. So I kept digging at this controller and wondered what VMware suggests for VSAN (keeping in mind we aren’t running VSAN at either site), and sure enough, they DO NOT suggest using the LSI_MR3 driver, but instead list the “legacy” MEGARAID_SAS driver as their recommendation for the exact same controller.

After applying the alternative driver, I’ve not been able to break the systems.

What is odd, is that this appears to be related specifically to Cisco’s version of the controllers.

This week I did a similar host setup (although not for View) using a bunch of local SSD/SAS drives in a Dell PowerEdge R730xd with its 12G PERC H730 RAID cards (which from what I can see appear to be rebranded versions of the same controller), and VMware’s compatibility matrix has the LSI_MR3 driver listed.

I left those drivers enabled, and the customer ran a series of aggressive PostgreSQL benchmarks against the SSD sets, with impressive results and no issues from the host.

So, long story short: if you’re using local RAID sets with the Cisco 12G RAID controller for anything other than basic boot volumes that don’t need serious I/O, you don’t want to use the Cisco-recommended drivers.

Installation instructions

  • Download the new driver (for ESXi 6.0 U2)
  • Extract the .vib file from the driver bundle and copy it to a datastore on the host
  • Enable SSH on the host and connect to it via your terminal application of choice
  • Apply the new driver from the SSH session and disable the old one (see the sketch below)
  • Reboot the host
  • Reconnect via SSH and run esxcli storage core adapter list to verify it’s active

This should show that your RAID controller (typically either vmhba0 or vmhba1) is now using the megaraid_sas driver. If the “UID” is listed as “Unknown” in this readout, that’s normal.
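
Here’s a rough sketch of those SSH steps; the datastore path and VIB filename are placeholders for wherever you copied the driver and whatever the bundle actually extracts to:

# install the megaraid_sas VIB (path and filename are placeholders)
esxcli software vib install -v /vmfs/volumes/YOUR_DATASTORE/scsi-megaraid-sas-VERSION.vib
# stop lsi_mr3 from claiming the controller on the next boot
esxcli system module set --enabled=false --module=lsi_mr3
reboot
# after the host comes back up, confirm megaraid_sas owns the controller
esxcli storage core adapter list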

RPA ‘Factory Reset’

I ran into a situation recently where the need arose to effectively “factory reset” a Generation 5 EMC RecoverPoint Appliance (Gen 5 RPA). In my case, I had one RPA where the local copy of the password database had become corrupted, but the other three appliances in the environment were fine. There was nothing physically wrong with the box; I just wanted to revert it back to new, treat it like a replacement unit from EMC, and rejoin it to the local cluster.

From what I could find, EMC had no documented procedure for this. So after piecing it together from a blog entry and an EMC Communities post (neither of which helped on its own), here it is:

  • Attach a KVM to the failed appliance and reboot.
  • Hit F2 to boot into the system BIOS (the password is emcbios).
  • Under USB settings, Enable Port 60/64 Emulation.
  • Save your settings and reboot the appliance.
  • This time hit Ctrl + G to enter the RAID BIOS.
  • Select the RAID 1 virtual drive and start a Fast Init.
  • Reboot the appliance.
  • Hit F2 to boot back into the system BIOS.
  • Under USB settings, Disable Port 60/64 Emulation.
  • Reboot the appliance and verify that no local OS is installed.
  • Insert the RecoverPoint install CD (the one you burned from the ISO you downloaded from EMC Support) and press Enter to start the install.
  • The installation does not require any user interaction; your appliance will reboot when it’s completed, into a “like new” status.
  • Rejoin the appliance to the cluster using procedures generated from Solve Desktop. (You can ignore instructions about rezoning fibre channel connections, or spoofing WWPNs, since none of this will have changed.)

The key points here are the bits about Port 60/64 Emulation. If you don’t enable it, the RAID BIOS will load to a black screen and take you nowhere. Likewise, if you leave it enabled, your RecoverPoint OS may not install correctly.

Clone VM from snapshot

Have you ever wanted to easily clone a virtual machine from a snapshot, and have the clone reflect the source as it existed at that point in time, as opposed to the current status of the source? Jonathan Medd (@jonathanmedd) has a great PowerCLI script that I found yesterday, to do exactly this.

Copy the contents of his script into a new .ps1 file, save it, and then execute the script within a PowerCLI window to add the function to your session. Then run the new function to create your clones. By default it uses the last snapshot in the chain, but you can request a snapshot by name as explained on his site.

New-VMFromSnapshot -SourceVM VM01 -CloneName "Clone01" -Cluster "Test Cluster" -Datastore "Datastore01"

iCloud Photo Library, continued

My second day transferring my iPhoto library to iCloud Photo Library seems to be going very well. The “optimize storage” feature on the iOS devices is going to save users a ton of space.

Yesterday when I posted my last entry I had a 16GB iPad completely full (roughly 7GB of which was photos). When I returned, all the photos had been uploaded to iCloud, freeing up 5GB of space. No matter what I throw at this (and I have about 19GB of images in iCloud now), the devices sit around 2GB utilized for photo storage.

When photos further back in the catalog that are not currently on the device are accessed, they’re retrieved from the cloud in full resolution.

I’m only about 1/5th of the way through my library. I’ve been doing it in chunks as I have time, because during the upload process I tend to fully saturate my 5Mb upstream home connection.

If you’ve not turned on iCloud Photo Library yet, even if you don’t intend to do as I’m doing and dump everything into it, you’re really missing out.

From iPhoto to iCloud Photo

When I saw the new iCloud Photo Sync demo at WWDC, I was in love.

Photo storage and syncing has been a struggle of mine for a while. I’ve bounced between external drives (which makes accessibility when I’m not at home difficult) and using local storage (which wastes expensive MacBook SSD space) … but never been happy. I’ve switched between Lightroom and Aperture for my “professional” images (AKA those taken with my Nikon DSLR) and mostly used iPhoto for my iPhone-captured images.

The other issue was that 16GB iOS devices fill up quickly these days. So to save space, I would regularly sync my devices back to iPhoto and then delete the photos from my phone, but again, this made accessing older photos difficult on the go.

With iPhone cameras getting better and better, to the point of rivaling my 8-year-old Nikon D200, and with me getting tired of paying for Adobe software updates, I eventually merged everything into iPhoto.

Now, with iOS 8.1, the iCloud Photo Sync beta rollout has begun, but only on iOS devices and via the iCloud website. The previously announced Mac app is slated for early 2015. But I want all my stuff in Apple’s cloud now, accessible on every device.

I figured out how:

  • Make sure you have iCloud Photo Sync enabled on your iOS devices.
  • Open iPhoto, open Finder > AirDrop on your Mac.
  • Open Photos on your iOS device.
  • Drag and drop photos from iPhoto to your iOS device of choice via AirDrop.
  • This triggers automatic sync to iCloud which starts dropping optimized versions all around the place.

I’m currently chugging back through May 1 of this year. I only stopped there because that filled up my iPad with photos, and I want to see how it smashes the used space back down after everything uploads. I could keep going with my iPhone 6, which has another 40GB free, but this is enough experimentation for now.

I’ll also probably have to increase my 20GB iCloud plan to keep going beyond what’s in there now. Once I’ve got things moved off, I’ll be able to get my local copies moved back to external storage and then at some point once the Mac Photos app is released figure out how I want to deal with my local copies again.

I think my iPad will become central to my future editing workflow. I’ve long owned the camera connection kit but never used it. Now it’s going to become the primary injection point for new images taken with the DSLR, or for editing ones taken with the iPhone. (Especially now that Pixelmator for iPad is here!)

View guide, ASLR, no more

A few months ago I wrote about the VMware View optimization script breaking Internet Explorer and Adobe Acrobat through the addition of a registry entry that disabled Address Space Layout Randomization (ASLR):

ASLR was a feature added to Windows starting with Vista. It’s present in Linux and Mac OS X as well. For reasons unknown, the VMware scripts disable ASLR.
Internet Explorer will not run with ASLR turned off. After further testing, neither will Adobe Reader. Two programs that are major targets for security exploits, refuse to run with ASLR turned off.
The “problem” with ASLR in a virtual environment is that it makes transparent memory page sharing less efficient. How much less? That’s debatable and dependent on workload. It might gain a handful of extra virtual machines running on a host, and at the expense of a valuable security feature of the operating system.
For some reason, those who created the script at VMware have decided that they consider it best practice for it to be disabled.

At the VMware Partner Technical Advisory Board on EUC last month, I pointed this out to some VMware people and sent a link to the blog entry.

Over the weekend I got a tip from Thomas Brown over at Varrow that the scripts had been updated.

Today I had an opportunity to download the updated scripts (available here) and was very pleased to see:

rem *** Removed due to issues with IE10, IE11 and Adobe Acrobat 03Jun2014
rem Disable Address space layout randomization
rem reg ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v MoveImages /t REG_DWORD /d 0x0 /f

Success!

As always, please review the rest of the contents to make sure the changes that the script makes are appropriate for your environment.

Cisco Jabber & Persona Management

I just finished up with a customer issue: they had deployed Cisco Jabber along with VMware View, using Persona Management and floating desktops set to refresh at logoff. Much to their annoyance, users would have to reconfigure their Cisco Jabber client with the server connection settings, and any client customizations made were lost, after logging back in to the desktops.

After looking into this, it appeared that the Jabber configuration XML files were not being synced down to the local PC before the Jabber client launched, causing the settings to default back to an unconfigured state. Even though the configuration data stored in jabberLocalConfig.xml was saved to the Persona Management share, it never had a chance to get loaded before it was overwritten.

The issue was resolved by adjusting Persona Management group policies to precache the settings stored on the persona share to the virtual desktop before completing login.

Modify the Persona Management GPO setting “Files and folders to preload” to include the following directory:

AppData\Roaming\Cisco\Unified Communications\Jabber\CSF

Server settings and custom client adjustments are now maintained across desktop sessions. WIN!

Simple guide to datacenter power

Power… it is the only thing that you will find more prevalent in a datacenter than racks, yet many times when discussing upgrades and new installations it’s the part that no one ever mentions. Usually that’s because we:

  • aren’t in charge of the power design (leased building, union, or separate electrical department)
  • have always just used 120v “normal” stuff under 1800 watts
  • aren’t electrical engineers and don’t understand what amps, volts, and watts are
  • don’t understand all of the options for connectors/cords

I’m guilty of these things, especially when I was just an administrator. Since becoming a consultant I’ve had to take a crash course (heh) in things like the differences between a C13 and a NEMA 5-15, 120v vs 208v, etc.

Power always seems to be a major issue on projects these days, especially as more and more customers adopt blade systems like the Cisco UCS. What has really been difficult is that the latest generation of EMC VNX now requires 208v power on the Disk Processor Enclosure (there is a block-only 5200 model that can run on 120v, but you have to order it that way ahead of time; it doesn’t autoswitch by default anymore).

Better understanding by customers is essential.

VMware guidance breaks Windows security

I’ve been using the Windows Optimization Guide for View Desktops on the VMware website for a long time. Hidden inside the PDF are some text file attachments that, when converted to .bat, run through and disable most of the functions that bloat virtual desktop linked clones or are totally unnecessary when accessed from a thin client or mobile device. However, around October of last year during a customer engagement I noticed the PDF had been updated with a revised version. That version has caused me a lot of headaches.

After running the revised scripts, I was basically left with broken templates. Internet Explorer would no longer load. Breaking Internet Explorer sort of makes me look like an idiot after I deploy entire pools of desktops and companies can’t use them to run their corporate webapps.

I’d never gotten around to figuring out exactly what caused this issue, and because of it I’d been using a modified version of an older script during my engagements. However, during a View implementation this week I was unable to find this older copy, so I decided to figure out what made this new script such a pain.

ASLR

Address space layout randomization (ASLR) is a computer security technique involved in protection from buffer overflow attacks. In order to prevent an attacker from reliably jumping to a particular exploited function in memory (for example), ASLR involves randomly arranging the positions of key data areas of a program, including the base of the executable and the positions of the stack, heap, and libraries, in a process’s address space. (Wikipedia)

ASLR was a feature added to Windows starting with Vista. It’s present in Linux and Mac OS X as well. For reasons unknown, the VMware scripts disable ASLR. Specifically, it’s done by this registry entry command:

reg ADD "HKLMSystemCurrentControlSetControlSession ManagerMemory Management" /v MoveImages /t REG_DWORD /d 0x0 /f

Internet Explorer will not run with ASLR turned off. After further testing, neither will Adobe Reader. Two programs that are major targets for security exploits, refuse to run with ASLR turned off.

The “problem” with ASLR in a virtual environment is that it makes transparent memory page sharing less efficient. How much less? That’s debatable and dependent on workload. It might gain a handful of extra virtual machines running on a host, and at the expense of a valuable security feature of the operating system.

For some reason, those who created the script at VMware have decided that they consider it best practice for it to be disabled.

Or do they?

I actually can’t find anywhere else in the document that says that ASLR should be disabled. Even in the table that lists all the changes that are done by the script, it’s not listed, yet under the “changes since last version” the command referenced above is listed. I also can’t find anything else on VMware’s site that says it should be disabled. Actually, I found information to the contrary.

Back in 2011, a VMware blog entry by Eric Horschman specifically called out this issue and clarified that it is not recommended to disable ASLR in a general sense.

The same is true from André Leibovici (previously an Architect in the Office of the CTO End User Computing at VMware, now with Nutanix, and someone I consider to be a virtual desktop expert) who on his site myvirtualcloud.net back in 2011 had this to say about ASLR, specifically in VDI:

Is it a good practice to disable ASLR? The short answer is No. Unless you are pushing very high levels of memory overcommit in a 32-bit desktop VDI environment, you have a lot more to lose than to gain from disabling ASLR. On 64-bit platforms the loss of opportunities to share pages is much less due to the large memory page nature.

So how did this get added to the standard optimization script? Given VMware’s public position that runs contrary to this, I assume it’s there by mistake. I actually notified VMware about the fact that the script was breaking Internet Explorer back in October but it apparently had never been isolated, or possibly never investigated.

(The revised scripts also previously contained a number of incorrect ‘ and “ characters that caused most of the commands in them to fail. That has since been corrected.)

Sadly, the reason why Eric and Andre even brought this up in 2011 was because of Microsoft. In a couple of Microsoft blog entries (1/2) they started spreading some FUD by attempting to say that VMware was suggesting that customers disable ASLR.

The reality was that the topic was addressed to say that yes, you can increase consolidation ratios by turning off ASLR, but at the expense of security. There was a bit of back and forth from some of the VMware folks suggesting that Microsoft’s implementation of ASLR isn’t even all that effective at mitigating malware infections. I won’t get into that.

Regardless, it’s a security feature of the operating system, and in the case of the applications referenced above, one that totally breaks functionality. Hopefully, VMware will correct this soon. In the meantime, I’ll be commenting on this line on all future engagements.

Great worst successful error

There is a bug in the volume license activation wizard in Windows Server 2012 R2: if you don’t change the Key Management Service port setting when applying the configuration (from “0” to whatever you want it to be, such as the default of 1688), you get this absolutely most unhelpful success/error message.

The following error has occurred. Please resolve the error and try again. Description: STATUS_SUCCESS
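
If the wizard keeps fighting you, the port can also be set directly with slmgr from an elevated command prompt on the KMS host (1688 is the standard KMS port; adjust if yours differs):

cscript %windir%\system32\slmgr.vbs /sprt 1688
cscript %windir%\system32\slmgr.vbs /dli

The /dli output on a KMS host includes the listening port, so you can confirm the change took.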

Being a VMware NIC isn’t easy these days

Life is rough for an ESXi network card these days, both pNIC and vNIC. It’s especially bad if you’re using E1000/E1000e adapters in your VMs, or Broadcom network cards, or a combination of both.

And considering Broadcom cards are the built-in pNIC adapters for nearly every piece of server hardware, and the E1000 driver is the default Windows Server vNIC adapter in VMware, these are two incredibly common things to have; what environment isn’t using a combination of both?

On the physical NIC side, VMware has identified an issue with the tg3 drivers in use since ESX 3.5 that can cause data corruption.

The options for resolution there are to upgrade the Broadcom driver on your hosts or disable TCP Segmentation Offload on your cards.
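
For the TSO workaround, the KB is the authority on which settings apply to your driver and build, but as a sketch, hardware TSO can be toggled host-wide with an advanced setting:

esxcli system settings advanced set -o /Net/UseHwTSO -i 0   # disable hardware TSO
esxcli system settings advanced list -o /Net/UseHwTSO       # confirm Int Value is now 0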

On the virtual NIC side, VMware has identified an issue with the E1000 adapter that causes the purple screen of death on hosts with virtual machines using this adapter on anything running ESXi 5.0, 5.1 or 5.5.

Options for resolution are to convert virtual machines to another adapter type such as VMXNET3, or to disable Receive Side Scaling inside the guest operating system.
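
For the guest-side RSS workaround on a Windows VM, one approach (assuming Server 2008 or later in the guest) is to disable it globally from an elevated prompt; you can also do it per-adapter from the NIC’s advanced properties:

netsh int tcp set global rss=disabled
netsh int tcp show global

The second command just echoes the global TCP settings back so you can confirm RSS now shows as disabled.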

For ESXi 5.1 hosts, Update 2 has been identified as containing a fix for this issue, but installing it may introduce its own set of issues.

Again, the workaround is to use something like the VMXNET3 adapter in your virtual machines. You can also install patch ESXi510-201402001 after installing Update 2 to fix the memory leak that causes the second issue.

Unless you can’t do so for a compatibility reason, I would suggest using VMXNET3 as your default vNIC adapter as a best practice. If you have the ability to isolate E1000 virtual machines to a host or subset of hosts within your cluster to prevent a crash from affecting other systems, I would do that as well.

Use FQDNs when doing SMB on Isilon

I ran across this interesting little tidbit in an EMC Support article that I wasn’t aware of previously: using the fully qualified domain name of the EMC Isilon SMB server for file sharing is necessary for proper load balancing and access:

Always use the fully qualified domain name (FQDN) of a SmartConnect zone when accessing the cluster. If you attempt to use the short name, Windows hosts will attempt to use the NetBios name service (NBNS) to resolve the connection. Because NBNS uses broadcast pings on the network to determine what IP a host is located at, the Windows client will connect to the first node to respond, which might result in client connections not being evenly distributed across the cluster. Additionally, by using the NBNS services, you do not utilize Kerberos for authentication and authorization, and are required to use NTLM (NT LAN Manager) based services, which can lead to permission denied errors.

For the non-Isilon initiated, a SmartConnect zone is how Isilon does load balancing across the various nodes in the cluster. It’s configured as a delegation zone in your DNS that replies with a different IP address corresponding to a physical NIC on an Isilon node. Depending on licensing, it can be configured to reply based on basic round robin, or on connection count, CPU, or network utilization metrics. It’s important that it functions correctly so as not to potentially overload an individual network port, and therefore an individual node, as an entry point into the cluster when accessing data.

The EMC Support article where it’s referenced (emc14003900) is centered around integrating SMB on Isilon with DFS, but I would think the principles are the same for normal user/server UNC addressing.

Even if it’s not, I’d still consider it best practice to use the FQDN.
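
To make it concrete, here’s the difference using a hypothetical SmartConnect zone of isilon.example.com with a short name of ISILON:

rem short name: resolved via NBNS broadcast, lands on the first node to answer, NTLM auth
net use Z: \\ISILON\share
rem SmartConnect zone FQDN: DNS-delegated, load balanced across nodes, Kerberos-capable
net use Z: \\isilon.example.com\share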

Set all datastores to round robin using PowerCLI

So you want to set your datastores to Round Robin, but you’ve got multiple hosts, dozens of datastores, and very little time? Just fire up PowerCLI and run this script, replacing “VMCluster” with the name of your cluster. This will change the multipathing policy on each disk, on each host in the cluster.

get-cluster "VMCluster" | Get-VMHost | Get-ScsiLun -LunType disk | Where-Object {$_.MultipathPolicy -ne "RoundRobin"} | Set-ScsiLun -MultipathPolicy "RoundRobin"

A great overview of Round Robin vs Fixed multipathing, specifically on vSphere 5.1 and EMC storage, and why you should be using it, can be found over at vElemental.
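
One follow-up thought: that one-liner only changes LUNs that already exist. If you also want future LUNs to pick up Round Robin automatically, you can change the default PSP for your array’s SATP from the ESXi shell. The SATP named below (VMW_SATP_ALUA_CX, common for EMC VNX/CLARiiON in ALUA mode) is an assumption; list what your host actually uses first:

esxcli storage nmp satp list
esxcli storage nmp satp set --satp=VMW_SATP_ALUA_CX --default-psp=VMW_PSP_RR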

Using CDP with vSphere hosts

The other day I was tasked with adding a new VLAN to a customer’s vSphere cluster. The existing network configuration had just the default VM Network setup, with no trunks or tagged port groups setup. In this case the customer is in the process of adding a few virtual desktops (Citrix, blah) and wanted a separate DHCP scope for those machines.

In order to set up this VLAN I would need to put each host in maintenance mode, reconfigure the physical switch ports providing connectivity to that host from access ports to trunk ports, add tags to the existing VM Network and Service Console, and then provide connectivity to the new VLAN by adding a new port group tagged with that VLAN number.

(Note: if you need to trunk the connection the Service Console/Management Network uses, change the VLAN tag before you adjust the physical switch port settings. You’ll lose connectivity to the host temporarily until you change the switch port settings.)

I set about trying to determine where each of the physical NIC ports on the hosts were plugged into their core switch. There are a few options to do this:

  1. Hope that the customer has proper documentation of their environment, from the initial setup and any changes that were made, indicating the switch ports. In this case, the customer did not.
  2. Hope that the switch has comments that indicate what is physically connected to it. In this case, there were no comments.
  3. Physically trace out each connection back to the switches. In this case, we were in the middle of a major winter storm in Kansas City, so I was working remote for the customer.
  4. Use networking commands on the switch to attempt to identify what is plugged into each port.

You might expect that the MAC addresses of the vSwitch’s individual NICs would be listed in the results of a “show mac address-table dynamic” on the switch — except they aren’t. You can see the vNIC this way, but not the pNICs.

If you open the vCenter GUI and go to the Configuration > Networking section, next to each of the physical adapters configured in a vSwitch, you’ll see a blue box. Click on it, and if you’re using Cisco switches (and why wouldn’t you) you’ll see all the data about the switch, port, and configuration of the network port.

You’ll also get these results if you’re running on a UCS chassis against a Nexus switch, but in a slightly different format. With the UCS and other blade chassis type systems you can actually find other ways to determine the switch port you’re connected to, but that’s a topic for another blog post (and once I get more experience on the UCS.)

What if none of this works?

If all this doesn’t work for you, make sure you’re using Cisco switches. CDP is a proprietary protocol, so your Dell, HP, Juniper, 3Com, Netgear, Trendnet, SuperCheapNet switches probably aren’t going to give you any of this data.

However, as of ESXi 5.0, VMware does support Link Layer Discovery Protocol (LLDP), which is the IEEE-standardized equivalent of CDP. The problem is they only support it with Distributed vSwitches, which require Enterprise Plus licensing. A lot of the environments I work in either don’t have that licensing and/or have not adopted Distributed vSwitches. For reasons unknown, VMware does not support LLDP on regular vSwitches. (For more information on how to use LLDP check out Ivo Beerens’ post.)

If you’ve got Cisco equipment but it’s still not working, make sure CDP is enabled on your hosts. As of ESX 3.5 it should be by default, but it may have been disabled. For more information on how to troubleshoot this, check out VMware KB1003885.
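
The short version from that KB, if you’d rather check from an SSH session on the host (vSwitch0 stands in for whichever vSwitch you’re working with):

esxcfg-vswitch -b vSwitch0        # prints the current CDP mode: down, listen, advertise, or both
esxcfg-vswitch -B both vSwitch0   # listen for CDP data and advertise the host back to the switch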

Determining the layout of vSphere host memory

Memory utilization is important in VMware; most of the time it’s the most limiting factor in the virtual-to-physical consolidation ratio. Often I’m tasked with assessing how upgradable a physical host’s current memory configuration is. It’s easy to see from the vSphere Client how much memory is installed in a host, but when you’re upgrading you need to know exactly how that memory is laid out on the motherboard so you can get the most bang for your buck.

There are basically three ways to do this:

  1. Open up the case and see. This is going to require downtime (because you wouldn’t open the case while you’re running production systems, right?) This is all well and good because you can just vMotion your virtual machines to another host and shut it down. Problem is, if you’re having memory utilization issues, chances are you’re overcommitting on your hosts, so you’re going to need to shut down virtual machines to do this.
  2. Use an out-of-band management utility like DRAC or iLO. Great if your server has them configured, but a lot of people either don’t realize they have these or don’t bother to set them up until someone points out how useful they are. Usually configuring them requires a reboot of the host, which means downtime, and I just explained why that’s probably not great in this situation.
  3. SSH into your hosts and run a couple of commands. This is what I’m going to explain how to do.

Everything I’m going to show you is documented from the VMware KB. If you’d rather refer to those go here for ESXi 4.x/5.x or go here for ESX 3.x/4.x. Make sure you know what version you’re checking, so you can use the right commands.

ESXi 4.x/5.x

The first thing you’ll need to do is enable SSH on your hosts. Best practice is to leave SSH off and only turn it on when you need it. You can enable it by opening up the vSphere Client, selecting the Host and Clusters view, and then selecting the host you want to enable SSH on in the left hand window. Select the Configuration tab, and then Security Profile from the options on the left. Under services you’ll see SSH. Click on Properties, select SSH from the list of services, and then press Options. In the window, press Start to enable the SSH service. Leave the settings that ask you about starting this service automatically set to manual. For security, you don’t want SSH turned on all the time. You’ll also get warnings from each host it’s enabled on if you leave it turned on. When we’re done you’ll want to come back here and disable SSH on your host. (Note: If you’ve previously closed port 22 on your ESXi firewall, you’ll need to open that back up. By default the port is open but the service is not running.)

At this point you need to SSH into your host as root. Keep in mind unless you joined your ESXi box to your Active Directory domain, you probably can’t just use your normal network account to get into the host this way. It’s going to be root or another local account you’ve created.

If you’re on Windows, I suggest using Putty. If you’re on a Mac or Linux box, no need to download anything extra as it’s all built in. Just open up Terminal and away you go.

(I’m normally a Mac user, but I access my work demo lab through a Windows 7 virtual machine running on VMware View. So here are the results from Putty.)

What you’ll want to do is navigate to a location you can easily access through the vSphere datastore browser, because we’re going to run a command and output the results to a text file so we can easily get at the information we want. I suggest using a local disk on the host, an ISO/template datastore, or maybe a shared datastore that you use for things like dumping host logs. The output file is only going to be a few MB, so the location isn’t critical as long as it’s easily accessible. When we’re done, we’ll delete it from the host.

cd /vmfs/volumes/YOUR_DATASTORE

You’ll notice that the result of your command will change your current directory to something like this: /vmfs/volumes/4ea066d9-d9f09a90-c026-0025b5aa002c — This is normal. Do not be alarmed.

At this point we’re going to run the command that will query the system for all the physical hardware, and export it to a text file.

cim-diagnostic.sh > YOUR_SERVER_NAME.txt

You can call the file after the > whatever you want. Most of the time I keep it unique because I’m going to be doing this command on multiple systems and want to easily identify which one it came from.

At this point you can go back to the vSphere Client and open up the Datastore Browser on the datastore you ran the command on. You can get to this easily by clicking on the host in Host and Clusters and then under the Summary page, right clicking on the datastore listing and then Browse Datastore.

Use the Datastore Browser to download the file to your desktop. (Right click file > Download)

Now the problem with this file is that Notepad doesn’t know how to handle the way ESXi outputs its line endings, so when you open it up, all the text runs together.

I would suggest opening the file in something like Notepad++ which is really far more useful and can read the log file correctly. It’s also helpful for other VMware logs that don’t save whitespace in a way Notepad likes. (Note, Mac users can open the file in TextEdit just fine.)

Run a search within the document and find the section that starts with Dumping instances of CIM_PhysicalMemory. You’ll see the first entry as Tag = 32.0, and if you scroll down through the section it’ll go until it runs out of memory slots. For instance, the server I ran my export on is a Cisco UCS B250 with 46 memory slots, so the last entry will be 32.45.

The key bits of information here are things like MaxMemorySpeed and Capacity if you’re trying to figure out what to buy. Capacity is listed in bytes, so 4294967296 is going to be a 4GB DIMM. There is also a lot of other good information in the export, such as the position of the DIMM on the motherboard, the node and channel the memory is utilized by, whether the slot is even in use, and things like serial numbers and part numbers.
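
If you’d rather not scroll, you can also pull the interesting fields out right at the SSH session before downloading anything; a quick sketch using the field names mentioned above:

grep -E "Tag|Capacity|MaxMemorySpeed" YOUR_SERVER_NAME.txt

That prints one line per field for each DIMM entry, which is usually enough to count slots and sizes at a glance.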

At this point you can delete the file from the host, if you choose, either by utilizing the Datastore Browser or at the SSH session you may still have open.

rm YOUR_SERVER_NAME.txt

Now you can close your SSH session, and turn SSH back off on your host in the same section where you previously turned it on.

ESX 3.x/4.x

The method for obtaining this information on ESX is similar to the ESXi method explained above; the real differences are that the command is different and the output file isn’t as detailed (although it’s much easier to read).

The first thing we need to do is enable SSH on the host. On ESX 3.x/4.x, SSH is disabled by default for the root account; the SSH service does not allow root logins. Non-root users are able to log in with SSH, and you can then elevate to the root user. As an alternative to enabling SSH on your host, you can physically log in to the console of the host and run the commands there.

From VMware KB 8375637:

If you do not have any other users on the ESX host, you can create a new user by connecting directly to the ESX host with VMware Infrastructure (VI) or vSphere Client. Go to the Users & Groups tab, right-click on the Users list and select Add to open the Add New User dialog. Ensure that the Grant shell access to this user option is selected. These options are only available when connecting to the ESX host directly. They are not available if connecting to vCenter Server.

If you’re on Windows, I suggest using Putty. If you’re on a Mac or Linux box, no need to download anything extra as it’s all built in. Just open up Terminal and away you go.

(I’m normally a Mac user, but I access my work demo lab through a Windows 7 virtual machine running on VMware View. So here are the results from Putty.)

After logging in to your host with your regular user account, elevate to the root user:

su -

You’ll be prompted for your root password. Enter it now.

What you’ll want to do is navigate to a location you can easily access through the vSphere datastore browser, because we’re going to run a command and output the results to a text file so we can easily get at the information we want. I suggest using a local disk on the host, an ISO/template datastore, or maybe a shared datastore that you use for things like dumping host logs. The output file is only going to be a few MB, so the location isn’t critical as long as it’s easily accessible. When we’re done, we’ll delete it from the host.

cd /vmfs/volumes/YOUR_DATASTORE

You’ll notice that the result of your command will change your current directory to something like this: /vmfs/volumes/4ea066d9-d9f09a90-c026-0025b5aa002c — This is normal. Do not be alarmed.

At this point we’re going to run the command that will query the system for all the physical hardware, and export it to a text file.

smbiosDump > YOUR_SERVER_NAME.txt

You can call the file after the > whatever you want. Most of the time I keep it unique because I’m going to be doing this command on multiple systems and want to easily identify which one it came from.

At this point you can go back to the vSphere Client and open up the Datastore Browser on the datastore you ran the command on. You can get to this easily by clicking on the host in Host and Clusters and then under the Summary page, right clicking on the datastore listing and then Browse Datastore.

Use the Datastore Browser to download the file to your desktop. (Right click file > Download)

Run a search within the document and find the section that starts with Physical Memory Array. You should see a summary that lists how many slots the system has, as well as the maximum memory size. Then there will be an entry listed for each memory slot. For instance, on the Dell R710 I ran an export on, there were 18 slots for a maximum of 192GB. If there is memory installed in a slot you’ll see the size of the DIMM; otherwise you’ll see No Module Installed under size.
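
As with the ESXi output, you can peek at the file from the SSH session before downloading it; for example, jumping straight to the memory section in less (the search pattern is just the section name above):

less '+/Physical Memory Array' YOUR_SERVER_NAME.txt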

At this point you can delete the file from the host, if you choose, either by utilizing the Datastore Browser or at the SSH session you may still have open.

rm YOUR_SERVER_NAME.txt

Now you can close your SSH session.

vCOPS for View download is borked

I’ve been on a View 5.1 deployment with a customer all week, and part of the project involved deploying VMware vCenter Operations Manager (vCOPS) for View, version 1.01. I’ve done this a couple of times before and had no issues getting the Linux OVA-based vApp configured. Then when I went to install the View adapter into a Windows VM, I got a strange message about how the installer was a 32-bit application and not able to run on a 64-bit system.

Two things wrong with this:

  1. Normally 32-bit apps run on 64-bit operating systems, unless they’re specifically configured not to.
  2. vCOPS for View is a 64-bit application, with a 64-bit installer. The system requirements state it can only run on Windows 2008 R2 or Windows 2003 R2 64-bit.

After playing around with the 1.01 installer, and then downloading and starting the 1.0 installer just fine on the same system, I noticed that the published file size on VMware.com is 22MB, but the 1.01 installer I was downloading was only 16MB. I ran an MD5 checksum on the file and it didn’t match the published checksum on the website either. The file creation date showed sometime in late December, while the published file date is somewhere in early October.

Eventually I was able to find a copy of a previously used 1.01 installer on another system, ran a checksum on it, and it matched the published checksum. I installed the adapter using this file and it worked just fine. The customer’s vCOPS environment is up and running.

I have a support case open with VMware right now letting them know about this issue; hopefully they get it corrected soon. I realize it’s not a particularly popular product compared to something like vSphere or even a View connection broker, but it’s hard to see how this could have gone on for nearly a month without someone else noticing.

TL;DR vCOPS for View 1.01 installer on vmware.com is screwed up, I’m working with VMware to get it fixed.

Faking an SSD drive in vSphere

Notes on tricking VMware into thinking a datastore is actually an SSD drive. Very useful if you’re in a lab environment and just want to test some of the features in vSphere 5 that center around flash storage (but don’t have the funds to dedicate to actually having it).

However it’s also useful if you actually have flash storage in a production environment but for some reason vSphere isn’t recognizing that fact.
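
The usual trick here is a PSA claim rule that tags the device with the enable_ssd option and then reclaims it. A minimal sketch follows; the device ID is made up, so substitute your own from esxcli storage core device list:

# tag the local device as SSD (device ID is a placeholder)
esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=mpx.vmhba1:C0:T0:L0 --option=enable_ssd
# reclaim the device so the new rule applies
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T0:L0
# verify: the device details should now show "Is SSD: true"
esxcli storage core device list -d mpx.vmhba1:C0:T0:L0 | grep -i ssd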