VMware Cloud Foundation 4.0.1 – issue resolved

A month or so ago I posted about having issues with VMware Cloud Foundation 4.0.1 in a greenfield deployment that runs on fully compatible servers underneath.

The management domain was never an issue – it deployed out of the box, and I repeated it three times – but the first workload domain creation workflow kept failing, with the very same issue every single time.

I was wondering why it is so hard to consume vSAN Ready servers that have four physical network interfaces. In VCF 4.0.1, Cloud Builder can deploy the management domain with 4 pNICs, as the deployment worksheet has an option for that. But Cloud Builder deploys the management domain only, not a single workload domain.

So in the worksheet that Cloud Builder consumes there are several options for how those NICs should be configured.

Options are:

  • 1 vDS – 2 pNICs – all in one: management VLAN, vMotion, vSAN, Host TEP, Edge TEP
  • 2 vDS – 2 pNICs/vDS – first vDS has management VLAN, vMotion, vSAN, while the second vDS has Host TEP, Edge TEP
  • 2 vDS – 2 pNICs/vDS – everything on the first vDS except vSAN, second vDS dedicated to vSAN
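The three profiles above can be summarized as a simple mapping from vDS to the traffic types it carries. This is only an illustrative sketch – the profile and switch names here ("Profile-1", "vds01", etc.) are shorthand, not the exact labels Cloud Builder uses in the worksheet:

```python
# Hypothetical summary of the three NIC profiles in the Cloud Builder
# worksheet; names are illustrative, not Cloud Builder's own labels.
PROFILES = {
    "Profile-1": {  # 1 vDS, 2 pNICs - everything on one switch
        "vds01": ["management", "vMotion", "vSAN", "host TEP", "edge TEP"],
    },
    "Profile-2": {  # 2 vDS, 2 pNICs each - NSX overlay traffic split out
        "vds01": ["management", "vMotion", "vSAN"],
        "vds02": ["host TEP", "edge TEP"],
    },
    "Profile-3": {  # 2 vDS, 2 pNICs each - vSAN traffic split out
        "vds01": ["management", "vMotion", "host TEP", "edge TEP"],
        "vds02": ["vSAN"],
    },
}

# In Profile-2 the second switch carries NSX overlay traffic only:
print(PROFILES["Profile-2"]["vds02"])
```

Note that in Profile-2 the second vDS exists solely for NSX – which matters later in this story.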

My choice here was “Profile-2”, and that was fine, but the worksheet still holds one very important choice.

In cell K27 you can choose between “Yes” and “No”. Regardless of your choice, 3 NSX Managers/Controllers will be deployed into the management domain even if you don’t need them; furthermore, all hosts in the management domain will get 2 TEP IPs and interfaces as well.

If the selected option in cell K27 is “Yes”, then two Edge VMs will be deployed in addition, so T0/T1 and optionally BGP will be at your disposal when the installation ends. Since we are talking about 4.0.1, which goes together with vSphere 7 and NSX-T 3.0.x, we are effectively talking about two distributed vSwitches here – if we selected “Profile-2” or “Profile-3” before.

If cell K27 is set to “No”, the part of the table shown above is greyed out and the Edge VMs will not be created; the hosts, however, will still get their TEP interfaces. Both vDSes will be created, and the host vmkernel ports that represent the TEP interfaces will be connected to the second vDS, but no port group will be created on it, since vmkernel ports do not consume a port group.

So after selecting “No” the deployment succeeds, no issues there. Once done, all further jobs are handled by SDDC Manager. While Cloud Builder is prepared for 4-pNIC deployments, SDDC Manager still is not, so the GUI can only deploy 2-pNIC workload domains; 4 pNICs require the API way.
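For the API way, the interesting part of the workload domain creation payload is where the pNICs are mapped to the two distributed switches. The sketch below builds just that fragment; the field names (`vmNics`, `id`, `vdsName` under `hostNetworkSpec`) follow my recollection of the VCF 4.x public API for `POST /v1/domains`, and the vDS names are made up – verify everything against your own SDDC Manager's API reference before use:

```python
# Hypothetical sketch of the 4-pNIC fragment of a workload domain
# creation payload for POST /v1/domains on SDDC Manager. Field names
# are based on the VCF 4.x API ("vmNics" under "hostNetworkSpec");
# verify against your SDDC Manager's API schema. vDS names are invented.
import json

def host_network_spec(vds1: str, vds2: str) -> dict:
    """Map two pNICs to each vDS, mirroring the Profile-2 layout."""
    return {
        "vmNics": [
            {"id": "vmnic4", "vdsName": vds1},  # first 10G port, card 1
            {"id": "vmnic6", "vdsName": vds1},  # first 10G port, card 2
            {"id": "vmnic5", "vdsName": vds2},  # second 10G port, card 1
            {"id": "vmnic7", "vdsName": vds2},  # second 10G port, card 2
        ]
    }

spec = host_network_spec("wld-vds01", "wld-vds02")

# Sanity check: each vDS should end up with exactly two uplinks.
per_vds = {}
for nic in spec["vmNics"]:
    per_vds.setdefault(nic["vdsName"], []).append(nic["id"])
print(json.dumps(per_vds, indent=2))
```

Splitting the uplinks across the two cards (vmnic4+vmnic6 on one vDS, vmnic5+vmnic7 on the other) keeps each switch alive if a whole dual-port card fails.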

The JSON I prepared for this was sent to the SDDC Manager API, and then the exception error was shown – the first picture in this article.

Every single server I have for VCF has 8 NICs in total: 4 of them are on-board 1 Gbit ports – vmnic0-1-2-3 – and 4 are 10 Gbit, on two dual-port cards.

Many troubleshooting steps were suggested:

  • the first four ports must be disabled. It did not help.
  • probably SDDC Manager requires consecutive vmnic numbering, so my JSON is flawed, as I want to add vmnic4 and vmnic6 to the first vDS and vmnic5 and vmnic7 to the second. I doubted this was the issue, as vmnic numbering goes card by card and port by port within a card. I changed the aliases anyway, but it did not help.
  • probably SDDC Manager expects the numbering to start with vmnic0. I changed the aliases for all NICs again to start with that, but it did not help.

As my last option I tested the deployment with 2 pNICs only, from the SDDC Manager GUI. Aaaand that failed the same way as all the other attempts.

This was the point when a GSS support ticket was raised. After the third log collection request they solved it. The issue is caused by the “No” option in cell K27: as a result of that choice, the second vDS was created with no port group on it, since nothing – no Edge VM – would consume one. Although this happens in the management domain, SDDC Manager still checks it for some strange reason when a workload domain is deployed, and fails.

Following the deeper investigation by Engineering, this seems to be caused by the fact that the mgmt dvs named ‘mgmt-vds02’ which is only used for NSX-T doesn’t have any portgroups defined (which is expected behaviour). However, the GenerateViInternalModelAction doesn’t check for null value of the portgroups which is causing a NullPointerException.
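The failure pattern engineering described can be sketched in a few lines. This is purely illustrative – SDDC Manager's `GenerateViInternalModelAction` is Java, not Python, and I have never seen its source – but the shape of the bug is the same: a walk over each vDS's port groups that never guards against the list being null:

```python
# Illustrative-only sketch (Python, not SDDC Manager's actual Java) of
# the failure: the model generation step walks every management vDS's
# port groups, and the NSX-T-only 'mgmt-vds02' has none at all (null),
# so the unguarded loop blows up - analogous to the NullPointerException
# in GenerateViInternalModelAction.
def collect_portgroups_unguarded(vds_list):
    names = []
    for vds in vds_list:
        for pg in vds["portgroups"]:   # crashes when portgroups is None
            names.append(pg)
    return names

def collect_portgroups_guarded(vds_list):
    names = []
    for vds in vds_list:
        for pg in vds.get("portgroups") or []:  # null-safe: None -> empty
            names.append(pg)
    return names

inventory = [
    {"name": "mgmt-vds01", "portgroups": ["mgmt", "vmotion", "vsan"]},
    {"name": "mgmt-vds02", "portgroups": None},  # NSX-T only, no PGs - expected
]

try:
    collect_portgroups_unguarded(inventory)
except TypeError as e:
    print("unguarded walk failed:", e)   # the Python analogue of the NPE

print(collect_portgroups_guarded(inventory))
```

This also explains why GSS's fix worked: writing an empty list into the record turns the null into something the unguarded code can safely iterate.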

So what is the solution then? While it is allowed and supported – and will probably be fixed in some newer release – do not select “No” in cell K27 at initial build if you use “Profile-2”. I suspect that even manually creating a PG on the second vDS after the initial deployment would give the same result. GSS sent a command that filled the record with an empty list, and that solved the issue. I am not publishing that command, as it needs the ID of the second vDS and it manipulates the Postgres database under SDDC Manager, which is really dangerous.


While it is totally allowed to set cell K27 to “No”, it seems I am the first one on this planet to select that together with “Profile-2”. I wonder how this was tested before the release, as I can recreate the issue 10 times out of 10.

Is this demotivating? A little, since it has delayed a POC by a month, and an exception in the code is not something I can troubleshoot myself, since I cannot see the source code of SDDC Manager.

Posted in VCF