When vRouter starts, the kernel module allocates memory for both the Flow and Bridge tables, which are hash tables. This allocation must be contiguous, meaning the memory assigned to the process consists of consecutive blocks rather than being fragmented across the memory space. How does this affect vRouter as a process? Well, if you try to restart it on a system that is short on memory, there is a high probability that it won't come up, primarily due to the memory allocation failure. In my own observation, this tends to happen when compute nodes have less than roughly 15GB of free memory. It may not happen, but it can. And again, the root cause is not the low amount of free memory itself, but the lack of a sufficient number of contiguous pages for such allocations; having more free memory simply improves the odds. When the error triggers, something along the lines of the following gets populated in contrail-vrouter-agent.log:
contrail-vrouter-agent: controller/src/vnsw/agent/vrouter/ksync/ksync_memory.cc:100: void KSyncMemory::Mmap(bool): Assertion `table_size_ != 0' failed.
This greatly limits the possibilities of live-patching a system running vRouter. The workaround usually recommended to avoid this kind of memory fragmentation is to reboot the node, forcing the vRouter kernel module to be inserted right after the OS boots, before memory gets drained or fragmented by other processes. If you have workloads running over vRouter Kernel-Mode, it comes down to how long you can sustain having vRouter down.
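Before attempting a restart at all, it can help to check how fragmented the node actually is. A rough, vRouter-agnostic way to do that (just standard Linux, so treat this as a sketch) is /proc/buddyinfo, which lists free blocks per allocation order for each memory zone; the columns further to the right are the larger contiguous chunks (order N means 2^N contiguous 4KB pages, up to 4MB at order 10), and a node where those columns are mostly zero is badly fragmented:
# Free contiguous blocks per order and per zone; mostly-zero right-hand columns = fragmented node
cat /proc/buddyinfo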
You might think: “So where is the resiliency / auto-healing / migration of VMs here?” Trust me when I tell you that I’ve seen production environments where these options were simply not possible due to infrastructure design constraints. One of those use cases did not even involve vRouter directly: it concerned a critical VM running on a compute node that carried both vRouter and SR-IOV interfaces, with no option to reboot the node following a minor release upgrade (VM2 in the following picture):
Working around the limitations of this specific use case is covered a bit further down in the post, but first let’s explore how recent vRouter releases offer options to navigate around this issue.
HugePages:
Starting in Contrail Release 2005, vRouter has the option to use Hugepages in Kernel-Mode in environments deployed using Red Hat OpenStack or Juju Charms, alongside the long-supported DPDK-Mode. This lets vRouter core processes use Hugepages to allocate memory for the Flow and Bridge tables, similar to how forwarding benefits from them in DPDK (except in user space, obviously). Hugepages, in a nutshell, let you reserve chunks of memory (known as pages) in a larger-than-usual size (1GB or 2MB pages, compared to the default 4KB pages) and let processes map them for their own use. This improves performance by reducing the work needed to access page table entries (e.g. using a single 1GB page instead of switching between 256K separate 4KB pages):
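As a point of reference (these are standard Linux sysfs paths, not something specific to Contrail, so consider them illustrative), you can see which hugepage pools the kernel has set up and how many pages are reserved in each:
# Hugepage pools available on the node (the directory names encode the page size)
ls /sys/kernel/mm/hugepages/
# Number of pages currently reserved in the 1GB pool
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages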
Coming back to the case we faced, Hugepages were already present on the node for use by VMs. After testing various tricks to bring vRouter up without rebooting the node or disrupting the services running over SR-IOV, an idea sparked in a discussion with the team: manipulate free memory by adjusting the Hugepages. Note that the release we had been working with was an older one that did not support Hugepages for vRouter Kernel-Mode, but Hugepages were present for the sake of VM performance. Hugepages are usually allocated on those production nodes at boot, which can be checked in grub or by looking at /proc/cmdline:
[root@overcloud-sriov3 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.6.3.el7.x86_64 crashkernel=auto rhgb quiet intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=340 isolcpus=2,3,4,5 processor.max_cstate=1 intel_idle.max_cstate=1 intel_pstate=disable nosoftlockup
The key values are hugepagesz, which defines whether we’re using 1GB or 2MB pages (or both), and hugepages, which sets the total number of pages. In the example above, 340GB of memory on that node is allocated for Hugepages.
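For completeness, on RHEL-based systems these parameters normally come from the GRUB_CMDLINE_LINUX line in /etc/default/grub. The snippet below is an illustrative sketch rather than this node’s exact configuration, and how the file is managed and regenerated depends on your deployment tooling:
# /etc/default/grub (excerpt) - reserve 340 x 1GB pages at every boot
GRUB_CMDLINE_LINUX="... intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=340 ..."
# Regenerate the grub config afterwards, e.g. on a BIOS-booted RHEL/CentOS 7 node:
# grub2-mkconfig -o /boot/grub2/grub.cfg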
Something that gives you more insight is checking /proc/meminfo:
[root@overcloud-sriov3 ~]# cat /proc/meminfo | egrep "HugePages_Total|HugePages_Free"
HugePages_Total:     340
HugePages_Free:       28
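To translate those numbers into actual memory (assuming the 1GB page size configured on this node), a quick one-liner over /proc/meminfo does the math; this is just a convenience, not something the workaround requires:
# Free hugepages multiplied by the hugepage size (both read from /proc/meminfo), printed in GB
awk '/HugePages_Free/ {free=$2} /Hugepagesize/ {size=$2} END {print free*size/1048576, "GB free in hugepages"}' /proc/meminfo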
With some unused 1G Hugepages at our disposal, freeing them at runtime and letting vRouter use that room to find the contiguous chunks it needs did the trick:
[root@overcloud-sriov3 ~]# contrail-status
== Contrail vRouter ==
supervisor-vrouter:           active
contrail-vrouter-agent        failed
contrail-vrouter-nodemgr      active
[root@overcloud-sriov3 ~]# sysctl -a | grep "vm.nr_hugepages"
vm.nr_hugepages = 340
vm.nr_hugepages_mempolicy = 340
[root@overcloud-sriov3 ~]# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
170
[root@overcloud-sriov3 ~]# sysctl -w "vm.nr_hugepages=320"
vm.nr_hugepages = 320
[root@overcloud-sriov3 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@overcloud-sriov3 ~]# systemctl restart supervisor-vrouter
[root@overcloud-sriov3 ~]# contrail-status
== Contrail vRouter ==
supervisor-vrouter:           active
contrail-vrouter-agent        active
contrail-vrouter-nodemgr      active
The 170 above is the number of Hugepages per NUMA node; since this system had two NUMA nodes, 170 pages were allocated from each. And yes, applying the change using sysctl isn’t the only way. You can do it like this as well:
[root@overcloud-sriov3 ~]# echo 160 > /sys/devices/system/node/node_id/hugepages/hugepages-1048576kB/nr_hugepages
Either way, make sure to verify how much free memory you have before doing this, and roll back once done if you’d like to.
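To wrap the whole runtime workaround up in one place, here is a rough sketch of the sequence as a script. The commands and the supervisor-vrouter unit name come from the outputs above; the target page count is a placeholder you would adapt after checking HugePages_Free on your own node:
#!/bin/bash
# Sketch of the runtime workaround: shrink the hugepage pool, free contiguous memory,
# and restart vRouter so its kernel module can allocate the Flow and Bridge tables.
set -e

HUGEPAGES_TARGET=320   # new pool size; reduce by no more than HugePages_Free

# 1. Confirm there really are unused hugepages to give back
grep -E "HugePages_Total|HugePages_Free" /proc/meminfo

# 2. Shrink the pool, returning 1GB-aligned contiguous memory to the kernel
sysctl -w "vm.nr_hugepages=${HUGEPAGES_TARGET}"

# 3. Drop caches so the freed memory stays as clean as possible for large allocations
echo 3 > /proc/sys/vm/drop_caches

# 4. Restart vRouter and verify the agent comes up
systemctl restart supervisor-vrouter
contrail-status

# 5. Optionally grow the pool back to its original size once the agent is active
# sysctl -w "vm.nr_hugepages=340"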
Very nice article. One question: using Hugepages minimizes the use of fragmented memory locations but doesn’t eliminate it, right?
Yes, you’re correct. That’s why it is recommended to allocate Hugepages at boot time, not at runtime. Runtime allocations are more prone to errors because memory may already be fragmented (if you want a 2MB page, you need 512 contiguous free 4KB pages).