
vRouter Agent and Memory Allocation

When vRouter starts, the kernel module allocates memory for both the Flow and Bridge tables, which are hash tables. This allocation must be contiguous, meaning the memory handed to the process consists of consecutive blocks rather than fragments scattered across the memory space. How does this affect vRouter as a process? Well, if you try to restart it on a system that is short on memory, there is a high probability that it won't come up, primarily due to memory allocation failure. In my own experience, this behavior tends to occur when compute nodes have less than ~15GB of free memory. It may not happen, but it can. Again, the root cause is not simply that the system is low on free memory, but that it does not have a sufficient number of contiguous pages for such allocations; having more free memory definitely improves the odds, though. When the error triggers, something along the lines of the following gets written to contrail-vrouter-agent.log:

contrail-vrouter-agent: controller/src/vnsw/agent/vrouter/ksync/ksync_memory.cc:100: void KSyncMemory::Mmap(bool): Assertion `table_size_ != 0' failed.
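
A quick way to tell fragmentation apart from plain memory pressure is /proc/buddyinfo, a generic Linux check rather than anything vRouter-specific. Each column counts the free blocks of a given order per memory zone, doubling from 4KB on the left up to 4MB on the right, so near-zero values in the right-hand columns mean large contiguous allocations are likely to fail even when free memory looks comfortable. The numbers below are purely illustrative:

[root@overcloud-sriov3 ~]# cat /proc/buddyinfo
Node 0, zone   Normal   2742   1381    512    104     21      4      1      0      0      0      0
Node 1, zone   Normal   3180   1622    603    140     35      8      2      1      0      0      0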

This greatly limits the possibilities of live-patching a system that is running vRouter. The workaround usually recommended to avoid this memory-fragmentation situation is to reboot the node, forcing the vRouter Kernel Module to be inserted right after OS boot, before memory gets drained or fragmented by other processes. If you have workloads running over vRouter in Kernel-Mode, it comes down to how much time you can sustain with vRouter down.

You might think: “So where is the resiliency / auto-healing / migration of VMs here?” Trust me when I tell you that I’ve seen production environments where these options were simply not possible due to infrastructure design constraints. One of those use cases did not even involve vRouter directly: it was a critical VM running on a compute node with both vRouter and SR-IOV interfaces, and no option to reboot the node following a minor release upgrade (VM2 in the following picture):

vRouter SR-IOV Mode

source: https://tungstenfabric.github.io/website/Tungsten-Fabric-Architecture.html#vrouter-deployment-options

Working around the limitations of this specific use case is covered a bit further down in the post, but first let’s explore how recent vRouter releases offer options to navigate around this issue.

HugePages:

Starting in Contrail Release 2005, vRouter has the option to use Hugepages in Kernel-Mode in environments deployed using Red Hat OpenStack or Juju Charms, alongside the long-supported DPDK-Mode. This enables vRouter core processes to utilize Hugepages when allocating memory for the Flow and Bridge tables, similar to how forwarding benefits from them in DPDK (except being in user space, obviously). Hugepages, in a nutshell, allow you to reserve chunks of memory (known as pages) in a larger-than-usual size (1GB or 2MB pages, compared to the default 4KB), and let processes map them for their own use. This enhances performance by reducing the work needed to access page table entries (e.g. using a single 1GB page instead of switching between 256K 4KB pages):

Memory and HugePages Pools
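
If you are not sure what a node supports or has configured, the standard sysfs and procfs locations are enough to check. The listing below is just an illustration, assuming a node with both 2MB and 1GB page sizes available and 1GB set as the default:

[root@overcloud-sriov3 ~]# ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB
[root@overcloud-sriov3 ~]# grep Hugepagesize /proc/meminfo
Hugepagesize:    1048576 kB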

Coming back to the case we faced, Hugepages were already present on the node for use by VMs. After testing various tricks to bring vRouter up without rebooting the node or disrupting the services running over SR-IOV, the idea of freeing up memory by adjusting the Hugepages count sparked in a discussion with the team. Note that the release we were working with was an older one that did not support Hugepages for vRouter Kernel-Mode; the Hugepages were there for the sake of VM performance. Hugepages are usually allocated on those production nodes at boot, which can be checked in grub or by looking at /proc/cmdline:

[root@overcloud-sriov3 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.6.3.el7.x86_64 crashkernel=auto rhgb quiet intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=340 isolcpus=2,3,4,5 processor.max_cstate=1 intel_idle.max_cstate=1 intel_pstate=disable nosoftlockup

Key values are hugepagesz, which sets the page size (1GB or 2MB; it can appear more than once to enable both), and hugepages, which sets the number of pages of that size. In the example above, 340 x 1GB = 340GB of memory on that node is reserved for Hugepages.
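
As a side note, and purely as a sketch (the counts here are arbitrary, not taken from this node), a kernel command line that enables both sizes at once simply repeats the size/count pairs, with default_hugepagesz deciding which size is used when a process does not ask for a specific one:

default_hugepagesz=1G hugepagesz=1G hugepages=64 hugepagesz=2M hugepages=2048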

Monitoring systems that look at free memory values, let’s say from free -mh, will report memory utilization misleadingly: they do not take into account that memory allocated to Hugepages is not necessarily being used by any process.
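
To illustrate the point (the numbers below are made up for a node with 340 x 1GB pages reserved), free counts the whole Hugepages pool as used memory, whether or not any process has actually mapped those pages:

[root@overcloud-sriov3 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           376G        348G         18G        1.0G         10G         24G
Swap:            0B          0B          0B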

Something that gives you more insight is checking /proc/meminfo:

[root@overcloud-sriov3 ~]# cat /proc/meminfo | egrep "HugePages_Total|HugePages_Free"
HugePages_Total:     340
HugePages_Free:      28

With some unused 1GB Hugepages at our disposal, freeing a few of them at runtime and letting vRouter use that room to find the contiguous chunks it needed did the trick:

[root@overcloud-sriov3 ~]# contrail-status
== Contrail vRouter ==
supervisor-vrouter:           active
contrail-vrouter-agent        failed
contrail-vrouter-nodemgr      active

[root@overcloud-sriov3 ~]# sysctl -a | grep "vm.nr_hugepages"
vm.nr_hugepages = 340
vm.nr_hugepages_mempolicy = 340

[root@overcloud-sriov3 ~]# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
170

[root@overcloud-sriov3 ~]# sysctl -w "vm.nr_hugepages=320"
vm.nr_hugepages = 320

[root@overcloud-sriov3 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@overcloud-sriov3 ~]# systemctl restart supervisor-vrouter

[root@overcloud-sriov3 ~]# contrail-status
== Contrail vRouter ==
supervisor-vrouter:           active
contrail-vrouter-agent        active
contrail-vrouter-nodemgr      active

The 170 above is the number of Hugepages per NUMA node; as this system had two NUMA nodes, 170 pages were allocated from each. The echo 3 > /proc/sys/vm/drop_caches step, by the way, just asks the kernel to drop the page cache and reclaimable slab objects, helping consolidate free memory before the restart. And yes, applying the change using sysctl isn’t the only way; you can also do it per NUMA node through sysfs (with node_id replaced by the node in question, e.g. node0):

[root@overcloud-sriov3 ~]# echo 160 > /sys/devices/system/node/node_id/hugepages/hugepages-1048576kB/nr_hugepages
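
For example, reading both NUMA nodes at once shows the even split; the values below correspond to the original 340-page state on this system:

[root@overcloud-sriov3 ~]# cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages
170
170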

Either way, make sure to verify how much free memory you have before doing this, and to roll back once you are done if you would like to.
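
A minimal before-and-after routine, sticking to the same node and the counts used in this example, could be as simple as:

[root@overcloud-sriov3 ~]# free -h
[root@overcloud-sriov3 ~]# grep HugePages_Free /proc/meminfo
[root@overcloud-sriov3 ~]# sysctl -w "vm.nr_hugepages=340"

with the last command restoring the original 340-page count once vRouter is back up and you no longer need the extra headroom.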

6 comments

  1. Mohammed Aljaaidi

    Another great post!
    Thanks for sharing.

  2. Very nice article. One question: using Hugepages minimizes the use of fragmented memory locations but doesn’t eliminate it, right?

    • Yes, you’re correct. That’s why it is recommended to allocate Hugepages at boot time, not at runtime. Runtime allocations are more prone to errors, as the memory may already be fragmented (if you want a 2MB page, you need 512 x 4KB free contiguous pages).

  3. Yet another great post! Thanks.
