How to increase the amount of RAM available to virtual machines by tuning the virtualization host.
I recently migrated my production system over to a fresh install on ZFS with PVE version 6, which was overall a great success. I did, however, have to do some quick research on memory tuning to fend off the infamous OOM (Out Of Memory) killer, and that is what I’ll be sharing with you today. Ideally, I would just add enough RAM to cover all of my VM allocations plus about 50% to 55% on top for ZFS and Proxmox, but I can’t justify that expense right now.
If you can afford to, just get more RAM; if you can’t or don’t want to do that, the tips below will help you manage the memory you do have available.
There is a lot of talk about ZFS requiring error-correcting (ECC) RAM. This is simply not true: ZFS does not require ECC RAM any more than other filesystems do. That said, it is always better to use ECC RAM regardless of the filesystem in use, but don’t panic if you don’t have ECC; I don’t have it either.
What is the OOM Killer?
The OOM Killer is a Linux kernel mechanism that prevents the system from running entirely out of memory by killing a process (likely one of your VMs) to free up some space in RAM.
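If you suspect the OOM Killer has already struck on your host, the kernel log will tell you. Something along these lines, run from a shell on the Proxmox host, is usually enough to find out (the exact wording of the log messages varies between kernel versions):
dmesg -T | grep -i "out of memory"
journalctl -k | grep -i oom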
Why not add swap space?
Generally, the OOM Killer can be kept at bay by adding swap space, but swap on ZFS is unstable at the time of writing and well known to cause complete system lock-ups under high memory pressure. That leaves only a few options for utilizing swap: you could add swap space on another disk without RAID, which is undesirable because, without redundancy, losing that drive could potentially crash the whole system, or you could set up an MD RAID array on two more disks for swap space.
All of the SATA ports are in use on my system, so that’s not an option either. I probably should have reduced the size of the root ZFS volume during installation to make room for a pair of MD RAID partitions to use as swap space on the same disks as the installation, but hindsight is 20/20, right? So in my case, I’m just going to be running a swap-less system, which makes the memory tuning that should probably be done even with swap all the more important.
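If you want to confirm what your own system is doing with swap, swapon --show prints the active swap devices (it prints nothing at all on a swap-less system like mine), and free -h shows total and used swap alongside RAM:
swapon --show
free -h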
1) VM RAM Allocation
The simplest and most important thing you should do is not over-commit your RAM. Sure, with KSM and memory ballooning you can over-commit RAM, but your system will be far more stable if you don’t.
The best approach is to leave enough RAM unallocated for the ZFS ARC (Adaptive Replacement Cache), plus at least 1 GiB for the Proxmox host itself; I will get into the ARC in more detail down below. The default maximum ARC size is 50% of the total RAM installed in the system.
I have 62.83 GiB of total RAM according to the Proxmox Web GUI, which leaves me approximately 30 GiB of maximum RAM available to safely use for VM allocation, but I was able to increase that to almost 54 GiB of RAM by tuning the ARC.
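To make the arithmetic explicit (these are my numbers, so substitute your own totals): the safe VM allocation is roughly total RAM, minus the maximum ARC size, minus at least 1 GiB for the host.
62.83 GiB - 31.4 GiB (default 50% ARC) - 1 GiB ≈ 30.4 GiB
62.83 GiB - 8 GiB (tuned ARC, see below) - 1 GiB ≈ 53.8 GiB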
2) KSM Tuning
KSM (Kernel Same-page Merging) is a great tool for getting a little more out of your precious RAM. It’s enabled by default in Proxmox, but the default settings are not ideal for use on a ZFS system, where RAM usage can spike wildly the first time a disk-heavy operation takes place (e.g. a backup or clone from one zpool to another).
KSM works by reducing duplicate memory pages to a single page, so when multiple VMs have the same resource stored in memory it is only stored once in RAM and shared between them instead of wasting RAM space on multiple copies of the same resource. This works best when you are running the same operating system on the majority of your VMs.
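If you’re curious how much KSM is actually merging on your host, the kernel exposes counters under /sys/kernel/mm/ksm (this is generic Linux, nothing Proxmox-specific). pages_sharing is roughly the number of saved pages, so multiplying it by the page size (usually 4 KiB) gives an idea of the RAM being saved:
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing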
KSM tuning won’t increase the amount of RAM you can safely allocate to your VMs, but it will give you an extra buffer for periods of high RAM utilization, and that makes me feel a little better about allocating all the RAM that I safely can to VMs. You should do the same, because why have all that expensive RAM if you aren’t using it?
Note: The trade-off for using KSM is an increased attack surface for side channel exploits. It is up to you to weigh the benefit vs. risk of KSM; if you decide against using it, you should disable it by running the command systemctl disable ksmtuned from a shell session on your Proxmox host.
I only changed one setting here, KSM_THRES_COEF, which determines the threshold of available RAM that will trigger KSM to start doing its thing. There are more tuning options available, and if you’d like to really dive into them there is some great information regarding these options in the Red Hat Online Documentation. I want to trigger KSM a little sooner, but not too soon, as KSM uses up CPU cycles, and that’s a waste if you have plenty of RAM available.
The default value for KSM_THRES_COEF is 20, which means KSM will trigger when less than 20% of the total system RAM is free. The formula for this calculation is:
Threshold * Total RAM / 100
I have 62.83 GiB of total RAM according to the Proxmox Web GUI, so on my system the default value is:
20 * 62.83 GiB / 100 = 12.566 GiB
That means KSM will start working only when my RAM usage reaches 50.264 GiB (62.83 GiB total RAM - 12.566 GiB free RAM threshold = 50.264 GiB trigger point).
I played with the numbers a bit and came up with a value of 35 for KSM_THRES_COEF to trigger KSM at about 41 GiB of RAM usage:
35 * 62.83 GiB / 100 = 21.9905 GiB
62.83 GiB - 21.9905 GiB = 40.8395 GiB Threshold
Now it’s as easy as editing the file /etc/ksmtuned.conf: uncomment the line for KSM_THRES_COEF by removing the # at the beginning, and change the value from 20 to your desired value. It seemed to me like this took effect instantly without restarting the system or any processes, but we’re going to reboot the system after ARC tuning anyway, so I wouldn’t worry about it.
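For reference, this is roughly what the edited line looks like in my /etc/ksmtuned.conf (35 is the value I settled on above, so substitute your own):
KSM_THRES_COEF=35
If you want to be certain the daemon re-reads the file, restarting it with systemctl restart ksmtuned shouldn’t hurt, although as I said, a restart didn’t appear to be necessary for me.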
3) ARC Tuning: /etc/modprobe.d/zfs.conf
ARC (Adaptive Replacement Cache) is a function of ZFS that stores the most commonly accessed data in RAM, making for extremely fast reads for those resources. That’s fantastic (even after reducing the ARC on my system there is a noticeable speed boost), but the default maximum size for ARC is 50% of the total system RAM. On a file server, I’d say go even higher than that, but Proxmox is not a file server. On a system with enough RAM this default is probably ideal, but RAM is expensive and I don’t have enough right now, so I need to dial this back.
There are mixed answers as to how much RAM you should dedicate to the ARC; the guideline I saw referenced most often is 1 GiB of RAM per 1 TiB of disk space, but no less than 8 GiB, and that sounds reasonable to me.
Conveniently, I have about 7.5 TiB of total disk space, so I will use 8 GiB for my maximum ARC size. I’ve read that the minimum must also be set for the maximum to take effect, so I’ll use 4 GiB for the minimum; I based that value on very little, because I expect the ARC to be at or near its maximum capacity most of the time. These values must be expressed in bytes, so more math is required:
8 GiB * 1024 = 8192 MiB
8192 MiB * 1024 = 8388608 KiB
8388608 KiB * 1024 = 8589934592 Bytes
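If you’d rather let the shell do the multiplication, something like this gives the same numbers (8 GiB for the maximum and the 4 GiB minimum I’m using):
echo $((8 * 1024 * 1024 * 1024))
echo $((4 * 1024 * 1024 * 1024))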
Create the file /etc/modprobe.d/zfs.conf and add the following lines, replacing the numerical values with your calculations from above.
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=8589934592
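As an aside, the ARC limits can also be changed on a running system by writing the byte values to the ZFS module parameters, which can be handy for experimenting before committing to the values in zfs.conf (note that lowering zfs_arc_max at runtime may not immediately shrink an ARC that is already larger than the new limit, so the reboot below is still the cleanest way to apply it):
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max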
Optional: It is suggested that adding the line options zfs zfs_flags=0x10 to the /etc/modprobe.d/zfs.conf file may help mitigate the risk of using non-ECC RAM, at the cost of some performance loss (I’m not using this in production, but it is worth mentioning here).
Note: There are some reports of arc_prune causing high CPU usage and soft lockups; there is some good information and suggested mitigations on this GitHub issue.
If your system boots from ZFS, this file is stored on a ZFS volume and so is not available until after ZFS has started, which means you need to update the initial RAM file system so the changes are picked up before the root ZFS volume is mounted.
update-initramfs -u
If you are on an EFI system, you will also need to update the kernel list in the EFI boot menu so the updated initial RAM file system is used.
pve-efiboot-tool refresh
After rebooting the system, use the command arcstat to make sure the changes have been applied; the value in the last column (c) should be the maximum ARC size you’ve set in /etc/modprobe.d/zfs.conf (expressed in GiB).
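Another way to double-check (reported in bytes rather than GiB) is to read the limit straight back from the module parameter:
cat /sys/module/zfs/parameters/zfs_arc_max
That should print 8589934592 if you used the same values as me.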