深入理解Kubernetes - 基于OOMKill的QoS的设计

Overview

阅读完本文，您当了解

Linux oom kill
Kubernetes oom 算法
Kubernetes QoS

本文只是个人理解，如果有大佬觉得不是这样的可以留言一起讨论，参考源码版本为 1.18.20，与高版本相差不大

什么是OOM Kill

当你的Linux机器内存不足时，内核会调用Out of Memory (OOM) killer来释放一些内存。这经常在运行许多内存密集型进程的服务器上遇到。

OOM Killer是如何选择要杀死的进程的？

Linux内核为每个运行的进程分配一个分数，称为 oom_score，==显示在内存紧张时终止该进程的可能性有多大==。该 Score 与进程使用的内存量成比例。 Score 是进程使用内存的百分比乘以10。因此，最大分数是 $100% \times 10 = 1000$。此外，如果一个进程以特权用户身份运行，那么与普通用户进程相比，它的 oom_score 会稍低。

在主发行版内核会将 /proc/sys/vm/overcommit_memory 的默认值设置为零，这意味着进程可以请求比系统中当前可用的内存更多的内存。这是基于以下启发式完成的：分配的内存不会立即使用，并且进程在其生命周期内也不会使用它们分配的所有内存。如果没有过度使用，系统将无法充分利用其内存，从而浪费一些内存。过量使用内存允许系统以更有效的方式使用内存，但存在 OOM 情况的风险。占用内存的程序会耗尽系统内存，使整个系统陷入瘫痪。当内存太低时，这可能会导致这样的情况：即使是单个页面也无法分配给用户进程，从而允许管理员终止适当的任务，或者内核执行重要操作，例如释放内存。在这种情况下，OOM Killer 就会介入，并将该进程识别为牺牲品，以保证系统其余部分的利益。

用户和系统管理员经常询问控制 OOM Killer 行为的方法。为了方便控制，引入了 /proc/<pid>/oom_adj 来防止系统中的重要进程被杀死，并定义进程被杀死的顺序。 oom_adj 的可能值范围为 -17 到 +15。Score 越高，相关进程就越有可能被 OOM-killer Kill。如果 oom_adj 设置为 -17，则 OOM Killer 不会 Kill 该进程。

oom_score 分数为 1 ~ 1000，值越低，程序被杀死的机会就越小。

oom_score 0 表示该进程未使用任何可用内存。
oom_score 1000 表示该进程正在使用 100% 的可用内存，大于1000，也取1000。

谁是糟糕的进程？

在内存不足的情况下选择要被终止的进程是基于其 oom_score 。糟糕进程 Score 被记录在 /proc/<pid>/oom_score 文件中。该值是基于系统损失的最小工作量、回收的大量内存、不终止任何消耗大量内存的无辜进程以及终止的进程数量最小化（如果可能限制在一个）等因素来确定的。糟糕程度得分是使用进程的原始内存大小、其 CPU 时间（utime + stime）、运行时间（uptime - 启动时间）以及其 oom_adj 值计算的。进程使用的内存越多，得分越高。进程在系统中存在的时间越长，得分越小。

列出所有正在运行的进程的OOM Score

bash

1
2
printf 'PID\tOOM Score\tOOM Adj\tCommand\n'
while read -r pid comm; do [ -f /proc/$pid/oom_score ] && [ $(cat /proc/$pid/oom_score) != 0 ] && printf '%d\t%d\t\t%d\t%s\n' "$pid" "$(cat /proc/$pid/oom_score)" "$(cat /proc/$pid/oom_score_adj)" "$comm"; done < <(ps -e -o pid= -o comm=) | sort -k 2nr

如何检查进程是否已被 OOM 终止

最简单的方法是查看grep系统日志。在 Ubuntu 中：grep -i kill /var/log/syslog。如果进程已被终止，您可能会得到类似的结果

bash

1
my_process invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Kubernetes的QoS是如何设计的

Kubernetes 中 Pod 存在一个 “服务质量等级” (QoS)，它保证了Kubernetes 在 Node 资源不足时使用 QoS 类来就驱逐 Pod 作出决定。这个 QoS 就是基于 OOM Kill Score 和 Adj 来设计的。

对于用户来讲，Kubernetes Pod 的 QoS 有三类，这些设置是被自动设置的，除此之外还有两种单独的等级：“Worker 组件”，总共 Pod QoS 的级别有5种

Kubelet
KubeProxy
Guaranteed
Besteffort
Burstable

这些在 pkg/kubelet/qos/policy.go 中可以看到，其中 Burstable 属于一个动态的级别。

1
2
3
4
5
6
7
8
const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)

其中最重要的分数就是 Burstable，这保证了驱逐的优先级，他的算法为：$1000 - \frac{1000 \times Request}{memoryCapacity}$ ，Request 为 Deployment 这类清单中配置的 Memory Request 的部分，memoryCapacity 则为 Node 的内存数量。

例如 Node 为 64G，Pod Request 值配置了 2G，那么最终 oom_score_adj 的值为 $1000 - \frac{1000 \times Request}{memoryCapacity} = 1000 - \frac{1000\times2}{64} = 968$

这部分可以在下面代码中看到，其中算出的值将被写入 /proc/{pid}/oom_score_adj 文件内

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	if types.IsNodeCriticalPod(pod) {
		// Only node critical pod should be the last to get killed.
		return guaranteedOOMScoreAdj
	}

	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
		if cs, ok := podutil.GetContainerStatus(pod.Status.ContainerStatuses, container.Name); ok {
			memoryRequest = cs.AllocatedResources.Memory().Value()
		}
	}
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}

到此可以了解到 Pod QoS 级别为

Kubelet = KubeProxy = -999
Guaranteed = -998
1000(Besteffort) > Burstable > -998 (Guaranteed)
Besteffort = 1000

那么在当 Node 节点内存不足时，发生驱逐的条件就会根据 oom_score_adj 完成，但当 Pod 中程序使用内存达到了 Limits 限制，此时的OOM Killed和上面阐述的无关。

Reference

^[1] Taming the OOM killer

^[2] How does the OOM killer decide which process to kill first?

Overview#

什么是OOM Kill#

OOM Killer是如何选择要杀死的进程的？#

谁是糟糕的进程？#

列出所有正在运行的进程的OOM Score#

如何检查进程是否已被 OOM 终止#

Kubernetes的QoS是如何设计的#

Reference#

相关阅读