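Today our GKE version reached EOL and the kubelet was automatically upgraded to 1.28. After the upgrade, Java applications no longer recognized the limits in the resource manifest at startup and were OOMKilled in large numbers.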

The Deployment manifest already sets resource limits, for example:

yaml
resources:
    limits:
      memory: "1Gi"
    requests:
      memory: "600Mi"

JAVA_OPTS sizes the heap as a percentage of available memory:

bash
-XX:+UseContainerSupport -XX:InitialRAMPercentage=70.0 -XX:MaxRAMPercentage=70.0

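However, after startup the JVM did not recognize the limit. Because the container uses a distroless image, I logged in to the node with gcloud and inspected the JVM state from the host: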

bash
project-20220325-asia-east-2-pool-221ab289-hgnf ~ # nsenter -t 274655 --mount --uts --ipc --net --pid /opt/java/openjdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintContainerInfo -version
OSContainer::init: Initializing Container Support
Detected cgroups v2 unified hierarchy
Path to /cpu.max is /sys/fs/cgroup/system.slice/sshd.service/cpu.max
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.max failed, No such file or directory
CPU Quota is: -2
Path to /cpu.max is /sys/fs/cgroup/system.slice/sshd.service/cpu.max
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.max failed, No such file or directory
CPU Period is: -2
Path to /cpu.weight is /sys/fs/cgroup/system.slice/sshd.service/cpu.weight
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.weight failed, No such file or directory
Raw value for CPU Shares is: -2
OSContainer::active_processor_count: 16
CgroupSubsystem::active_processor_count (cached): 16
total physical memory: 67435528192
Path to /memory.max is /sys/fs/cgroup/system.slice/sshd.service/memory.max
Open of file /sys/fs/cgroup/system.slice/sshd.service/memory.max failed, No such file or directory
Memory Limit is: -2
container memory limit failed: -2, using host value 67435528192
CgroupSubsystem::active_processor_count (cached): 16
Path to /cpu.max is /sys/fs/cgroup/system.slice/sshd.service/cpu.max
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.max failed, No such file or directory
CPU Quota is: -2
Path to /cpu.max is /sys/fs/cgroup/system.slice/sshd.service/cpu.max
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.max failed, No such file or directory
CPU Period is: -2
Path to /cpu.weight is /sys/fs/cgroup/system.slice/sshd.service/cpu.weight
Open of file /sys/fs/cgroup/system.slice/sshd.service/cpu.weight failed, No such file or directory
Raw value for CPU Shares is: -2
OSContainer::active_processor_count: 16
openjdk version "1.8.0_422"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_422-b05)
OpenJDK 64-Bit Server VM (Temurin)(build 25.422-b05, mixed mode)
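
The log above ends with the JVM "using host value 67435528192". When a JVM sizes its heap from host memory instead of the 1Gi limit, the 70% setting produces a heap far larger than the container is allowed, which is consistent with the OOMKills. A rough sketch of the arithmetic follows; the helper name heap_bytes is hypothetical, and integer division only approximates the JVM's own rounding.

```shell
#!/bin/sh
# heap_bytes MEMORY_BYTES PERCENT
# Hypothetical helper (not a JVM API): integer approximation of the heap
# that a RAM-percentage flag would yield for a given amount of visible memory.
heap_bytes() {
  echo $(( $1 * $2 / 100 ))
}

limit=$(( 1024 * 1024 * 1024 ))   # the 1Gi limit from the Deployment manifest
host=67435528192                  # host memory reported in the log above

echo "heap from container limit: $(( $(heap_bytes "$limit" 70) / 1048576 )) MiB"
echo "heap from host memory:     $(( $(heap_bytes "$host" 70) / 1048576 )) MiB"
```

With the limit detected, the heap ceiling is roughly 716 MiB; with the host fallback it is about 44 GiB, so the process can grow past 1Gi long before the JVM considers the heap full.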

The affected hosts:

bash
project-20220325-asia-east-2-pool-221ab289-hgnf ~ # nsenter -t 281811 --mount --uts --ipc --net --pid /usr/local/jdk1.8.0_351/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintContainerInfo -version
OSContainer::init: Initializing Container Support
Required cgroup memory subsystem not found
java version "1.8.0_351"
Java(TM) SE Runtime Environment (build 1.8.0_351-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.351-b10, mixed mode)

project-20220325-asia-east-2-145239dd-0fzj ~ # nsenter -t 1517879 --mount --uts --ipc --net --pid /usr/local/jdk1.8.0_351/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintContainerInfo -version
OSContainer::init: Initializing Container Support
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
OSContainer::active_processor_count: 16
OSContainer::active_processor_count (cached): 16
container memory limit failed: -2, using host value
container memory limit failed: -2, using host value
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
OSContainer::active_processor_count: 16
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
subsystem_file_line_contents: subsystem path is NULL
OSContainer::active_processor_count: 16
java version "1.8.0_351"
Java(TM) SE Runtime Environment (build 1.8.0_351-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.351-b10, mixed mode)
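
The Temurin log earlier reports a cgroups v2 unified hierarchy, while the older JDK here cannot find the memory subsystem at all. To confirm which cgroup version a node actually mounts, one common check is the filesystem type of /sys/fs/cgroup. This is a minimal sketch that assumes GNU stat is available on the node; cgroup_version is a hypothetical helper name.

```shell
#!/bin/sh
# cgroup_version FSTYPE
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup
# to a cgroup version label.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy per-controller hierarchy
    *)         echo "unknown" ;;
  esac
}

# On the node: read the filesystem type mounted at /sys/fs/cgroup.
# Falls back to "unavailable" if stat does not support these options.
fstype=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo "unavailable")
echo "cgroup version: $(cgroup_version "$fstype") (fstype: $fstype)"
```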

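The JDK versions differ. Searching by JDK version showed that older JDK 8 builds cannot detect the memory limit under cgroup v2. The fix only arrived in 8u381 and later, while the affected workloads were running 8u351.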
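
To spot other images at risk, the update number can be compared against that threshold. This is only a sketch: jdk8_update is a hypothetical helper, it expects a bare version string such as 1.8.0_351 (no build suffix), and the 381 cutoff is taken from the finding above as an assumption that may differ between vendor builds.

```shell
#!/bin/sh
# jdk8_update VERSION
# Hypothetical helper: extract the update number from a JDK 8
# version string of the form "1.8.0_351".
jdk8_update() {
  echo "${1##*_}"   # strip everything up to and including the last "_"
}

# Cutoff cited in the text (8u381); treated here as an assumption.
min_update=381

# The two versions seen in the logs above.
for v in 1.8.0_351 1.8.0_422; do
  if [ "$(jdk8_update "$v")" -ge "$min_update" ]; then
    echo "$v: at or above 8u$min_update"
  else
    echo "$v: below 8u$min_update, cgroup v2 memory limit may not be detected"
  fi
done
```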

Detected container memory limit may exceed physical machine memory

Reference