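This article records memory problems encountered while running a Ceph cluster, and collects reference material for estimating how much memory a Ceph cluster needs.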

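OSD memory requirements

Estimating the hardware a Ceph OSD needs is a prerequisite for both sizing and tuning a cluster. Two reliable references for sizing OSD memory are summarized here.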

IBM Storage Ceph

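IBM Storage Ceph publishes a list of minimum recommended system configurations for running Ceph [1]. It is aimed mainly at containerized Ceph clusters, and in my view it is a useful reference when tuning your own cluster.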

| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd-container | Processor | 1x AMD64 or Intel 64 CPU CORE per OSD container |
| | RAM | Minimum of 5 GB of RAM per OSD container |
| | OS Disk | 1x OS disk per host |
| | OSD Storage | 1x storage drive per OSD container. Cannot be shared with OS Disk. |
| | block.db | Optional, but IBM recommended, 1x SSD or NVMe or Optane partition or lvm per daemon. Sizing is 4% of block.data for BlueStore for object, file, and mixed workloads and 1% of block.data for the BlueStore for Block Device, Openstack cinder, and Openstack cinder workloads. |
| | block.wal | Optionally, 1x SSD or NVMe or Optane partition or logical volume per daemon. Use a small size, for example 10 GB, and only if it’s faster than the block.db device. |
| | Network | 2x 10 GB Ethernet NICs |
| ceph-mon-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mon-container |
| | RAM | 3 GB per mon-container |
| | Disk Space | 10 GB per mon-container, 50 GB Recommended |
| | Monitor Disk | Optionally, 1x SSD disk for Monitor rocksdb data |
| | Network | 2x 1 GB Ethernet NICs, 10 GB Recommended |
| | Prometheus | 20 GB to 50 GB under /var/lib/ceph/ directory created as a separate file system to protect the contents under /var/ directory. |
| ceph-mgr-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mgr-container |
| | RAM | 3 GB per mgr-container |
| | Network | 2x 1 GB Ethernet NICs, 10 GB Recommended |
| ceph-radosgw-container | Processor | 1x AMD64 or Intel 64 CPU CORE per radosgw-container |
| | RAM | 1 GB per daemon |
| | Disk Space | 5 GB per daemon |
| | Network | 1x 1 GB Ethernet NICs |
| ceph-mds-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mds-container |
| | RAM | 3 GB per mds-container. This number is highly dependent on the configurable MDS cache size. The RAM requirement is typically twice as much as the amount set in the mds_cache_memory_limit configuration setting. Note also that this is the memory for your daemon, not the overall system memory. |
| | Disk Space | 2 GB per mds-container, plus considering any additional space required for possible debug logging, 20 GB is a good start. |
| | Network | 2x 1 GB Ethernet NICs, 10 GB Recommended. Note that this is the same network as the OSD containers. If you have a 10 GB network on your OSDs you should use the same on your MDS so that the MDS is not disadvantaged when it comes to latency. |
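As a rough sketch of how the per-container minimums in the table add up on one host, the shell functions below sum the RAM figures (5 GB per OSD, 3 GB each per mon, mgr and mds, 1 GB per radosgw) and apply the block.db percentages. The function names and argument order are made up for illustration and are not part of any Ceph tooling; the RAM total also leaves out whatever the operating system itself needs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the IBM minimum RAM figures (in GB) for one host.
# Args: number of OSD, mon, mgr, radosgw and mds containers on the host.
ibm_min_ram_gb() {
    local osds=$1 mons=$2 mgrs=$3 rgws=$4 mdss=$5
    echo $(( osds * 5 + mons * 3 + mgrs * 3 + rgws * 1 + mdss * 3 ))
}

# Hypothetical helper: block.db size in GB as a percentage of block.data.
# Pass 4 for object/file/mixed workloads, 1 for block-device workloads.
block_db_gb() {
    local data_gb=$1 percent=$2
    echo $(( data_gb * percent / 100 ))
}

# Example: a host with 3 OSDs plus one mon, mgr, radosgw and mds container.
ibm_min_ram_gb 3 1 1 1 1
# Example: block.db for an 1800 GB data device under a block-device workload.
block_db_gb 1800 1
```

By these minimums, a host carrying three OSDs and one of each other daemon would call for 25 GB of RAM before counting the operating system.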

Hardware Recommendations

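The upstream Ceph project also publishes hardware recommendations. The key parameters are stated clearly, although the guidance on actual scale is rather vague; it is still a useful reference. Note that the recommended hardware differs from one Ceph release to another.

The table below lists the minimum hardware recommendations for Ceph Nautilus [2].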

| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd | Processor | 1x 64-bit AMD-64; 1x 32-bit ARM dual-core or better |
| | RAM | ~1GB for 1TB of storage per daemon |
| | Volume Storage | 1x storage drive per daemon |
| | Journal | 1x SSD partition per daemon (optional) |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mon | Processor | 1x 64-bit AMD-64; 1x 32-bit ARM dual-core or better |
| | RAM | 1 GB per daemon |
| | Disk Space | 10 GB per daemon |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mds | Processor | 1x 64-bit AMD-64 quad-core; 1x 32-bit ARM quad-core |
| | RAM | 1 GB minimum per daemon |
| | Disk Space | 1 MB per daemon |
| | Network | 2x 1GB Ethernet NICs |

The table below lists the official minimum recommendations for the Reef release [3].

| Process | Criteria | Bare Minimum and Recommended |
|---|---|---|
| ceph-osd | Processor | 1 core minimum, 2 recommended; 1 core per 200-500 MB/s throughput; 1 core per 1000-3000 IOPS. Results are before replication. Results may vary across CPU and drive models and Ceph configuration (erasure coding, compression, etc). ARM processors specifically may require more cores for performance. SSD OSDs, especially NVMe, will benefit from additional cores per OSD. Actual performance depends on many factors including drives, net, and client throughput and latency. Benchmarking is highly recommended. |
| | RAM | 4GB+ per daemon (more is better); 2-4GB may function but may be slow; less than 2GB is not recommended |
| | Storage Drives | 1x storage drive per OSD |
| | DB/WAL (optional) | 1x SSD partition per HDD OSD; 4-5x HDD OSDs per DB/WAL SATA SSD; <= 10 HDD OSDs per DB/WAL NVMe SSD |
| | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
| ceph-mon | Processor | 2 cores minimum |
| | RAM | 5GB+ per daemon (large / production clusters need more) |
| | Storage | 100 GB per daemon, SSD is recommended |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
| ceph-mds | Processor | 2 cores minimum |
| | RAM | 2GB+ per daemon (more for production) |
| | Disk Space | 1 GB per daemon |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
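The two upstream tables size OSD RAM differently: Nautilus scales it with drive capacity (~1 GB per 1 TB), while Reef gives a flat 4 GB+ per daemon. The sketch below prints both figures side by side for a given drive size. The function name is hypothetical, and reading "~1 GB per 1 TB" as 1 MB of RAM per GB of capacity is a simplifying assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Nautilus and Reef OSD RAM guidelines
# for one drive. Arg: drive capacity in GB.
compare_osd_ram() {
    local drive_gb=$1
    # Nautilus: ~1 GB RAM per 1 TB of storage, taken here as 1 MB per GB.
    local nautilus_mb=$drive_gb
    # Reef: flat 4 GB+ per daemon, regardless of drive size.
    local reef_mb=$(( 4 * 1024 ))
    echo "drive=${drive_gb}GB nautilus~${nautilus_mb}MB reef>=${reef_mb}MB"
}

compare_osd_ram 700
compare_osd_ram 1800
```

For small drives the capacity-based rule yields well under 1 GB, far below the flat Reef figure, which is one reason the newer guidance is the safer planning baseline.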

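Examples from our Ceph environments

The first example is a Ceph cluster backing OpenStack, used mainly for RBD. The machines have a mix of 1.8 TB and 900 GB drives and 512 GB of RAM. Each OSD shows about 0.3% memory usage, with a resident size (RES) of 1.6 to 1.7 GB, that is, a little under 2 GB per OSD.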

PID USER      PR  NI    VIRT    RES    SHR S  %CPU    %MEM     TIME+     COMMAND
225398 ceph   20   0 3849464   1.7g  22200 S   11.9    0.3    14501:14   ceph-osd    
224860 ceph   20   0 3612380   1.7g  22424 S   9.2     0.3    12697:04   ceph-osd
223902 ceph   20   0 3340844   1.7g  22172 S   8.6     0.3    21003:18   ceph-osd   
223440 ceph   20   0 3213884   1.7g  22288 S   5.9     0.3     8548:00   ceph-osd
224368 ceph   20   0 3292848   1.6g  22204 S   4.0     0.3     8655:56   ceph-osd     
222889 ceph   20   0 3231012   1.7g  22180 S   3.3     0.3     8190:03   ceph-osd

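The second example is a Ceph cluster serving application workloads, used mainly for object storage. Each machine has 8 cores and 16 GB of RAM, the drives are 700 GB each, and each node hosts at most three OSDs. Each OSD uses roughly 1.8 to 2 GB of memory.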

Ceph node 01

# ceph node 01
$ ps aux --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph        1702  0.4 27.9 10128296 4550760 ?    Ssl  May03 919:18 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node01 --setuser ceph --setgroup ceph
ceph        1721  0.6 12.8 3318456 2088704 ?     Ssl  May03 1216:59 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph        1983  0.6 12.3 3358788 2012844 ?     Ssl  May03 1273:25 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph        1991  0.9 11.7 3451788 1912008 ?     Ssl  May03 1719:04 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph        1709  0.5  7.4 1646276 1212576 ?     Ssl  May03 1047:48 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph       18979  1.0  4.5 1330064 742680 ?      Ssl  May03 1932:51 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph      529617  3.7  4.4 1909588 721492 ?      Ssl  Jul15 3140:39 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root         801  0.0  0.6 182536 98516 ?        Ss   May03 105:28 /usr/lib/systemd/systemd-journald
root        1704  0.0  0.3 701284 50132 ?        Ssl  May03  53:48 /usr/sbin/rsyslogd -n

Ceph node02

$ ps aux --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph       28650  1.4 12.8 3958988 2104296 ?     Ssl   2023 6214:07 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser ceph --setgroup ceph
ceph      163854  1.4 12.7 3782156 2096396 ?     Ssl   2023 6092:28 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
ceph     3801660  1.5 11.9 3389284 1959812 ?     Ssl  Jul10 1384:08 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph
root     3348820  0.1  0.1 510848 27732 ?        Sl   Jun27 171:24 /var/ossec/bin/wazuh-modulesd
root        1045  0.0  0.1 574296 21468 ?        Ssl   2023  85:44 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
polkitd      670  0.0  0.0 612348 14992 ?        Ssl   2023  10:40 /usr/lib/polkit-1/polkitd --no-debug

Ceph node03

$ ps aux --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph     1942206  0.9 12.8 4214720 2092280 ?     Ssl   2023 7866:23 /usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph
ceph        2824  0.8 12.6 4274848 2051800 ?     Ssl   2022 7205:58 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph     2802022  0.7 12.5 3831320 2047440 ?     Ssl   2023 4078:51 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph        1693  0.7  4.7 1439428 771228 ?      Ssl   2022 6767:46 /usr/bin/ceph-mon -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
ceph     1058494  0.3  2.2 7492512 367288 ?      Ssl   2023 3388:44 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node03 --setuser ceph --setgroup ceph
ceph     1812870  2.6  0.8 970928 133116 ?       Ssl   Mar21 6749:43 /usr/bin/ceph-mgr -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
root         778  0.0  0.1  76412 28084 ?        Ss    2022 113:06 /usr/lib/systemd/systemd-journald
ceph        1739  0.4  0.1 384760 28064 ?        Ssl   2022 4086:33 /usr/bin/ceph-mds -f --cluster ceph --id node03 --setuser ceph --setgroup ceph

Ceph node04 (this node hosts only one OSD)

$ ps aux --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph       83779  1.0 12.7 3911168 2087332 ?     Ssl  Jan23 3473:50 /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph
root        6568  0.0  0.0 113020  7808 ?        Ss   Jan22   0:00 /usr/sbin/sshd -D
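To turn listings like these into one number per node, the RSS column can be summed with awk. The sketch below assumes the `ps aux` column layout shown above (RSS in KiB in column 6, the command path in column 11); the function name is made up for illustration, and the sample input is the three ceph-osd lines from node02.

```shell
#!/usr/bin/env bash
# Sum resident memory (RSS, column 6 of `ps aux`, in KiB) of ceph-osd processes.
# Reads `ps aux` output on stdin; prints "<osd count> <total RSS in MiB>".
sum_osd_rss() {
    awk '$11 ~ /ceph-osd$/ { n++; kib += $6 }
         END { printf "%d %d\n", n, kib / 1024 }'
}

# Sample input: the three ceph-osd lines from node02.
sample='ceph       28650  1.4 12.8 3958988 2104296 ?     Ssl   2023 6214:07 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser ceph --setgroup ceph
ceph      163854  1.4 12.7 3782156 2096396 ?     Ssl   2023 6092:28 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
ceph     3801660  1.5 11.9 3389284 1959812 ?     Ssl  Jul10 1384:08 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph'

printf '%s\n' "$sample" | sum_osd_rss   # prints: 3 6016
# On a live node: ps aux | sum_osd_rss
```

For node02 this comes to about 6 GiB for three OSDs, in line with the per-OSD figure of roughly 2 GB.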

Configuration requirements

An OSD that runs without limits will consume all available memory, so a badly configured data node can also trigger the OOM killer.

The OSDs are designed to consume all the available memory if they are run without limits. So it is recommended to apply the resource limits, and the OSDs will stay within the bounds you set. Typically 4GB is sufficient per OSD. [4]

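OSD memory usage peaks during recovery. If there is not enough RAM available, OSD performance degrades significantly, and the daemons may even crash or be killed by the Linux OOM killer. [5]

On clusters deployed with cephadm, the following command shows memory usage: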

ceph orch ps

Usually only two types of daemon have memory limits: mon and osd. These limits are controlled by the following configuration options:

sudo ceph config get mon mon_memory_target  # in bytes
sudo ceph config get mon mon_memory_autotune
sudo ceph config get osd osd_memory_target  # in bytes
sudo ceph config get osd osd_memory_target_autotune

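The memory limit shown by ceph orch ps is not the same thing as the OSD's memory target. BlueStore tries to keep the OSD's heap memory usage under a designated target size, which is set with the osd_memory_target configuration option.

The osd_memory_target option sets OSD memory based on the RAM available in the system. It takes effect when TCMalloc is configured as the memory allocator and the bluestore_cache_autotune option in BlueStore is set to true.

View the OSD configuration of an existing cluster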

# Show osd_memory_target for all OSDs in the storage cluster
sudo ceph config get osd osd_memory_target
# Show osd_memory_target for a specific OSD
sudo ceph config get osd.0 osd_memory_target

Set osd_memory_target for the cluster's OSDs

# Set osd_memory_target for all OSDs in the storage cluster (VALUE is in bytes)
ceph config set osd osd_memory_target VALUE
# Set osd_memory_target for a specific OSD; the id in osd.id is the OSD's ID
ceph config set osd.id osd_memory_target VALUE
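Since VALUE is given in bytes, it helps to derive it from the host's RAM rather than type it by hand. The sketch below splits what remains after a reserve evenly across the OSDs on a host. The helper name, the even split, and the 7 GiB reserve in the example are all assumptions for illustration, not a Ceph-recommended formula. The script only prints the resulting command so it can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a host's RAM evenly across its OSDs.
# Args: total host RAM in GiB, GiB reserved for the OS and other daemons,
#       number of OSDs on the host.
# Prints a per-OSD osd_memory_target in bytes.
per_osd_target_bytes() {
    local host_gib=$1 reserved_gib=$2 osds=$3
    echo $(( (host_gib - reserved_gib) * 1024 * 1024 * 1024 / osds ))
}

# Example: a 16 GiB node with 3 OSDs, keeping 7 GiB (an assumed figure)
# for the OS and the mon, mgr, mds and radosgw daemons.
target=$(per_osd_target_bytes 16 7 3)

# Print the command instead of executing it.
echo "ceph config set osd osd_memory_target $target"
```

With these example inputs the target works out to 3221225472 bytes (3 GiB) per OSD. Keep in mind that osd_memory_target is a target the OSD tries to stay under, not a hard limit, so some headroom is still needed for recovery peaks.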

Cases found online

Below are two cases found online in which OSD memory grew without bound:

  • osd(s) with unlimited ram growth [6]
  • How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs” [7]

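Inspecting memory

Use the heap statistics command. It does not require the profiler to be running and does not dump heap allocation information to a file.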

ceph tell osd.0 heap stats

Use the memory pool dump command:

ceph daemon osd.NNN dump_mempools

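Use google-perftools. This approach relies on the heap profiler to inspect the running daemon; pprof then reads the heap profile files that the profiler dumps.

google-pprof --text {path-to-daemon}  {log-path/filename}
# For example (the binary is named pprof on some distributions and google-pprof on others)
pprof --text /usr/bin/ceph-mon /var/log/ceph/mon.node1.profile.0001.heap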

Reference

[1] Minimum hardware considerations

[2] minimum-hardware-recommendations nautilus

[3] minimum-hardware-recommendations reef

[4] Excessive OSD memory usage #12078

[5] Ceph OSD troubleshooting: insufficient memory

[6] osd(s) with unlimited ram growth

[7] How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs”

[8] Memory Profiling