This post records memory problems encountered while running a Ceph cluster, along with the reference material I used to estimate memory requirements for Ceph clusters.
OSD memory requirements
Knowing how to estimate the hardware a Ceph OSD needs is a prerequisite for cluster sizing and optimization. Two reliable references for estimating OSD memory configuration are summarized below.
IBM Storage Ceph
IBM Storage Ceph publishes a list of minimum recommended system configurations for running Ceph [1]. I find it a useful reference when tuning my own clusters; it mainly targets containerized Ceph deployments. A quick block.db sizing sketch follows the table.
| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd-container | Processor | 1x AMD64 or Intel 64 CPU CORE per OSD container |
| | RAM | Minimum of 5 GB of RAM per OSD container |
| | OS Disk | 1x OS disk per host |
| | OSD Storage | 1x storage drive per OSD container. Cannot be shared with OS Disk. |
| | block.db | Optional, but IBM recommended, 1x SSD or NVMe or Optane partition or lvm per daemon. Sizing is 4% of block.data for BlueStore for object, file, and mixed workloads and 1% of block.data for BlueStore for Block Device, OpenStack cinder, and OpenStack glance workloads. |
| | block.wal | Optionally, 1x SSD or NVMe or Optane partition or logical volume per daemon. Use a small size, for example 10 GB, and only if it's faster than the block.db device. |
| | Network | 2x 10 GB Ethernet NICs |
| ceph-mon-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mon-container |
| | RAM | 3 GB per mon-container |
| | Disk Space | 10 GB per mon-container, 50 GB recommended |
| | Monitor Disk | Optionally, 1x SSD disk for Monitor rocksdb data |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended |
| | Prometheus | 20 GB to 50 GB under the /var/lib/ceph/ directory, created as a separate file system to protect the contents of the /var/ directory |
| ceph-mgr-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mgr-container |
| | RAM | 3 GB per mgr-container |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended |
| ceph-radosgw-container | Processor | 1x AMD64 or Intel 64 CPU CORE per radosgw-container |
| | RAM | 1 GB per daemon |
| | Disk Space | 5 GB per daemon |
| | Network | 1x 1 GB Ethernet NIC |
| ceph-mds-container | Processor | 1x AMD64 or Intel 64 CPU CORE per mds-container |
| | RAM | 3 GB per mds-container. This number is highly dependent on the configurable MDS cache size; the RAM requirement is typically twice the amount set in the mds_cache_memory_limit configuration setting. Note also that this is the memory for the daemon itself, not the overall system memory. |
| | Disk Space | 2 GB per mds-container, plus any additional space required for possible debug logging; 20 GB is a good start. |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended. Note that this is the same network as the OSD containers. If you have a 10 GB network on your OSDs you should use the same on your MDS so that the MDS is not disadvantaged when it comes to latency. |
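The block.db sizing rule above (4% of block.data for object, file, and mixed workloads, 1% for block workloads) boils down to simple arithmetic. A minimal sketch, assuming a hypothetical 1.8 TB data device; the capacity value is only an example:

```bash
# Rough block.db sizing from block.data, following the 4% / 1% guidance above.
BLOCK_DATA_GB=1800   # assumed capacity of the block.data device, in GB

# Object, file, and mixed workloads: 4% of block.data
echo "block.db for object/file/mixed: $((BLOCK_DATA_GB * 4 / 100)) GB"

# Block Device (RBD) workloads: 1% of block.data
echo "block.db for block workloads:   $((BLOCK_DATA_GB * 1 / 100)) GB"
```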
Hardware Recommendations
The Ceph project also publishes its own hardware recommendations. The key parameters are stated clearly, although the cluster scale they assume is somewhat ambiguous; they are still a useful reference, and note that the recommended hardware differs between Ceph releases.
The table below lists the minimum hardware recommendations for Ceph Nautilus [2].
| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd | Processor | 1x 64-bit AMD-64; 1x 32-bit ARM dual-core or better |
| | RAM | ~1GB for 1TB of storage per daemon |
| | Volume Storage | 1x storage drive per daemon |
| | Journal | 1x SSD partition per daemon (optional) |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mon | Processor | 1x 64-bit AMD-64; 1x 32-bit ARM dual-core or better |
| | RAM | 1 GB per daemon |
| | Disk Space | 10 GB per daemon |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mds | Processor | 1x 64-bit AMD-64 quad-core; 1x 32-bit ARM quad-core |
| | RAM | 1 GB minimum per daemon |
| | Disk Space | 1 MB per daemon |
| | Network | 2x 1GB Ethernet NICs |
The table below lists the official minimum recommendations for the Reef release [3]; a rough per-node RAM budget example follows it.
| Process | Criteria | Bare Minimum and Recommended |
|---|---|---|
| ceph-osd | Processor | 1 core minimum, 2 recommended. 1 core per 200-500 MB/s throughput; 1 core per 1000-3000 IOPS. Results are before replication. Results may vary across CPU and drive models and Ceph configuration (erasure coding, compression, etc). ARM processors specifically may require more cores for performance. SSD OSDs, especially NVMe, will benefit from additional cores per OSD. Actual performance depends on many factors including drives, network, and client throughput and latency. Benchmarking is highly recommended. |
| | RAM | 4GB+ per daemon (more is better). 2-4GB may function but may be slow. Less than 2GB is not recommended. |
| | Storage Drives | 1x storage drive per OSD |
| | DB/WAL (optional) | 1x SSD partition per HDD OSD; 4-5x HDD OSDs per DB/WAL SATA SSD; <= 10 HDD OSDs per DB/WAL NVMe SSD |
| | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
| ceph-mon | Processor | 2 cores minimum |
| | RAM | 5GB+ per daemon (large / production clusters need more) |
| | Storage | 100 GB per daemon, SSD is recommended |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
| ceph-mds | Processor | 2 cores minimum |
| | RAM | 2GB+ per daemon (more for production) |
| | Disk Space | 1 GB per daemon |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
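Combining the Reef figures gives a rough per-node RAM budget. A minimal sketch, assuming a hypothetical node running 12 OSDs plus one colocated mon and one mds; all counts are illustrative, not a recommendation:

```bash
# Back-of-the-envelope RAM budget per node, using the Reef minimums above.
OSDS=12; MONS=1; MDSS=1   # assumed daemon counts on this node

OSD_GB=4   # 4GB+ per OSD daemon
MON_GB=5   # 5GB+ per mon daemon
MDS_GB=2   # 2GB+ per mds daemon

echo "minimum RAM for Ceph daemons: $((OSDS*OSD_GB + MONS*MON_GB + MDSS*MDS_GB)) GB (plus OS and page-cache headroom)"
```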
Examples from our Ceph environments
Memory usage of Ceph OSDs backing an OpenStack environment, used mainly for RBD. The machines have a mix of 1.8 TB and 900 GB drives and 512 GB of RAM. Each OSD sits at about 0.3% memory utilization, roughly 1.7-2 GB per OSD (a command for summing this per node follows the listing).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
225398 ceph 20 0 3849464 1.7g 22200 S 11.9 0.3 14501:14 ceph-osd
224860 ceph 20 0 3612380 1.7g 22424 S 9.2 0.3 12697:04 ceph-osd
223902 ceph 20 0 3340844 1.7g 22172 S 8.6 0.3 21003:18 ceph-osd
223440 ceph 20 0 3213884 1.7g 22288 S 5.9 0.3 8548:00 ceph-osd
224368 ceph 20 0 3292848 1.6g 22204 S 4.0 0.3 8655:56 ceph-osd
222889 ceph 20 0 3231012 1.7g 22180 S 3.3 0.3 8190:03 ceph-osd
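The per-OSD figures above come straight from top; to sum the resident memory of every ceph-osd on a node in one go, something like the following works (a sketch using procps ps, where RSS is reported in KiB):

```bash
# Sum the resident set size of all ceph-osd processes on this node (ps reports RSS in KiB).
ps -C ceph-osd -o rss= | awk '{sum += $1} END {printf "total OSD RSS: %.1f GiB\n", sum/1024/1024}'
```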
Ceph OSDs serving production workloads, mainly object storage. The machines have 8 cores and 16 GB of RAM with 700 GB drives; each OSD uses roughly 1.8-2 GB of memory, and each node runs at most three OSDs.
Ceph node 01
# ceph node 01
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1702 0.4 27.9 10128296 4550760 ? Ssl May03 919:18 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node01 --setuser ceph --setgroup ceph
ceph 1721 0.6 12.8 3318456 2088704 ? Ssl May03 1216:59 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 1983 0.6 12.3 3358788 2012844 ? Ssl May03 1273:25 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 1991 0.9 11.7 3451788 1912008 ? Ssl May03 1719:04 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph 1709 0.5 7.4 1646276 1212576 ? Ssl May03 1047:48 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 18979 1.0 4.5 1330064 742680 ? Ssl May03 1932:51 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 529617 3.7 4.4 1909588 721492 ? Ssl Jul15 3140:39 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root 801 0.0 0.6 182536 98516 ? Ss May03 105:28 /usr/lib/systemd/systemd-journald
root 1704 0.0 0.3 701284 50132 ? Ssl May03 53:48 /usr/sbin/rsyslogd -n
Ceph node02
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 28650 1.4 12.8 3958988 2104296 ? Ssl 2023 6214:07 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser ceph --setgroup ceph
ceph 163854 1.4 12.7 3782156 2096396 ? Ssl 2023 6092:28 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
ceph 3801660 1.5 11.9 3389284 1959812 ? Ssl Jul10 1384:08 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph
root 3348820 0.1 0.1 510848 27732 ? Sl Jun27 171:24 /var/ossec/bin/wazuh-modulesd
root 1045 0.0 0.1 574296 21468 ? Ssl 2023 85:44 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
polkitd 670 0.0 0.0 612348 14992 ? Ssl 2023 10:40 /usr/lib/polkit-1/polkitd --no-debug
Ceph node03
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1942206 0.9 12.8 4214720 2092280 ? Ssl 2023 7866:23 /usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph
ceph 2824 0.8 12.6 4274848 2051800 ? Ssl 2022 7205:58 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph 2802022 0.7 12.5 3831320 2047440 ? Ssl 2023 4078:51 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph 1693 0.7 4.7 1439428 771228 ? Ssl 2022 6767:46 /usr/bin/ceph-mon -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
ceph 1058494 0.3 2.2 7492512 367288 ? Ssl 2023 3388:44 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node03 --setuser ceph --setgroup ceph
ceph 1812870 2.6 0.8 970928 133116 ? Ssl Mar21 6749:43 /usr/bin/ceph-mgr -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
root 778 0.0 0.1 76412 28084 ? Ss 2022 113:06 /usr/lib/systemd/systemd-journald
ceph 1739 0.4 0.1 384760 28064 ? Ssl 2022 4086:33 /usr/bin/ceph-mds -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
Ceph node04, which runs only one OSD
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 83779 1.0 12.7 3911168 2087332 ? Ssl Jan23 3473:50 /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph
root 6568 0.0 0.0 113020 7808 ? Ss Jan22 0:00 /usr/sbin/sshd -D
Why limits need to be configured
An OSD running without limits will consume all available memory, so a misconfigured data node can also trigger the OOM killer:
The OSDs are designed to consume all the available memory if they are run without limits. So it is recommended to apply the resource limits, and the OSDs will stay within the bounds you set. Typically 4GB is sufficient per OSD. [4]
When OSDs go through recovery, their memory utilization spikes. If the available RAM is insufficient, OSD performance degrades significantly and the daemons may even crash or be killed by the Linux OOM killer. [5]
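To check whether the OOM killer has already been killing OSDs on a node, the kernel log can be searched. A hedged sketch; the exact message wording varies across kernel versions, so the grep is deliberately loose:

```bash
# Search the kernel log for OOM events that involve ceph-osd.
journalctl -k --no-pager | grep -iE 'out of memory|oom-kill|killed process' | grep -i ceph-osd

# On hosts without persistent journald, the kernel ring buffer works too.
dmesg -T | grep -iE 'oom|killed process' | grep -i ceph-osd
```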
On a cluster deployed with cephadm, memory usage can be inspected with:
ceph orch ps
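For example, the listing can be narrowed to OSD daemons; the MEM USE and MEM LIM columns show current usage versus the configured limit:

```bash
# Show only OSD daemons; compare the MEM USE and MEM LIM columns.
ceph orch ps --daemon-type osd
```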
Typically only two daemon types carry memory limits, mon and osd, and those limits are controlled by the following configuration options:
sudo ceph config get mon mon_memory_target # in bytes
sudo ceph config get mon mon_memory_autotune
sudo ceph config get osd osd_memory_target # in bytes
sudo ceph config get osd osd_memory_target_autotune
The memory limit shown by orch ps is not the same as the OSD's target value: BlueStore keeps OSD heap memory usage under a specified target size, controlled by the osd_memory_target configuration option.

osd_memory_target sets the OSD memory budget based on the RAM available in the system. It is used when TCMalloc is configured as the memory allocator and BlueStore's bluestore_cache_autotune option is set to true.
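On cephadm-managed clusters the target can also be computed per host by the orchestrator instead of being set by hand. A minimal sketch; 0.7 is cephadm's default ratio and is shown only as an example:

```bash
# Let cephadm derive osd_memory_target from each host's total RAM.
ceph config set osd osd_memory_target_autotune true

# Fraction of host memory that cephadm may assign to OSDs (0.7 is the default).
ceph config set mgr mgr/cephadm/autotune_memory_target_ratio 0.7
```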
To view the osd_memory_target configuration of an existing cluster:
# Show osd_memory_target for all OSDs in the storage cluster
sudo ceph config get osd osd_memory_target
# Show osd_memory_target for a specific OSD
sudo ceph config get osd.0 osd_memory_target
To set osd_memory_target for cluster OSDs:
# Set osd_memory_target for all OSDs in the storage cluster
ceph config set osd osd_memory_target VALUE
# Set osd_memory_target for a specific OSD, where id is the OSD's numeric ID
ceph config set osd.id osd_memory_target VALUE
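VALUE is given in bytes. For example, to set a 6 GiB target on a single OSD (the 6 GiB figure and osd.0 are purely illustrative):

```bash
# 6 GiB expressed in bytes (6 * 1024^3); adjust to your hardware.
ceph config set osd.0 osd_memory_target $((6 * 1024 * 1024 * 1024))

# Verify the effective value.
ceph config get osd.0 osd_memory_target
```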
Cases from the web
Below are two cases found online where OSDs exhibited unbounded memory growth:
- osd(s) with unlimited ram growth [6]
- How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs” [7]
Inspecting memory usage
Use the heap stats command; these statistics do not require a running profiler and do not dump heap allocation information to a file:
ceph tell osd.0 heap stats
Use the mempool dump command:
ceph daemon osd.NNN dump_mempools
Use google-perftools to analyze heap dumps collected from a running daemon's profiler:
google-pprof --text {path-to-daemon} {log-path/filename}
# For example
pprof --text /usr/bin/ceph-mon /var/log/ceph/mon.node1.profile.0001.heap
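The heap files that pprof reads are produced by the daemon's built-in profiler, as described in Memory Profiling [8]. A sketch of the full workflow against one OSD; the dump file name follows the pattern Ceph normally uses, but check the actual file written to your log directory:

```bash
# Start the heap profiler inside osd.0 and let the workload run for a while.
ceph tell osd.0 heap start_profiler

# Write a heap dump into the daemon's log directory
# (typically something like /var/log/ceph/osd.0.profile.0001.heap).
ceph tell osd.0 heap dump

# Stop profiling when enough samples have been collected.
ceph tell osd.0 heap stop_profiler

# Optionally ask TCMalloc to hand freed memory back to the OS.
ceph tell osd.0 heap release

# Analyze the dump; the path is an example and should match the file actually produced.
pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap
```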
Reference
[1] Minimum hardware considerations
[2] minimum-hardware-recommendations nautilus
[3] minimum-hardware-recommendations reef
[4] Excessive OSD memory usage #12078
[6] osd(s) with unlimited ram growth
[7] How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs”
[8] Memory Profiling
This post was originally published on Cylon的收藏册; please credit the original link when reposting.
Link: https://www.oomkill.com/2024/09/03-3-ceph-osd-performance-recommendation/
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.