This post documents memory problems encountered while running a Ceph cluster, together with references that can be used to estimate memory requirements for a Ceph deployment.
OSD memory requirements
Being able to estimate the hardware a Ceph OSD needs is a prerequisite for both cluster sizing and cluster tuning. Two reliable references for estimating OSD memory configuration are summarized below.
IBM Storage Ceph
IBM Storage Ceph publishes a minimum recommended system configuration for running Ceph [1]. It is aimed primarily at containerized Ceph clusters, but the figures are a useful reference when tuning your own cluster.
| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd-container | Processor | 1x AMD64 or Intel 64 CPU core per OSD container |
| | RAM | Minimum of 5 GB of RAM per OSD container |
| | OS Disk | 1x OS disk per host |
| | OSD Storage | 1x storage drive per OSD container. Cannot be shared with the OS disk. |
| | block.db | Optional, but IBM recommended: 1x SSD, NVMe, or Optane partition or LVM per daemon. Sizing is 4% of block.data for BlueStore for object, file, and mixed workloads, and 1% of block.data for BlueStore for Block Device, OpenStack Cinder, and OpenStack Glance workloads. |
| | block.wal | Optionally, 1x SSD, NVMe, or Optane partition or logical volume per daemon. Use a small size, for example 10 GB, and only if it is faster than the block.db device. |
| | Network | 2x 10 GB Ethernet NICs |
| ceph-mon-container | Processor | 1x AMD64 or Intel 64 CPU core per mon-container |
| | RAM | 3 GB per mon-container |
| | Disk Space | 10 GB per mon-container, 50 GB recommended |
| | Monitor Disk | Optionally, 1x SSD disk for the monitor's RocksDB data |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended |
| | Prometheus | 20 GB to 50 GB under /var/lib/ceph/, created as a separate file system to protect the contents of /var/ |
| ceph-mgr-container | Processor | 1x AMD64 or Intel 64 CPU core per mgr-container |
| | RAM | 3 GB per mgr-container |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended |
| ceph-radosgw-container | Processor | 1x AMD64 or Intel 64 CPU core per radosgw-container |
| | RAM | 1 GB per daemon |
| | Disk Space | 5 GB per daemon |
| | Network | 1x 1 GB Ethernet NIC |
| ceph-mds-container | Processor | 1x AMD64 or Intel 64 CPU core per mds-container |
| | RAM | 3 GB per mds-container. This number is highly dependent on the configurable MDS cache size; the RAM requirement is typically twice the value of mds_cache_memory_limit. Note that this is memory for the daemon itself, not overall system memory. |
| | Disk Space | 2 GB per mds-container, plus any additional space required for possible debug logging; 20 GB is a good start. |
| | Network | 2x 1 GB Ethernet NICs, 10 GB recommended. This is the same network as the OSD containers: if the OSDs have a 10 GB network, use the same for the MDS so that the MDS is not disadvantaged with respect to latency. |
Hardware Recommendations
The Ceph project also publishes its own hardware recommendations. The key parameters are stated clearly, although the cluster scale they assume is somewhat vague, so treat them as guidance rather than hard rules; note also that the recommended hardware differs between Ceph releases.
The table below shows the recommended minimum hardware for Ceph Nautilus [2].
| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd | Processor | 1x 64-bit AMD-64, or 1x 32-bit ARM dual-core or better |
| | RAM | ~1GB per 1TB of storage per daemon |
| | Volume Storage | 1x storage drive per daemon |
| | Journal | 1x SSD partition per daemon (optional) |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mon | Processor | 1x 64-bit AMD-64, or 1x 32-bit ARM dual-core or better |
| | RAM | 1 GB per daemon |
| | Disk Space | 10 GB per daemon |
| | Network | 2x 1GB Ethernet NICs |
| ceph-mds | Processor | 1x 64-bit AMD-64 quad-core, or 1x 32-bit ARM quad-core |
| | RAM | 1 GB minimum per daemon |
| | Disk Space | 1 MB per daemon |
| | Network | 2x 1GB Ethernet NICs |
The table below shows the official minimum recommendation for the Reef release [3].
| Process | Criteria | Bare Minimum and Recommended |
|---|---|---|
| ceph-osd | Processor | 1 core minimum, 2 recommended. 1 core per 200-500 MB/s throughput, 1 core per 1000-3000 IOPS (results are before replication). Results may vary across CPU and drive models and Ceph configuration (erasure coding, compression, etc.). ARM processors specifically may require more cores for performance. SSD OSDs, especially NVMe, benefit from additional cores per OSD. Actual performance depends on many factors including drives, network, and client throughput and latency; benchmarking is highly recommended. |
| | RAM | 4GB+ per daemon (more is better). 2-4GB may function but may be slow; less than 2GB is not recommended. |
| | Storage Drives | 1x storage drive per OSD |
| | DB/WAL (optional) | 1x SSD partition per HDD OSD. 4-5x HDD OSDs per DB/WAL SATA SSD; <= 10 HDD OSDs per DB/WAL NVMe SSD. |
| | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
| ceph-mon | Processor | 2 cores minimum |
| | RAM | 5GB+ per daemon (large / production clusters need more) |
| | Storage | 100 GB per daemon, SSD recommended |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
| ceph-mds | Processor | 2 cores minimum |
| | RAM | 2GB+ per daemon (more for production) |
| | Disk Space | 1 GB per daemon |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
Examples from our Ceph environments
Recorded memory usage of Ceph OSDs serving an OpenStack environment, used mainly for RBD. The machines have a mix of 1.8 TB and 900 GB drives and 512 GB of RAM; each OSD sits at roughly 0.3% memory usage, i.e. about 2 GB per OSD.
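A quick way to get an equivalent per-OSD view on a node (just a sketch; top or any other process viewer works as well):

```bash
# Resident memory (RSS) and %MEM of every ceph-osd process, largest first
ps -eo pid,rss,pmem,cmd --sort=-rss | grep '[c]eph-osd'
```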
OSDs of a production cluster used mainly for object storage. The machines are 8 cores / 16 GB RAM with 700 GB drives and at most three OSDs per node; each OSD uses roughly 1.8-2 GB of memory.
(Per-node memory screenshots for Ceph node01 through node04 omitted; node04 hosts only one OSD.)
Why memory limits are needed
OSDs running without limits will consume all available memory, so a poorly configured data node can also trigger the OOM killer:
The OSDs are designed to consume all the available memory if they are run without limits. So it is recommended to apply the resource limits, and the OSDs will stay within the bounds you set. Typically 4GB is sufficient per OSD. [4]
When OSDs go through recovery, their memory utilization spikes. If there is insufficient RAM available, OSD performance slows considerably and the daemons may even crash or be killed by the Linux OOM killer. [5]
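As a rough sanity check against the object-storage nodes above (back-of-the-envelope arithmetic, not taken from the cited sources): three OSDs at the commonly cited 4 GB target already account for most of a 16 GB node once OS overhead and recovery spikes are considered.

```bash
# Illustrative estimate only: OSDs per node x per-OSD memory target
osds_per_node=3   # the example nodes above run up to three OSDs
target_gb=4       # "typically 4GB is sufficient per OSD"
echo "OSD memory budget: $(( osds_per_node * target_gb )) GB of a 16 GB node"
```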
On a cluster deployed with cephadm, memory usage can be inspected with the command below.
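A minimal example, assuming a cephadm-managed cluster; `ceph orch ps` lists each daemon along with its current memory use and configured memory limit:

```bash
# Per-daemon memory usage and memory limit, as reported by the orchestrator
ceph orch ps
```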
Normally only two daemon types carry a memory limit, mon and osd; these limits are controlled by the configuration options below.
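Presumably these are `osd_memory_target` for OSDs and `mon_memory_target` for monitors (standard upstream option names, assumed here rather than quoted from a specific source); both can be read back with `ceph config get`:

```bash
# Memory targets commonly used to derive the daemon memory limits (assumed)
ceph config get osd osd_memory_target
ceph config get mon mon_memory_target
```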
The memory limit reported by `orch ps` is not the same as the OSD's target value. BlueStore keeps OSD heap memory usage under a specified target size via the `osd_memory_target` configuration option. `osd_memory_target` sets the OSD memory budget based on the RAM available in the system, and it takes effect when TCMalloc is the configured memory allocator and the BlueStore option `bluestore_cache_autotune` is set to `true`.
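To confirm that precondition on a running cluster, the option can simply be read back (a quick check, not part of the original walkthrough):

```bash
# osd_memory_target only drives cache autotuning when tcmalloc is the allocator
# and bluestore_cache_autotune is enabled (default is true in recent releases)
ceph config get osd bluestore_cache_autotune
```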
Check the current OSD setting on an existing cluster:
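A sketch (osd.0 stands in for any OSD id):

```bash
# Cluster-wide default for all OSDs
ceph config get osd osd_memory_target
# Value a particular running OSD resolves to
ceph config show osd.0 osd_memory_target
```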
Configure `osd_memory_target` for the cluster's OSDs:
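For example, assuming a 4 GiB target (the value is given in bytes; choose one that fits the node's RAM and OSD count):

```bash
# Set the target for all OSDs
ceph config set osd osd_memory_target 4294967296
# Or override it for a single OSD
ceph config set osd.0 osd_memory_target 4294967296
```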
Cases from the web
Below are two cases found online of OSDs with unbounded memory growth:
- osd(s) with unlimited ram growth [6]
- How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs” [7]
Inspecting memory
Use the heap stats command; these statistics do not require the profiler to be running and do not dump heap allocation information to a file.
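For example, against a single OSD (osd.0 is illustrative):

```bash
# tcmalloc heap statistics for one daemon; no profiler needed
ceph tell osd.0 heap stats
```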
Use the mempools command:
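For example:

```bash
# Per-mempool byte and item counts (BlueStore caches, osd_pglog, osdmap, etc.)
ceph tell osd.0 dump_mempools
```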
Use google-perftools; this starts a profiler inside the running daemon so that its heap allocations can be examined [8].
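A sketch following the memory-profiling workflow; the daemon binary and heap dump paths below are only illustrative and vary by installation (the analyzer may be installed as `pprof` instead of `google-pprof`):

```bash
# Start the tcmalloc heap profiler inside a running OSD
ceph tell osd.0 heap start_profiler
# Write a heap profile, then stop profiling
ceph tell osd.0 heap dump
ceph tell osd.0 heap stop_profiler
# Analyze the dump with google-pprof (paths are illustrative)
google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap
```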
Reference
[1] Minimum hardware considerations
[2] minimum-hardware-recommendations nautilus
[3] minimum-hardware-recommendations reef
[4] Excessive OSD memory usage #12078
[6] osd(s) with unlimited ram growth
[7] How to solve “the Out of Memory Killer issue that kills your OSDs due to bad entries in PG logs”
[8] Memory Profiling