Post-mortem: rushing to take a Ceph object storage gateway offline without checking the root cause

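Procedure

Memory usage on a Ceph node stayed above 90%. The node already hosts three OSDs, and a look at per-process memory usage showed that radosgw was the single largest consumer: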

bash
$ ps aux --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph        1702  0.4 32.9 10128296 4550760 ?    Ssl  May03 919:18 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node01 --setuser ceph --setgroup ceph
ceph        1721  0.6 12.8 3318456 2088704 ?     Ssl  May03 1216:59 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph        1983  0.6 12.3 3358788 2012844 ?     Ssl  May03 1273:25 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph        1991  0.9 11.7 3451788 1912008 ?     Ssl  May03 1719:04 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph        1709  0.5  7.4 1646276 1212576 ?     Ssl  May03 1047:48 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph       18979  1.0  4.5 1330064 742680 ?      Ssl  May03 1932:51 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph      529617  3.7  4.4 1909588 721492 ?      Ssl  Jul15 3140:39 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root         801  0.0  0.6 182536 98516 ?        Ss   May03 105:28 /usr/lib/systemd/systemd-journald
root        1704  0.0  0.3 701284 50132 ?        Ssl  May03  53:48 /usr/sbin/rsyslogd -n

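Because this node runs every role at once (three OSDs, ceph-mon, ceph-mds, and so on), my first idea was to move radosgw to another node rather than analyze why the radosgw process was using so much memory.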
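
As a first analysis step, the ps listing can be aggregated per daemon type, which shows how much memory each role holds in total rather than per process. The following is a minimal sketch; rss_by_daemon is a hypothetical helper name, and it assumes the `ps aux` column layout shown above (RSS in KiB in column 6, command in column 11).

```bash
# Sum resident memory (RSS, column 6 of `ps aux`, in KiB) per command name
# and print it in MiB, largest first.
# rss_by_daemon is a hypothetical helper, not part of Ceph or procps.
rss_by_daemon() {
  awk '$6 ~ /^[0-9]+$/ {
         n = split($11, path, "/")   # keep only the basename of the command
         rss[path[n]] += $6
       }
       END {
         for (name in rss) printf "%s %d\n", name, rss[name] / 1024
       }' | sort -k2,2nr
}

# Usage (on the affected node):
#   ps aux | rss_by_daemon | head
```

Applied to the listing above, the three ceph-osd processes together hold more resident memory than radosgw does, so radosgw is the largest single process but not the only pressure on the node.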

I requested a new node for radosgw. The deployment failed, and the ceph-deploy output did not point to any useful log:

bash
$ ceph-deploy rgw create node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.1): /bin/ceph-deploy rgw create node06
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  rgw                           : [('node06', 'rgw.node06')]
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : create
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fe8a9b583f8>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function rgw at 0x7fe8aa412050>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.rgw][DEBUG ] Deploying rgw, cluster ceph hosts node06:rgw.node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

sudo: no tty present and no askpass program specified
[ceph_deploy.rgw][ERROR ] connecting to host: node06 resulted in errors: IOError cannot send (already closed?)
[ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs

journalctl -u on the new node showed the following errors; the key line is "ceph : user NOT in sudoers":

bash
Sep 12 14:51:42 node06 sshd[30495]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30495]: Accepted publickey for ceph from 192.168.20.88 port 37872 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 97 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 97 of user ceph.
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sshd[30497]: Received disconnect from 192.168.20.88 port 37872:11: disconnected by user
Sep 12 14:51:42 node06 sshd[30497]: Disconnected from 192.168.20.88 port 37872
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:42 node06 systemd-logind[743]: Removed session 97.
Sep 12 14:51:42 node06 systemd[1]: Removed slice User Slice of ceph.
Sep 12 14:51:42 node06 sshd[30521]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30521]: Accepted publickey for ceph from 192.168.20.88 port 37874 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 98 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 98 of user ceph.
Sep 12 14:51:42 node06 sshd[30521]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): conversation failed
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): auth could not identify password for [ceph]
Sep 12 14:51:45 node06 sudo[30526]:     ceph : user NOT in sudoers ; TTY=unknown ; PWD=/home/ceph ; USER=root ; COMMAND=/bin/python2 -c import sys;exec(eval(sys.stdin.readline()))
Sep 12 14:51:45 node06 sshd[30525]: Received disconnect from 192.168.20.88 port 37874:11: disconnected by user
Sep 12 14:51:45 node06 sshd[30525]: Disconnected from 192.168.20.88 port 37874
Sep 12 14:51:45 node06 sshd[30521]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:45 node06 postfix/sendmail[30549]: fatal: parameter inet_interfaces: no local interface found for ::1

Configure sudo

bash
# cat /etc/sudoers.d/ceph
ceph ALL = (root) NOPASSWD:ALL
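
Before re-running ceph-deploy, it can save a round trip to confirm from the admin node that the ceph user really has passwordless sudo on the target. The sketch below is one way to do that; check_passwordless_sudo is a hypothetical helper, and it assumes the admin node can already reach the target over SSH as the deploy user (for example through ~/.ssh/config, since this cluster uses a non-default SSH port).

```bash
# Return 0 if the remote user can run sudo without a password prompt.
# check_passwordless_sudo is a hypothetical helper name.
check_passwordless_sudo() {
  host=$1
  # sudo -n fails instead of prompting; BatchMode stops ssh from prompting.
  if ssh -o BatchMode=yes "$host" sudo -n true 2>/dev/null; then
    echo "ok: passwordless sudo works on $host"
  else
    echo "FAIL: passwordless sudo is not available on $host"
    return 1
  fi
}

# Usage (on the admin node):
#   check_passwordless_sudo node06
# On the target itself, the drop-in can be syntax-checked with:
#   visudo -cf /etc/sudoers.d/ceph
```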

With sudo configured, deploy radosgw

bash
# Push the configuration file
ceph-deploy --overwrite-conf config push node06
# Install the packages
ceph-deploy install --no-adjust-repos --nogpgcheck node06
# Create a new rgw instance; create is the only rgw subcommand ceph-deploy offers
ceph-deploy rgw create node06

Full output

bash
$ ceph-deploy --overwrite-conf config push node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.1): /bin/ceph-deploy --overwrite-conf config push node06
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : True
[ceph_deploy.cli][INFO  ]  subcommand                    : push
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f3e96c9e8c0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  client                        : ['node06']
[ceph_deploy.cli][INFO  ]  func                          : <function config at 0x7f3e96ec9c08>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.config][DEBUG ] Pushing config to node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06 
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[node06][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf


$ ceph-deploy install  --no-adjust-repos --nogpgcheck node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.1): /bin/ceph-deploy install --no-adjust-repos --nogpgcheck node06
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  testing                       : None
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fc503ac4758>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  dev_commit                    : None
[ceph_deploy.cli][INFO  ]  install_mds                   : False
[ceph_deploy.cli][INFO  ]  stable                        : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  adjust_repos                  : False
[ceph_deploy.cli][INFO  ]  func                          : <function install at 0x7fc5041125f0>
[ceph_deploy.cli][INFO  ]  install_mgr                   : False
[ceph_deploy.cli][INFO  ]  install_all                   : False
[ceph_deploy.cli][INFO  ]  repo                          : False
[ceph_deploy.cli][INFO  ]  host                          : ['node06']
[ceph_deploy.cli][INFO  ]  install_rgw                   : False
[ceph_deploy.cli][INFO  ]  install_tests                 : False
[ceph_deploy.cli][INFO  ]  repo_url                      : None
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  install_osd                   : False
[ceph_deploy.cli][INFO  ]  version_kind                  : stable
[ceph_deploy.cli][INFO  ]  install_common                : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  dev                           : master
[ceph_deploy.cli][INFO  ]  nogpgcheck                    : True
[ceph_deploy.cli][INFO  ]  local_mirror                  : None
[ceph_deploy.cli][INFO  ]  release                       : None
[ceph_deploy.cli][INFO  ]  install_mon                   : False
[ceph_deploy.cli][INFO  ]  gpg_url                       : None
[ceph_deploy.install][DEBUG ] Installing stable version mimic on cluster ceph hosts node06
[ceph_deploy.install][DEBUG ] Detecting platform for host node06 ...
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06 
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[ceph_deploy.install][INFO  ] Distro info: CentOS Linux 7.9.2009 Core
[node06][INFO  ] installing Ceph on node06
[node06][INFO  ] Running command: sudo yum clean all
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Cleaning repos: base centos-sclo-rh centos-sclo-sclo devops-Extra epel extras
[node06][DEBUG ]               : openresty remi-php72 remi-php73 remi-php74 remi-safe salt-latest
[node06][DEBUG ]               : tools-repo updates zabbix
[node06][DEBUG ] Cleaning up list of fastest mirrors
[node06][DEBUG ] Other repos take up 23 M of disk space (use --verbose for details)
[node06][INFO  ] Running command: sudo yum -y install ceph ceph-radosgw
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Determining fastest mirrors
[node06][DEBUG ] No package ceph available.
[node06][DEBUG ] Nothing to do
[node06][INFO  ] Running command: sudo ceph --version
[node06][DEBUG ] ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)

Check whether the new gateway is online

bash
$ ceph -s
  cluster:
    id:     baf87797-3ec1-4f2c-8126-bf0a44051b13
    health: HEALTH_WARN
            1 pools have many more objects per pg than average
 
  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 2w)
    mgr: node01(active, since 8w), standbys: node02, node03
    mds: kubefs:2 {0=node01=up:active,1=node02=up:active} 1 up:standby
    osd: 13 osds: 13 up (since 6d), 13 in (since 7M)
    rgw: 4 daemons active (node01, node02, node03, node06)
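
The same check can be scripted so it can gate the next step. This is a small sketch that reads `ceph -s` output on stdin and looks for a host name on the rgw line; rgw_is_active is a hypothetical helper, and it assumes the plain-text status layout shown above.

```bash
# Succeed if the given host appears on the "rgw:" line of `ceph -s` output.
# rgw_is_active is a hypothetical helper name.
rgw_is_active() {
  host=$1
  # -w matches whole words only, so "node0" does not match "node06".
  grep -E '^[[:space:]]*rgw:' | grep -qw -- "$host"
}

# Usage:
#   ceph -s | rgw_is_active node06 && echo "rgw on node06 is active"
```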

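Client requests reach the radosgw service through a load balancer, so at this point the new instance receives no traffic. The load balancer has to be updated to include the new node. Once traffic is flowing to it, confirm that the old instance is no longer handling business requests before taking it offline.

Confirm requests: check active connections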

bash
$ netstat -an|grep 7480
tcp        0      0 0.0.0.0:7480            0.0.0.0:*               LISTEN     
tcp        0      0 192.168.20.84:7480      192.168.20.84:33152      ESTABLISHED
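
A plain grep for 7480 also matches other ports that contain those digits and connections where 7480 is the remote port. A slightly stricter sketch, assuming the `netstat -ant` column layout shown above (local address in column 4, state in column 6), counts only established connections to the local port; count_established is a hypothetical helper.

```bash
# Count ESTABLISHED connections whose local address ends in :<port>.
# Reads `netstat -ant` output on stdin; count_established is a hypothetical helper.
count_established() {
  port=$1
  awk -v p=":$port" '
    $6 == "ESTABLISHED" && substr($4, length($4) - length(p) + 1) == p { n++ }
    END { print n + 0 }'
}

# Usage (on the node being drained):
#   netstat -ant | count_established 7480
```

A count that stays at zero over a few minutes, or shows only connections that originate from the node itself, suggests the load balancer has stopped sending traffic to this instance.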

Confirm requests: check the service log

bash
$ tail -f /var/log/ceph/ceph-client.rgw.node01.log
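
Watching tail -f by eye works, but a count over a fixed window is easier to record. The sketch below counts matching lines appended to a log after a remembered line number; requests_since is a hypothetical helper, and the pattern 'starting new request' is an assumption about what this radosgw version logs per request at the configured debug level, so adjust it to whatever the log actually contains.

```bash
# Count lines matching a pattern that were appended after line <start>.
# requests_since is a hypothetical helper name.
requests_since() {
  file=$1 start=$2 pattern=$3
  # grep -c exits non-zero when the count is 0; that is not an error here.
  tail -n "+$((start + 1))" "$file" | grep -c -- "$pattern" || true
}

# Usage:
#   log=/var/log/ceph/ceph-client.rgw.node01.log
#   start=$(wc -l < "$log"); sleep 600
#   requests_since "$log" "$start" 'starting new request'
```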

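Once that is confirmed, the old instance can be taken offline. Services deployed with ceph-deploy have nothing comparable to the cephadm orchestrator commands for removing a service (for example ceph orch rm <service_name>), so the service is simply stopped through systemd.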

bash
$ systemctl -l|grep rados
  ceph-radosgw@rgw.node01.service                                                     loaded active     running      Ceph rados gateway
  system-ceph\x2dradosgw.slice                                                                loaded active     active       system-ceph\x2dradosgw.slice
  ceph-radosgw.target                                                                         loaded active     active       ceph target allowing to start/stop all ceph-radosgw@.service instances at once

Stop the service and check memory usage

bash
$ systemctl stop ceph-radosgw@rgw.node01.service
# ps axu --sort=-%mem | head -10
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph        1983  0.6 12.8 3358788 2084324 ?     Ssl  May03 1275:36 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph        1991  0.9 12.5 3451788 2033560 ?     Ssl  May03 1722:12 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph        1721  0.6 11.8 3318456 1920876 ?     Ssl  May03 1219:27 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph        1709  0.5  7.4 1646276 1212516 ?     Ssl  May03 1050:21 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph       18979  1.0  4.5 1330064 744972 ?      Ssl  May03 1937:16 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph      529617  3.7  4.4 1914452 726436 ?      Ssl  Jul15 3153:14 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
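
Stopping the unit only affects the running process. If the unit is enabled, it would start again at the next reboot, so it may be worth disabling it as well. A minimal sketch follows; retire_rgw is a hypothetical helper, and it assumes the instance follows the ceph-radosgw@rgw.<host>.service naming shown above.

```bash
# Stop and disable one radosgw instance, then verify it is no longer active.
# retire_rgw is a hypothetical helper name; run it as root on the old node.
retire_rgw() {
  unit="ceph-radosgw@rgw.$1.service"
  systemctl stop "$unit" &&
  systemctl disable "$unit" &&
  ! systemctl is-active --quiet "$unit"
}

# Usage:
#   retire_rgw node01 && echo "rgw.node01 stopped and disabled"
```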

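Summary

This time I never analyzed why memory usage was high. I rushed into the migration instead, and as a result the root cause could not be determined afterwards. When a problem like this comes up again, analyze it and preserve the evidence first, and only then carry out actions such as a migration.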
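
As an illustration of preserving evidence, the per-process memory details that disappear when the daemon stops can be copied out of /proc beforehand. This is a sketch under assumptions, not what was done in this incident; snapshot_proc is a hypothetical helper, the output directory is arbitrary, and reading another user's /proc entries generally requires root.

```bash
# Save the kernel's view of a process (memory map, limits, command line)
# so it can still be examined after the process has been stopped.
# snapshot_proc is a hypothetical helper name.
snapshot_proc() {
  pid=$1 out=$2
  mkdir -p "$out" || return 1
  for f in status smaps limits; do
    cat "/proc/$pid/$f" > "$out/$f" 2>/dev/null ||
      echo "warning: could not read /proc/$pid/$f" >&2
  done
  # cmdline is NUL-separated; turn it into a readable single line.
  tr '\0' ' ' < "/proc/$pid/cmdline" > "$out/cmdline"
}

# Usage, with the radosgw PID from the first ps listing:
#   snapshot_proc 1702 /var/tmp/rgw-evidence
# If the daemon's admin socket is reachable, its counters could be saved too,
# for example: ceph daemon client.rgw.node01 perf dump > perf.json
```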