A record of a failed operation: in a hurry, I took a Ceph object storage (radosgw) instance offline without first investigating the root cause.
Operation workflow
Memory usage on one Ceph node stayed above 90%. Since the node itself hosts three OSDs, I checked per-process memory usage and found radosgw at the top:
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1702 0.4 32.9 10128296 4550760 ? Ssl May03 919:18 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node01 --setuser ceph --setgroup ceph
ceph 1721 0.6 12.8 3318456 2088704 ? Ssl May03 1216:59 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 1983 0.6 12.3 3358788 2012844 ? Ssl May03 1273:25 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 1991 0.9 11.7 3451788 1912008 ? Ssl May03 1719:04 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph 1709 0.5 7.4 1646276 1212576 ? Ssl May03 1047:48 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 18979 1.0 4.5 1330064 742680 ? Ssl May03 1932:51 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 529617 3.7 4.4 1909588 721492 ? Ssl Jul15 3140:39 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root 801 0.0 0.6 182536 98516 ? Ss May03 105:28 /usr/lib/systemd/systemd-journald
root 1704 0.0 0.3 701284 50132 ? Ssl May03 53:48 /usr/sbin/rsyslogd -n
Because this node runs three OSDs plus ceph-mon, ceph-mds, and other services all at once, my first thought was to move radosgw to another node, rather than to analyze why the radosgw process was using so much memory.
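In hindsight, it would have been worth capturing some evidence from the running process before doing anything else. A minimal sketch of what that could have looked like, querying the daemon's admin socket (the socket path assumes the default naming for this instance):
# Assumption: default admin socket path for client.rgw.node01
$ ceph daemon /var/run/ceph/ceph-client.rgw.node01.asok perf dump
# Dump per-mempool allocations to see where the memory is going
$ ceph daemon /var/run/ceph/ceph-client.rgw.node01.asok dump_mempools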
I requested a new node to deploy radosgw on, but the deployment failed with an error that pointed to no useful log:
$ ceph-deploy rgw create node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy rgw create node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] rgw : [('node06', 'rgw.node06')]
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fe8a9b583f8>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function rgw at 0x7fe8aa412050>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.rgw][DEBUG ] Deploying rgw, cluster ceph hosts node06:rgw.node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
sudo: no tty present and no askpass program specified
[ceph_deploy.rgw][ERROR ] connecting to host: node06 resulted in errors: IOError cannot send (already closed?)
[ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs
The error "sudo: no tty present and no askpass program specified" suggests that sudo on node06 wanted a password but had no TTY to prompt on. Checking with journalctl on the new node showed the following errors:
Sep 12 14:51:42 node06 sshd[30495]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30495]: Accepted publickey for ceph from 192.168.20.88 port 37872 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 97 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 97 of user ceph.
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sshd[30497]: Received disconnect from 192.168.20.88 port 37872:11: disconnected by user
Sep 12 14:51:42 node06 sshd[30497]: Disconnected from 192.168.20.88 port 37872
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:42 node06 systemd-logind[743]: Removed session 97.
Sep 12 14:51:42 node06 systemd[1]: Removed slice User Slice of ceph.
Sep 12 14:51:42 node06 sshd[30521]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30521]: Accepted publickey for ceph from 192.168.20.88 port 37874 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 98 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 98 of user ceph.
Sep 12 14:51:42 node06 sshd[30521]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): conversation failed
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): auth could not identify password for [ceph]
Sep 12 14:51:45 node06 sudo[30526]: ceph : user NOT in sudoers ; TTY=unknown ; PWD=/home/ceph ; USER=root ; COMMAND=/bin/python2 -c import sys;exec(eval(sys.stdin.readline()))
Sep 12 14:51:45 node06 sshd[30525]: Received disconnect from 192.168.20.88 port 37874:11: disconnected by user
Sep 12 14:51:45 node06 sshd[30525]: Disconnected from 192.168.20.88 port 37874
Sep 12 14:51:45 node06 sshd[30521]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:45 node06 postfix/sendmail[30549]: fatal: parameter inet_interfaces: no local interface found for ::1
The journal confirms the cause: the ceph user was not in sudoers on the new node. Configure passwordless sudo for it:
# cat /etc/sudoers.d/ceph
ceph ALL = (root) NOPASSWD:ALL
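Before rerunning the deployment, it is worth validating the new sudoers drop-in and confirming that passwordless sudo works without a TTY, since that is exactly what ceph-deploy needs. A quick sketch, reusing the SSH port 55556 seen in the logs above:
# Check the sudoers file for syntax errors
$ visudo -cf /etc/sudoers.d/ceph
# sudo -n fails instead of prompting, so success here means no password is needed
$ ssh -p 55556 ceph@node06 sudo -n true && echo "passwordless sudo OK"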
With sudo configured, deploy radosgw:
# Push the config file
ceph-deploy --overwrite-conf config push node06
# Install the packages
ceph-deploy install --no-adjust-repos --nogpgcheck node06
# Create a new rgw instance (the ceph-deploy rgw subcommand only supports create)
ceph-deploy rgw create node06
Full output:
$ ceph-deploy --overwrite-conf config push node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy --overwrite-conf config push node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : True
[ceph_deploy.cli][INFO ] subcommand : push
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f3e96c9e8c0>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] client : ['node06']
[ceph_deploy.cli][INFO ] func : <function config at 0x7f3e96ec9c08>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.config][DEBUG ] Pushing config to node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[node06][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
$ ceph-deploy install --no-adjust-repos --nogpgcheck node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy install --no-adjust-repos --nogpgcheck node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] testing : None
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fc503ac4758>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] dev_commit : None
[ceph_deploy.cli][INFO ] install_mds : False
[ceph_deploy.cli][INFO ] stable : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] adjust_repos : False
[ceph_deploy.cli][INFO ] func : <function install at 0x7fc5041125f0>
[ceph_deploy.cli][INFO ] install_mgr : False
[ceph_deploy.cli][INFO ] install_all : False
[ceph_deploy.cli][INFO ] repo : False
[ceph_deploy.cli][INFO ] host : ['node06']
[ceph_deploy.cli][INFO ] install_rgw : False
[ceph_deploy.cli][INFO ] install_tests : False
[ceph_deploy.cli][INFO ] repo_url : None
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] install_osd : False
[ceph_deploy.cli][INFO ] version_kind : stable
[ceph_deploy.cli][INFO ] install_common : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] dev : master
[ceph_deploy.cli][INFO ] nogpgcheck : True
[ceph_deploy.cli][INFO ] local_mirror : None
[ceph_deploy.cli][INFO ] release : None
[ceph_deploy.cli][INFO ] install_mon : False
[ceph_deploy.cli][INFO ] gpg_url : None
[ceph_deploy.install][DEBUG ] Installing stable version mimic on cluster ceph hosts node06
[ceph_deploy.install][DEBUG ] Detecting platform for host node06 ...
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[ceph_deploy.install][INFO ] Distro info: CentOS Linux 7.9.2009 Core
[node06][INFO ] installing Ceph on node06
[node06][INFO ] Running command: sudo yum clean all
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Cleaning repos: base centos-sclo-rh centos-sclo-sclo devops-Extra epel extras
[node06][DEBUG ] : openresty remi-php72 remi-php73 remi-php74 remi-safe salt-latest
[node06][DEBUG ] : tools-repo updates zabbix
[node06][DEBUG ] Cleaning up list of fastest mirrors
[node06][DEBUG ] Other repos take up 23 M of disk space (use --verbose for details)
[node06][INFO ] Running command: sudo yum -y install ceph ceph-radosgw
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Determining fastest mirrors
[node06][DEBUG ] No package ceph available.
[node06][DEBUG ] Nothing to do
[node06][INFO ] Running command: sudo ceph --version
[node06][DEBUG ] ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
Note that yum reported "No package ceph available", most likely because the node already had Ceph installed and the enabled repos did not provide the package; the version check right after confirms the binaries are present. Check whether the new daemon came online:
$ ceph -s
  cluster:
    id:     baf87797-3ec1-4f2c-8126-bf0a44051b13
    health: HEALTH_WARN
            1 pools have many more objects per pg than average

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 2w)
    mgr: node01(active, since 8w), standbys: node02, node03
    mds: kubefs:2 {0=node01=up:active,1=node02=up:active} 1 up:standby
    osd: 13 osds: 13 up (since 6d), 13 in (since 7M)
    rgw: 4 daemons active (node01, node02, node03, node06)
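Before pointing the load balancer at the new instance, it can also be confirmed that it actually answers HTTP requests. A minimal check, assuming radosgw listens on its default port 7480; an anonymous request should come back with an S3-style XML response:
# Expect an HTTP 200 with a ListAllMyBucketsResult body for the anonymous user
$ curl -i http://node06:7480/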
Client requests reach the radosgw service through a load balancer, and at this point the new instance was receiving no traffic, so the load balancer had to be updated to add the new node. Once traffic is flowing to it, confirm that the old instance is no longer handling business requests before taking it offline. First, check the active connections:
$ netstat -an | grep 7480
tcp 0 0 0.0.0.0:7480 0.0.0.0:* LISTEN
tcp 0 0 192.168.20.84:7480 192.168.20.84:33152 ESTABLISHED
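Rather than checking once, one could poll until no established client connections remain on the RGW port; a simple sketch:
# Re-run the connection check every 5 seconds until the old instance drains
$ watch -n 5 'netstat -an | grep 7480 | grep ESTABLISHED'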
Also confirm by watching the service log:
$ tail -f /var/log/ceph/ceph-client.rgw.node01.log
Once everything checks out, the old instance can be taken offline. Services deployed with ceph-deploy have no removal tooling comparable to cephadm (something like ceph orch rm rgw.xx); simply stop the service through systemd:
$ systemctl -l | grep rados
ceph-radosgw@rgw.node01.service loaded active running Ceph rados gateway
system-ceph\x2dradosgw.slice loaded active active system-ceph\x2dradosgw.slice
ceph-radosgw.target loaded active active ceph target allowing to start/stop all ceph-radosgw@.service instances at once
Stop the service and check memory usage again:
$ systemctl stop ceph-radosgw@rgw.node01.service
# ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1983 0.6 12.8 3358788 2084324 ? Ssl May03 1275:36 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 1991 0.9 12.5 3451788 2033560 ? Ssl May03 1722:12 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph 1721 0.6 11.8 3318456 1920876 ? Ssl May03 1219:27 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 1709 0.5 7.4 1646276 1212516 ? Ssl May03 1050:21 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 18979 1.0 4.5 1330064 744972 ? Ssl May03 1937:16 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 529617 3.7 4.4 1914452 726436 ? Ssl Jul15 3153:14 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
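Note that stopping the unit does not prevent it from starting again at the next boot; to retire the instance permanently, one would also disable it (same unit name as above):
# Remove the unit from the boot sequence so the retired instance stays down
$ systemctl disable ceph-radosgw@rgw.node01.service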
Summary
In this incident I never analyzed why memory usage was high; in my haste I went straight to migration, which means the root cause can no longer be determined after the fact. Next time, analyze the problem and preserve the evidence first, and only then take actions such as migration.
This article was published on Cylon's Collection; please credit the original link when reposting~
Link: https://www.oomkill.com/2024/09/05-5-failed-troubleshooting-for-rgw/
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.