A record of a failed operation: in a hurry, I took a Ceph object storage (radosgw) instance offline without first investigating the root cause.
Operation workflow
Memory usage on one Ceph node stayed above 90%. Since the node itself hosts three OSDs, I checked per-process memory usage and found radosgw at the top:
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1702 0.4 32.9 10128296 4550760 ? Ssl May03 919:18 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node01 --setuser ceph --setgroup ceph
ceph 1721 0.6 12.8 3318456 2088704 ? Ssl May03 1216:59 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 1983 0.6 12.3 3358788 2012844 ? Ssl May03 1273:25 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 1991 0.9 11.7 3451788 1912008 ? Ssl May03 1719:04 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph 1709 0.5 7.4 1646276 1212576 ? Ssl May03 1047:48 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 18979 1.0 4.5 1330064 742680 ? Ssl May03 1932:51 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 529617 3.7 4.4 1909588 721492 ? Ssl Jul15 3140:39 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root 801 0.0 0.6 182536 98516 ? Ss May03 105:28 /usr/lib/systemd/systemd-journald
root 1704 0.0 0.3 701284 50132 ? Ssl May03 53:48 /usr/sbin/rsyslogd -n
Because this node runs three OSDs plus ceph-mon, ceph-mds, and other services all at once, my first thought was to move radosgw to another node, rather than to analyze why the radosgw process was using so much memory.
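In hindsight, it would have been worth capturing some evidence from the running process before doing anything else. A minimal sketch of what that could have looked like, querying the daemon's admin socket (the socket path assumes the default naming for this instance):
# Assumption: default admin socket path for client.rgw.node01
$ ceph daemon /var/run/ceph/ceph-client.rgw.node01.asok perf dump
# Dump per-mempool allocations to see where the memory is going
$ ceph daemon /var/run/ceph/ceph-client.rgw.node01.asok dump_mempools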
I requested a new node to deploy radosgw on, but the deployment failed with an error that pointed to no useful log:
$ ceph-deploy rgw create node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy rgw create node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] rgw : [('node06', 'rgw.node06')]
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fe8a9b583f8>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function rgw at 0x7fe8aa412050>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.rgw][DEBUG ] Deploying rgw, cluster ceph hosts node06:rgw.node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
sudo: no tty present and no askpass program specified
[ceph_deploy.rgw][ERROR ] connecting to host: node06 resulted in errors: IOError cannot send (already closed?)
[ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs
The error "sudo: no tty present and no askpass program specified" suggests that sudo on node06 wanted a password but had no TTY to prompt on. Checking with journalctl on the new node showed the following errors:
Sep 12 14:51:42 node06 sshd[30495]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30495]: Accepted publickey for ceph from 192.168.20.88 port 37872 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 97 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 97 of user ceph.
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sshd[30497]: Received disconnect from 192.168.20.88 port 37872:11: disconnected by user
Sep 12 14:51:42 node06 sshd[30497]: Disconnected from 192.168.20.88 port 37872
Sep 12 14:51:42 node06 sshd[30495]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:42 node06 systemd-logind[743]: Removed session 97.
Sep 12 14:51:42 node06 systemd[1]: Removed slice User Slice of ceph.
Sep 12 14:51:42 node06 sshd[30521]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 12 14:51:42 node06 sshd[30521]: Accepted publickey for ceph from 192.168.20.88 port 37874 ssh2: RSA SHA256:XBYUcCiYBhdw+V32qwx6x0wex1EhaMiSHuz0gQVayTQ
Sep 12 14:51:42 node06 systemd[1]: Created slice User Slice of ceph.
Sep 12 14:51:42 node06 systemd-logind[743]: New session 98 of user ceph.
Sep 12 14:51:42 node06 systemd[1]: Started Session 98 of user ceph.
Sep 12 14:51:42 node06 sshd[30521]: pam_unix(sshd:session): session opened for user ceph by (uid=0)
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): conversation failed
Sep 12 14:51:42 node06 sudo[30526]: pam_unix(sudo:auth): auth could not identify password for [ceph]
Sep 12 14:51:45 node06 sudo[30526]: ceph : user NOT in sudoers ; TTY=unknown ; PWD=/home/ceph ; USER=root ; COMMAND=/bin/python2 -c import sys;exec(eval(sys.stdin.readline()))
Sep 12 14:51:45 node06 sshd[30525]: Received disconnect from 192.168.20.88 port 37874:11: disconnected by user
Sep 12 14:51:45 node06 sshd[30525]: Disconnected from 192.168.20.88 port 37874
Sep 12 14:51:45 node06 sshd[30521]: pam_unix(sshd:session): session closed for user ceph
Sep 12 14:51:45 node06 postfix/sendmail[30549]: fatal: parameter inet_interfaces: no local interface found for ::1
The journal confirms the cause: the ceph user was not in sudoers on the new node. Configure passwordless sudo for it:
# cat /etc/sudoers.d/ceph
ceph ALL = (root) NOPASSWD:ALL
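Before rerunning the deployment, it is worth validating the new sudoers drop-in and confirming that passwordless sudo works without a TTY, since that is exactly what ceph-deploy needs. A quick sketch, reusing the SSH port 55556 seen in the logs above:
# Check the sudoers file for syntax errors
$ visudo -cf /etc/sudoers.d/ceph
# sudo -n fails instead of prompting, so success here means no password is needed
$ ssh -p 55556 ceph@node06 sudo -n true && echo "passwordless sudo OK"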
With sudo configured, deploy radosgw:
# Push the config file
ceph-deploy --overwrite-conf config push node06
# Install the packages
ceph-deploy install --no-adjust-repos --nogpgcheck node06
# Create a new rgw instance (the ceph-deploy rgw subcommand only supports create)
ceph-deploy rgw create node06
Full output:
$ ceph-deploy --overwrite-conf config push node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy --overwrite-conf config push node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : True
[ceph_deploy.cli][INFO ] subcommand : push
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f3e96c9e8c0>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] client : ['node06']
[ceph_deploy.cli][INFO ] func : <function config at 0x7f3e96ec9c08>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.config][DEBUG ] Pushing config to node06
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[node06][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
$ ceph-deploy install --no-adjust-repos --nogpgcheck node06
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /bin/ceph-deploy install --no-adjust-repos --nogpgcheck node06
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] testing : None
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fc503ac4758>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] dev_commit : None
[ceph_deploy.cli][INFO ] install_mds : False
[ceph_deploy.cli][INFO ] stable : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] adjust_repos : False
[ceph_deploy.cli][INFO ] func : <function install at 0x7fc5041125f0>
[ceph_deploy.cli][INFO ] install_mgr : False
[ceph_deploy.cli][INFO ] install_all : False
[ceph_deploy.cli][INFO ] repo : False
[ceph_deploy.cli][INFO ] host : ['node06']
[ceph_deploy.cli][INFO ] install_rgw : False
[ceph_deploy.cli][INFO ] install_tests : False
[ceph_deploy.cli][INFO ] repo_url : None
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] install_osd : False
[ceph_deploy.cli][INFO ] version_kind : stable
[ceph_deploy.cli][INFO ] install_common : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] dev : master
[ceph_deploy.cli][INFO ] nogpgcheck : True
[ceph_deploy.cli][INFO ] local_mirror : None
[ceph_deploy.cli][INFO ] release : None
[ceph_deploy.cli][INFO ] install_mon : False
[ceph_deploy.cli][INFO ] gpg_url : None
[ceph_deploy.install][DEBUG ] Installing stable version mimic on cluster ceph hosts node06
[ceph_deploy.install][DEBUG ] Detecting platform for host node06 ...
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connection detected need for sudo
Warning: Permanently added '[node06]:55556,[192.168.20.88]:55556' (ECDSA) to the list of known hosts.
[node06][DEBUG ] connected to host: node06
[node06][DEBUG ] detect platform information from remote host
[node06][DEBUG ] detect machine type
[ceph_deploy.install][INFO ] Distro info: CentOS Linux 7.9.2009 Core
[node06][INFO ] installing Ceph on node06
[node06][INFO ] Running command: sudo yum clean all
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Cleaning repos: base centos-sclo-rh centos-sclo-sclo devops-Extra epel extras
[node06][DEBUG ] : openresty remi-php72 remi-php73 remi-php74 remi-safe salt-latest
[node06][DEBUG ] : tools-repo updates zabbix
[node06][DEBUG ] Cleaning up list of fastest mirrors
[node06][DEBUG ] Other repos take up 23 M of disk space (use --verbose for details)
[node06][INFO ] Running command: sudo yum -y install ceph ceph-radosgw
[node06][DEBUG ] Loaded plugins: fastestmirror
[node06][DEBUG ] Determining fastest mirrors
[node06][DEBUG ] No package ceph available.
[node06][DEBUG ] Nothing to do
[node06][INFO ] Running command: sudo ceph --version
[node06][DEBUG ] ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
Note that yum reported "No package ceph available", most likely because the node already had Ceph installed and the enabled repos did not provide the package; the version check right after confirms the binaries are present. Check whether the new daemon came online:
$ ceph -s
  cluster:
    id:     baf87797-3ec1-4f2c-8126-bf0a44051b13
    health: HEALTH_WARN
            1 pools have many more objects per pg than average

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 2w)
    mgr: node01(active, since 8w), standbys: node02, node03
    mds: kubefs:2 {0=node01=up:active,1=node02=up:active} 1 up:standby
    osd: 13 osds: 13 up (since 6d), 13 in (since 7M)
    rgw: 4 daemons active (node01, node02, node03, node06)
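Before pointing the load balancer at the new instance, it can also be confirmed that it actually answers HTTP requests. A minimal check, assuming radosgw listens on its default port 7480; an anonymous request should come back with an S3-style XML response:
# Expect an HTTP 200 with a ListAllMyBucketsResult body for the anonymous user
$ curl -i http://node06:7480/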
Client requests reach the radosgw service through a load balancer, and at this point the new instance was receiving no traffic, so the load balancer had to be updated to add the new node. Once traffic is flowing to it, confirm that the old instance is no longer handling business requests before taking it offline. First, check the active connections:
$ netstat -an | grep 7480
tcp 0 0 0.0.0.0:7480 0.0.0.0:* LISTEN
tcp 0 0 192.168.20.84:7480 192.168.20.84:33152 ESTABLISHED
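Rather than checking once, one could poll until no established client connections remain on the RGW port; a simple sketch:
# Re-run the connection check every 5 seconds until the old instance drains
$ watch -n 5 'netstat -an | grep 7480 | grep ESTABLISHED'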
Also confirm by watching the service log:
$ tail -f /var/log/ceph/ceph-client.rgw.node01.log
Once everything checks out, the old instance can be taken offline. Services deployed with ceph-deploy have no removal tooling comparable to cephadm (something like ceph orch rm rgw.xx); simply stop the service through systemd:
$ systemctl -l | grep rados
ceph-radosgw@rgw.node01.service loaded active running Ceph rados gateway
system-ceph\x2dradosgw.slice loaded active active system-ceph\x2dradosgw.slice
ceph-radosgw.target loaded active active ceph target allowing to start/stop all ceph-radosgw@.service instances at once
Stop the service and check memory usage again:
$ systemctl stop ceph-radosgw@rgw.node01.service
# ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1983 0.6 12.8 3358788 2084324 ? Ssl May03 1275:36 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 1991 0.9 12.5 3451788 2033560 ? Ssl May03 1722:12 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph 1721 0.6 11.8 3318456 1920876 ? Ssl May03 1219:27 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 1709 0.5 7.4 1646276 1212516 ? Ssl May03 1050:21 /usr/bin/ceph-mds -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 18979 1.0 4.5 1330064 744972 ? Ssl May03 1937:16 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph 529617 3.7 4.4 1914452 726436 ? Ssl Jul15 3153:14 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
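Note that stopping the unit does not prevent it from starting again at the next boot; to retire the instance permanently, one would also disable it (same unit name as above):
# Remove the unit from the boot sequence so the retired instance stays down
$ systemctl disable ceph-radosgw@rgw.node01.service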
Summary
In this incident I never analyzed why memory usage was high; in my haste I went straight to migration, which means the root cause can no longer be determined after the fact. Next time, analyze the problem and preserve the evidence first, and only then take actions such as migration.
This article was published on Cylon's Collection; please credit the original link when reposting~
Link: https://www.oomkill.com/2024/09/05-5-failed-troubleshooting-for-rgw/
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.