A Roundup of Kubernetes Problems Encountered in Production

Published: 2023-04-17 09:40:10  Author: wang-hongwei

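I have dealt with many problems in the past without writing them down, or my notes were too messy to search. In the end I decided it would be better to collect them in a single post.
1. Volume permission problems cause the pod to run abnormally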

# Debugging: override command so the container just sleeps, then exec into it and check which uid the application runs as
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000

# Use an initContainer to change the directory permissions
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - chmod 777 /prometheus
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    securityContext:
      runAsUser: 0
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data
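
chmod 777 gets the pod running, but it opens the directory to every user. A narrower variant is sketched below: hand ownership of the directory to the uid the application actually runs as. The uid/gid 65534 is an assumption for illustration; use whatever the debugging step above reports. The pod-level fsGroup field is an alternative that only works on volume types that support ownership changes.

```yaml
# Sketch only: 65534 is an assumed uid/gid, replace it with the uid found while debugging
spec:
  # Alternative to the initContainer (volume types that support it):
  # securityContext:
  #   fsGroup: 65534
  initContainers:
  - name: volume-permissions
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - /bin/sh
    - -c
    - chown -R 65534:65534 /prometheus   # give the directory to the application user only
    securityContext:
      runAsUser: 0                       # chown needs root
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data
```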

2. A lost+found directory created by default in the mounted volume causes database initialization to fail

Initializing database
2023-04-12T08:11:26.631401Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2023-04-12T08:11:26.636640Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting.
2023-04-12T08:11:26.636700Z 0 [ERROR] Aborting

# Debugging: override command so the container just sleeps, then exec into it and delete the lost+found directory
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000

# Inside the container, delete lost+found/
mysql@flashcatcloud-nightingale-database-0:/$ cd /var/lib/mysql
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
lost+found
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ rm -r lost+found/
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls 
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ 

# Alternatively, delete the lost+found directory with an initContainer that mounts the same volume
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    # Remove only lost+found; a wildcard here would wipe existing data on every pod restart
    - rm -rf /var/lib/mysql/lost+found
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    resources: {}
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/mysql/
      name: database-data
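
Another common workaround avoids deleting anything: mount a subdirectory of the volume with subPath, so the data directory MySQL sees never contains lost+found. A minimal sketch, reusing the volume name from above; the container name and the subPath value mysql are arbitrary choices for illustration, not part of the original setup.

```yaml
# Sketch only: the container name and subPath value are illustrative
spec:
  containers:
  - name: database
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: database-data
      subPath: mysql   # mounts <volume root>/mysql; lost+found stays at the volume root
```

This suits a fresh volume best, since data already written to the volume root would no longer be visible at the mount path. Depending on the storage, the new subdirectory may still need the ownership fix from section 1.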

3. A pod stays stuck in the Terminating state

# Check the kubelet log on the node where the pod is running:
failed to "KillPodSandbox" for "a594f4a1-c67b-42c5-84ea-62f7fb1e386d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-b70f6268-4fed-8c40-73f4-2e0ad0d325f4: device or resource busy"

# Fix (run as root on the affected node; a value written to /proc/sys does not survive a reboot)
echo 1 > /proc/sys/fs/may_detach_mounts

# Further reading: sysctl configuration for a production Kubernetes cluster (pure shell)
https://www.boysec.cn/boy/f0530e00.html