1.1. Scheduling with Taints and Tolerations

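In Kubernetes, taints set on a Node work together with the tolerations declared by a Pod to control which Pods may be scheduled onto which nodes. A Pod can be scheduled onto a node only if it tolerates that node's taints.

Note that taints are set on Nodes, whereas tolerations are set on Pods.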

1.2. Taints

[root@linux-node1 ~]# kubectl describe node linux-node1.example.com | grep Taints
Taints:             node-role.kubernetes.io/master:NoSchedule

A taint is written in the form

<key>=<value>:<effect>

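key: the key of the taint.
value: the value of the taint. In the output above the value is empty, so no = is shown.
effect: what happens to Pods when the node carries this taint.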

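There are three kinds of effect:

NoSchedule: a Pod that does not tolerate the taint is not scheduled onto the node.
PreferNoSchedule: the scheduler tries to avoid placing the Pod on the node, but may still place it there if no other node is available.
NoExecute: NoSchedule and PreferNoSchedule act only at scheduling time, whereas NoExecute also affects Pods that are already running. If a node is given a NoExecute taint, Pods running on it that do not tolerate the taint are evicted from the node.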
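
For reference, the same key, value, and effect triple appears in structured form under spec.taints of the Node object. The fragment below is a minimal sketch of what that field might look like; it combines the master taint shown above with a node-type=edge taint on a single node purely for illustration, which is an assumption and not the state of any node in this cluster.

```yaml
# Fragment of a Node object (illustrative, not a complete manifest).
spec:
  taints:
  # Empty value: displayed as node-role.kubernetes.io/master:NoSchedule
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  # With a value: displayed as node-type=edge:NoSchedule
  - key: node-type
    value: edge
    effect: NoSchedule
```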

1.3. Tolerations

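Check why Flannel can be scheduled onto the master node. In the describe output below, the first toleration, :NoSchedule, has an empty key, so it matches every taint whose effect is NoSchedule, including the master taint.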

[root@linux-node1 ~]# kubectl get po -n kube-system | grep flannel
kube-flannel-ds-amd64-f2jrk                       1/1     Running   2          22h
kube-flannel-ds-amd64-mh75v                       1/1     Running   2          22h
kube-flannel-ds-amd64-n52zm                       1/1     Running   4          22h

[root@linux-node1 ~]# kubectl describe pod kube-flannel-ds-amd64-f2jrk -n kube-system
...
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule

1.3.1. Custom taints

Example: keep ordinary Pods off the Node that is reserved for running the Ingress Controller.

Set the taint

[root@linux-node1 ~]# kubectl taint node linux-node2.example.com node-type=edge:NoSchedule
node/linux-node2.example.com tainted

Create a Deployment to test it

[root@linux-node1 ~]# kubectl run taint-test --image=alpine --replicas 3 sleep 360000
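
On newer kubectl releases (roughly 1.18 and later), kubectl run creates a single Pod rather than a Deployment, so the --replicas flag in the command above may be rejected or ignored. A roughly equivalent Deployment manifest, as a sketch, could look like the following; the label app: taint-test and the container name are assumptions chosen for this example.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: taint-test        # assumed label, not from the original command
  template:
    metadata:
      labels:
        app: taint-test
    spec:
      containers:
      - name: taint-test     # assumed container name
        image: alpine
        # Same arguments as passed to kubectl run: sleep 360000
        args: ["sleep", "360000"]
```

It could then be created with kubectl create -f, in the same way as the Nginx Deployment later in this section.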

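Check where the Pods were scheduled. All three replicas land on linux-node3, because they do not tolerate the taint on linux-node2.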

[root@linux-node1 ~]# kubectl get pod -o wide | grep taint-test
taint-test-5967486bf-mqvj7                1/1     Running   0          54s     10.2.2.52   linux-node3.example.com   <none>           <none>
taint-test-5967486bf-stqp9                1/1     Running   0          54s     10.2.2.54   linux-node3.example.com   <none>           <none>
taint-test-5967486bf-vtwrb                1/1     Running   0          54s     10.2.2.53   linux-node3.example.com   <none>           <none>

1.3.2. Custom tolerations

Create an Nginx Deployment that sets a toleration

[root@linux-node1 example]# cat nginx-deployment-taint.yaml    
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.13.12
        ports:
        - containerPort: 80
      tolerations:
      - key: node-type
        operator: Equal
        value: edge
        effect: NoSchedule

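Create it and check the scheduling result. A toleration only allows the Pod onto the tainted node; it does not force it there, so one replica still lands on linux-node3.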

[root@linux-node1 ~]# kubectl create -f nginx-deployment-taint.yaml 
deployment.apps/nginx-deployment created
[root@linux-node1 ~]# kubectl get pod -o wide | grep nginx-deployment
nginx-deployment-7ccbbfb64c-24gfn         1/1     Running   0          15s     10.2.1.35   linux-node2.example.com   <none>           <none>
nginx-deployment-7ccbbfb64c-fk26t         1/1     Running   0          15s     10.2.1.36   linux-node2.example.com   <none>           <none>
nginx-deployment-7ccbbfb64c-qtv2w         1/1     Running   0          15s     10.2.2.55   linux-node3.example.com   <none>           <none>

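operator supports two options, Equal and Exists. Equal is the default operator and requires the toleration's value to match the taint's value; Exists matches on the taint's key alone, regardless of its value.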
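
To illustrate the Exists operator, here is a minimal sketch of a tolerations fragment for a Pod spec. With Exists, no value is given. The second entry omits the key as well, which is the form behind the :NoSchedule line in the Flannel output earlier in this section.

```yaml
# Pod spec fragment (illustrative).
tolerations:
# Exists with a key: tolerates any node-type taint with effect NoSchedule,
# whatever its value.
- key: node-type
  operator: Exists
  effect: NoSchedule
# Exists with no key: tolerates every taint with effect NoSchedule.
- operator: Exists
  effect: NoSchedule
```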

1.3.3. Removing a taint from a node

[root@linux-node1 ~]# kubectl taint node linux-node2.example.com node-type-

1.4. Scheduling guarantees for critical components

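Besides the Kubernetes core components that run on the master machines, such as api-server, scheduler, and controller-manager, there are many add-ons that, for various reasons, must run on a regular cluster node rather than on the Kubernetes master. For example, clusters created with kubeadm run kube-proxy as a DaemonSet. A Pod that is defined as a critical component is given priority when Pods are scheduled.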

Becoming a critical add-on

To become a critical add-on, a Pod must first run in the kube-system Namespace, and it must also meet the following two conditions:

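Set priorityClassName to system-cluster-critical or system-node-critical; the latter is the highest priority in the whole cluster. Alternatively, add an annotation named scheduler.alpha.kubernetes.io/critical-pod whose value is an empty string. This annotation has been deprecated since version 1.13, however, and was slated for removal in 1.14.
Set the tolerations field of the PodSpec to [{"key":"CriticalAddonsOnly", "operator":"Exists"}].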
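
Put together, the two conditions might look like this in a Pod manifest. This is a minimal sketch only: the Pod name, container name, and image are hypothetical placeholders, not taken from any real add-on.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-critical-addon               # hypothetical name
  namespace: kube-system                     # critical add-ons must live here
spec:
  # Condition 1: one of the two built-in critical priority classes.
  priorityClassName: system-cluster-critical
  # Condition 2: tolerate the CriticalAddonsOnly taint.
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  containers:
  - name: addon                              # hypothetical container
    image: registry.example.com/addon:1.0    # placeholder image
```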

kube-proxy example

[root@linux-node1 ~]# kubectl describe pod kube-proxy-7gxrh -n kube-system | grep Only
                 CriticalAddonsOnly
[root@linux-node1 ~]# kubectl describe pod kube-proxy-7gxrh -n kube-system | grep Priority
Priority:             2000001000
Priority Class Name:  system-node-critical

1.5. Default tolerations

[root@linux-node1 ~]# kubectl get pod jenkins-deployment-687ffcd9c4-l7b5k -o yaml
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
...

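If a Pod is created without the two tolerations above, they are added by default.

key: node.kubernetes.io/not-ready: when the node the Pod runs on becomes not ready, the Pod stays bound to that node for 300 seconds; after that it is evicted so that its controller can recreate it on another node.
key: node.kubernetes.io/unreachable: when the node the Pod runs on becomes unreachable, the Pod stays bound to that node for 300 seconds; after that it is evicted so that its controller can recreate it on another node.

If you find 300 seconds too long, you can set a custom time in the Pod definition.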
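
As a sketch of such an override, the fragment below declares both tolerations explicitly in the Pod spec with a shorter tolerationSeconds. The value 60 is an arbitrary choice for illustration, not a recommendation.

```yaml
# Pod spec fragment (illustrative): shorten the wait before eviction.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60    # example value; the default shown above is 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60    # example value; the default shown above is 300
```

In a Deployment, this fragment belongs under spec.template.spec, alongside containers.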
