Kubernetes: Understanding kube-scheduler at the Source Code Level
Introduction
In Kubernetes, the procedure that decides which Node a Pod is placed on is called scheduling, and in a default cluster kube-scheduler is responsible for it. This article follows the kube-scheduler source code in execution order and explains the logic by which the Node for a Pod is decided.
This article was written as supplementary material for Kubernetes Internal #3, a study group on the internal implementation of Kubernetes. The source code referenced in the text is version v1.19.4.
Scheduler overview
As background for reading the source code, this section covers the rough flow of scheduling and gives an overview of the Scheduling Framework.
Scheduling flow
A Pod that has just been created by a ReplicaSet or a Job has no Node assigned yet. Deciding that Node is the role of kube-scheduler. The kubelet running on each Node then watches the result and actually starts the Pods assigned to its own Node. That is how Kubernetes operates.
Think of the Pods whose placement is still undecided as being stored in a queue. The main part of kube-scheduler works as follows.
1. Take one Pod out of the queue
2. Filter the Nodes down to those on which the Pod could be placed
3. Choose the best of the feasible Nodes
4. Create a Binding resource through the API server
Steps 1 to 3 are called the scheduling cycle, and step 4 is called the binding cycle. The scheduling cycle is serialized and always handles one Pod at a time, whereas the binding cycle runs in a goroutine per Pod. Once the scheduling cycle has fixed the Node in step 3 and started the goroutine for the binding cycle, it begins the scheduling cycle for the next Pod without waiting for the binding to finish.
Note that when step 2 filters the Nodes down to those that meet the conditions, it is possible that none remain. In that case Kubernetes tries to make room on a Node by deleting running Pods whose Priority is lower than that of the Pod being scheduled. This procedure is called Preemption.
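The flow above can be sketched as a toy loop. This is only an illustration: the `node` and `pod` types, the `pickNode` function, and the "most free CPU wins" rule are hypothetical simplifications and not kube-scheduler code, and resource accounting after placement is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// node and pod are hypothetical, heavily simplified stand-ins.
type node struct {
	name    string
	freeCPU int
}

type pod struct {
	name string
	cpu  int
}

// pickNode covers steps 2 and 3: drop nodes that cannot fit the pod, then
// pick the best remaining one. "Best" is assumed to mean "most free CPU".
// It returns "" when no node is feasible.
func pickNode(p pod, nodes []node) string {
	best, bestFree := "", -1
	for _, n := range nodes {
		if n.freeCPU < p.cpu { // step 2: filtering
			continue
		}
		if n.freeCPU > bestFree { // step 3: scoring and selection
			best, bestFree = n.name, n.freeCPU
		}
	}
	return best
}

func main() {
	nodes := []node{{"node-a", 2}, {"node-b", 4}}
	queue := []pod{{"web", 1}, {"db", 3}, {"huge", 16}}

	var wg sync.WaitGroup
	for _, p := range queue { // step 1: one Pod at a time, serialized
		host := pickNode(p, nodes)
		if host == "" {
			fmt.Printf("%s: no feasible node, preemption would be attempted\n", p.name)
			continue
		}
		wg.Add(1)
		go func(p pod, host string) { // step 4: binding runs in its own goroutine
			defer wg.Done()
			fmt.Printf("%s: bound to %s\n", p.name, host)
		}(p, host)
	}
	wg.Wait()
}
```

The loop does not wait for a binding goroutine before moving on to the next Pod, which mirrors the split between the two cycles.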

Scheduler Extender
A mechanism is provided for inserting additional logic into the flow described above. If kube-scheduler is configured in advance, it sends an HTTP request to an external server at specific points during scheduling, and the server can change the behavior of kube-scheduler through its response.
The following four points can be extended.
- Filter: receives the result of filtering the Nodes and narrows the candidates further
- Prioritize: adds a prioritization function used when selecting a Node
- Bind: performs additional processing within the binding cycle
- Preempt: receives the Pods that are about to be sacrificed by Preemption and removes any that should not be deleted
Scheduling Framework
Several problems had been pointed out with customizing the scheduling logic through Extenders.
First, there are simply few extension points, so the room for customization is limited. Second, sending JSON webhooks to an external server degrades performance by design. Third, because the webhook server is external, the scheduler cannot control how information is passed across extension points or how errors are handled.
In addition, the kube-scheduler implementation itself had grown bloated through various extensions, and anyone who tried to implement a new custom scheduler externally ended up reimplementing a large part of kube-scheduler.
The Scheduling Framework addresses these problems by providing only the control flow, so that a scheduler can be built with its logic swapped out. The Scheduling Framework defines the following extension points.
- Extension points that act on the scheduling queue
  - QueueSort: changes the ordering function of the priority queue
- Extension points that act on filtering
  - PreFilter
  - Filter
- Extension points that act on scoring
  - PreScore
  - Score
  - NormalizeScore
- Extension points that act on binding
  - Reserve: registers the reservation of the Node, Volumes, and so on in the cache before binding
  - Permit: makes the Pod wait at the start of the binding cycle
  - PreBind
  - Bind
  - PostBind
In this code reading, we follow the flow of scheduling and, along the way, check how these extension points are implemented.

Plugin implementation
A Scheduling Framework plugin must implement the Plugin interface, which returns a Name, as well as the interface for each extension point it supports. The following is the interface for Filter plugins, which filter the Nodes on which a Pod could be placed.
type FilterPlugin interface {
    Plugin
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}
The Scheduling Framework plugins that ship with kube-scheduler are located under pkg/scheduler/framework/plugins.
A plugin generally supports more than one extension point. For example, the NodeAffinity plugin in pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go implements both the Filter and the Score methods, so it influences the behavior at both extension points.
var _ framework.FilterPlugin = &NodeAffinity{}
var _ framework.ScorePlugin = &NodeAffinity{}
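To make the idea of one plugin serving two extension points concrete, here is a minimal sketch. The interfaces are deliberately simplified and hypothetical (the real ones take a context, a CycleState, a *v1.Pod, and so on), and the `topology` plugin with its region and zone fields is invented purely for illustration.

```go
package main

import "fmt"

// pod and node are hypothetical stand-ins for the real API types.
type pod struct{ requiredRegion, preferredZone string }
type node struct{ name, region, zone string }

// Simplified, hypothetical versions of the framework interfaces.
type Plugin interface{ Name() string }

type FilterPlugin interface {
	Plugin
	Filter(p pod, n node) bool
}

type ScorePlugin interface {
	Plugin
	Score(p pod, n node) int64
}

// topology is a made-up plugin that implements both extension points.
type topology struct{}

func (topology) Name() string { return "Topology" }

// Filter rejects nodes outside the required region, if one is set.
func (topology) Filter(p pod, n node) bool {
	return p.requiredRegion == "" || p.requiredRegion == n.region
}

// Score gives the maximum score to nodes in the preferred zone.
func (topology) Score(p pod, n node) int64 {
	if p.preferredZone != "" && p.preferredZone == n.zone {
		return 100
	}
	return 0
}

// Compile-time checks that topology satisfies both interfaces.
var _ FilterPlugin = topology{}
var _ ScorePlugin = topology{}

func main() {
	p := pod{requiredRegion: "tokyo", preferredZone: "a"}
	pl := topology{}
	for _, n := range []node{{"n1", "tokyo", "a"}, {"n2", "tokyo", "b"}, {"n3", "osaka", "a"}} {
		fmt.Println(n.name, pl.Filter(p, n), pl.Score(p, n))
	}
}
```

The blank-identifier assignments cost nothing at run time; they only make the build fail if a method is missing.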
Also, as the Filter method shown above indicates, plugin methods take a *CycleState as an argument. CycleState is a map with locking, and it can be used to share data between multiple extension points and plugins. However, as its name suggests, CycleState is reset every time a scheduling cycle starts, so data can only be shared within the scheduling of a single Pod.
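A map guarded by a lock, as described above, can be sketched as follows. The `cycleState` type, its method names, and the key used in `main` are assumptions made for this example and are not the actual framework implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("not found")

// cycleState is a hypothetical, minimal lock-protected key-value store.
type cycleState struct {
	mu      sync.RWMutex
	storage map[string]interface{}
}

func newCycleState() *cycleState {
	return &cycleState{storage: make(map[string]interface{})}
}

// Write stores a value under key, replacing any previous value.
func (c *cycleState) Write(key string, val interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.storage[key] = val
}

// Read returns the value stored under key, or errNotFound.
func (c *cycleState) Read(key string) (interface{}, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.storage[key]; ok {
		return v, nil
	}
	return nil, errNotFound
}

func main() {
	// One state per scheduling cycle: data written at one extension point
	// can be read at a later one for the same Pod.
	state := newCycleState()
	state.Write("PreFilterExample", 3)
	v, err := state.Read("PreFilterExample")
	fmt.Println(v, err)

	// A fresh state for the next Pod starts out empty.
	_, err = newCycleState().Read("PreFilterExample")
	fmt.Println(err)
}
```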

Implementation of the scheduling cycle
The body of the scheduling cycle is implemented in pkg/scheduler/scheduler.go. When kube-scheduler runs, it starts the SchedulingQueue and then enters an endless loop of scheduleOne.
func (sched *Scheduler) Run(ctx context.Context) {
    if !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {
        return
    }
    sched.SchedulingQueue.Run()
    wait.UntilWithContext(ctx, sched.scheduleOne, 0)
    sched.SchedulingQueue.Close()
}
This scheduleOne is the core of the scheduler's logic, and it corresponds almost exactly to the scheduling flow described earlier. Let's go through the details along the implementation.
Taking out a Pod
At the start, NextPod takes one Pod to be scheduled out of the queue. As described later, this queue behaves as a priority queue ordered by the Priority of the Pods.
podInfo := sched.NextPod()
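A priority queue of this kind can be sketched with the standard container/heap package. This is not the SchedulingQueue implementation: the `queuedPod` type, the `seq` field, and the tie-break by arrival order are assumptions made for the example.

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is a hypothetical queue entry; seq records arrival order.
type queuedPod struct {
	name     string
	priority int32
	seq      int
}

// podHeap implements heap.Interface: higher priority first, then earlier arrival.
type podHeap []queuedPod

func (h podHeap) Len() int { return len(h) }
func (h podHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(queuedPod)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// newQueue builds a heap from the given pods.
func newQueue(pods ...queuedPod) *podHeap {
	h := podHeap(pods)
	heap.Init(&h)
	return &h
}

// nextPod removes and returns the pod that should be scheduled next.
func nextPod(h *podHeap) queuedPod { return heap.Pop(h).(queuedPod) }

func main() {
	q := newQueue(
		queuedPod{"batch", 0, 1},
		queuedPod{"critical", 100, 2},
		queuedPod{"critical-later", 100, 3},
	)
	for q.Len() > 0 {
		fmt.Println(nextPod(q).name)
	}
}
```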
Next, profileForPod obtains the Scheduling Profile that corresponds to the Pod, which decides what logic is applied. If the Pod is being deleted, or if a Pod whose placement has already been decided shows up again, skipPodSchedule(prof, pod) skips it and nothing is done.
pod := podInfo.Pod
prof, err := sched.profileForPod(pod)
// (snip)
if sched.skipPodSchedule(prof, pod) {
    return
}
Finally, before the scheduling cycle begins, the CycleState that plugins use as their data store is reset.
state := framework.NewCycleState()
Deciding the target Node
As described in the overview, deciding the Node on which to place a Pod consists of the following steps.
- Narrow down the candidate Nodes
- Assign a priority to the remaining Nodes
- Select the Node with the highest priority
This series of operations is performed by the Schedule method defined in pkg/scheduler/core/generic_scheduler.go.
scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, prof, state, pod)
Within it, findNodesThatFitPod narrows down the candidate Nodes, prioritizeNodes assigns a priority to the remaining Nodes, and selectHost selects the Node with the highest priority.
feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, prof, state, pod)
priorityList, err := g.prioritizeNodes(ctx, prof, state, pod, feasibleNodes)
host, err := g.selectHost(priorityList)
Filtering Nodes
Let's look further inside findNodesThatFitPod. First, the PreFilter plugins are called.
s := prof.RunPreFilterPlugins(ctx, state, pod)
The actual filtering of Nodes then consists of two further stages.
The first stage is findNodesThatPassFilters, which relies on the Filter plugins.
feasibleNodes, err := g.findNodesThatPassFilters(ctx, prof, state, pod, filteredNodesStatuses)
Inside findNodesThatPassFilters, the compatibility between the Pod and each Node is checked in parallel. checkNode performs the check, and PodPassesFiltersOnNode, which it calls, is the part that actually invokes the Filter plugins.
parallelize.Until(ctx, len(allNodes), checkNode)
checkNode := func(i int) {
    // We check the nodes starting from where we left off in the previous scheduling cycle,
    // this is to make sure all nodes have the same chance of being examined across pods.
    nodeInfo := allNodes[(g.nextStartNodeIndex+i)%len(allNodes)]
    fits, status, err := PodPassesFiltersOnNode(ctx, prof.PreemptHandle(), state, pod, nodeInfo)
    // (snip)
}
As shown at the beginning, a Filter plugin is an interface defined by the Filter method. PodPassesFiltersOnNode runs RunFilterPlugins, which merges the Filter results of all plugins so that a Node is removed from the candidates if even one plugin fails it. The body of this processing is in pkg/scheduler/framework/runtime/framework.go.
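The combination of parallel checks, a rotating start index, and "one failing filter rejects the Node" can be sketched as follows. This is an illustration using plain goroutines; `findFeasible`, `filterFunc`, and the string node names are hypothetical, and the real code uses its own parallelization helper and stops early once enough Nodes are found.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// filterFunc is a hypothetical stand-in for one Filter plugin.
type filterFunc func(node string) bool

// findFeasible checks every node concurrently, starting at index start and
// wrapping around. A node is kept only if every filter accepts it.
func findFeasible(nodes []string, start int, filters []filterFunc) []string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		feasible []string
	)
	checkNode := func(i int) {
		defer wg.Done()
		n := nodes[(start+i)%len(nodes)]
		for _, f := range filters {
			if !f(n) {
				return // a single failing filter rejects the node
			}
		}
		mu.Lock()
		feasible = append(feasible, n)
		mu.Unlock()
	}
	for i := range nodes {
		wg.Add(1)
		go checkNode(i)
	}
	wg.Wait()
	sort.Strings(feasible) // sorted only to make the example deterministic
	return feasible
}

func main() {
	notB := func(n string) bool { return n != "node-b" }
	always := func(string) bool { return true }
	fmt.Println(findFeasible([]string{"node-a", "node-b", "node-c"}, 1, []filterFunc{always, notB}))
}
```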
The second stage, findNodesThatPassExtenders, applies the judgment of the Extenders.
feasibleNodes, err = g.findNodesThatPassExtenders(pod, feasibleNodes, filteredNodesStatuses)
If no candidate Node is found at either stage, processing moves on to the Preemption procedure. For now, let's continue on the assumption that at least one Node was found.
Scoring Nodes
prioritizeNodes ranks the Nodes that survived filtering. As with Filter, the PreScore plugins are called first.
preScoreStatus := prof.RunPreScorePlugins(ctx, state, pod, nodes)
Again as with Filter, scoring is split into the part done by plugins and the part done by Extenders. First comes the score calculation by the Score plugins. The result is returned as a map from plugin name to the scores that plugin gave to the Nodes. Normalization by the NormalizeScore plugins also happens here.
scoresMap, scoreStatus := prof.RunScorePlugins(ctx, state, pod, nodes)
The scores from the Extenders are then added to that result. When several Extenders are registered, the calculation runs in parallel, one goroutine per Extender.
for i := range g.extenders {
    // (snip)
    go func(extIndex int) {
        // (snip)
        prioritizedList, weight, err := g.extenders[extIndex].Prioritize(pod, nodes)
        mu.Lock()
        for i := range *prioritizedList {
            host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
            // (snip)
            combinedScores[host] += score * weight
        }
        mu.Unlock()
    }(i)
}
Note that the scale is adjusted when the plugin scores and the Extender scores are added together.
result[i].Score += combinedScores[result[i].Name] * (framework.MaxNodeScore / extenderv1.MaxExtenderPriority)
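As a worked example of this scale adjustment, the sketch below assumes that framework.MaxNodeScore is 100 and extenderv1.MaxExtenderPriority is 10, which is my understanding of the values in this version; treat both constants, and the `totalScore` helper itself, as assumptions for illustration.

```go
package main

import "fmt"

const (
	maxNodeScore        int64 = 100 // assumed value of framework.MaxNodeScore
	maxExtenderPriority int64 = 10  // assumed value of extenderv1.MaxExtenderPriority
)

// extenderScore is a hypothetical pair of an Extender's score and its weight.
type extenderScore struct {
	score  int64
	weight int64
}

// totalScore adds the weighted Extender scores to the plugin score after
// rescaling them from the Extender range to the plugin range.
func totalScore(pluginScore int64, ext []extenderScore) int64 {
	var combined int64
	for _, e := range ext {
		combined += e.score * e.weight
	}
	return pluginScore + combined*(maxNodeScore/maxExtenderPriority)
}

func main() {
	// Plugin total 150; Extenders give 7 (weight 1) and 5 (weight 2).
	// 150 + (7*1 + 5*2) * (100/10) = 320
	fmt.Println(totalScore(150, []extenderScore{{7, 1}, {5, 2}})) // prints 320
}
```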
Selecting a Node
Finally, selectHost receives priorityList, the per-Node scoring result, and selects one Node. Internally it simply loops over the list and picks the maximum, choosing at random when scores are tied.
host, err := g.selectHost(priorityList)
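A single pass that picks the maximum and breaks ties at random can be written with reservoir sampling, as in the sketch below. The `nodeScore` type and this `selectHost` are a simplified re-creation for illustration, not a copy of the original function.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// nodeScore is a hypothetical pair of a Node name and its total score.
type nodeScore struct {
	name  string
	score int64
}

// selectHost returns the name of a node with the highest score. Among nodes
// tied for the maximum, each is chosen with equal probability.
func selectHost(scores []nodeScore) (string, error) {
	if len(scores) == 0 {
		return "", errors.New("empty priority list")
	}
	selected, maxScore, tied := scores[0].name, scores[0].score, 1
	for _, ns := range scores[1:] {
		switch {
		case ns.score > maxScore:
			selected, maxScore, tied = ns.name, ns.score, 1
		case ns.score == maxScore:
			tied++
			// Replace the current pick with probability 1/tied.
			if rand.Intn(tied) == 0 {
				selected = ns.name
			}
		}
	}
	return selected, nil
}

func main() {
	host, err := selectHost([]nodeScore{{"node-a", 10}, {"node-b", 30}, {"node-c", 20}})
	fmt.Println(host, err)
}
```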
Implementation of the binding cycle
Once the target Node for a Pod has been decided, the operation that actually binds the Pod to that Node is carried out in a goroutine started for each Pod. This allows the scheduling cycle for the next Pod to start first, even when a Pod cannot be started right away because of the wait for Volume provisioning or because of CoScheduling.
Before the processing branches off into a goroutine, the Reserve plugins and the Permit plugins are called.
if sts := prof.RunReservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {...}
runPermitStatus := prof.RunPermitPlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
Looking at what follows the call to the Permit plugins, we can see that an anonymous function is executed in a goroutine.
go func() {
    bindingCycleCtx, cancel := context.WithCancel(ctx)
    defer cancel()
    // (snip)
}()
The first thing executed inside this goroutine is WaitOnPermit. The Pod stays in a waiting state until the Permit plugins allow it. Because this happens after the branch into a goroutine, subsequent Pods can enter the next scheduling cycle while this Pod is kept waiting.
waitOnPermitStatus := prof.WaitOnPermit(bindingCycleCtx, assumedPod)
When the Pod is released from the waiting state, the PreBind plugins are called first.
preBindStatus := prof.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
The actual binding is then performed by bind. Like Filter and Score, bind uses both plugins and Extenders, but this time the Extenders run first, and once the binding has been handled at some point, processing is not passed on to the plugins.
func (sched *Scheduler) bind(ctx context.Context, prof *profile.Profile, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
    // (snip)
    bound, err := sched.extendersBinding(assumed, targetNode)
    if bound {
        return err
    }
    bindStatus := prof.RunBindPlugins(ctx, state, assumed, targetNode)
    // (snip)
}
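The "Extenders first, then plugins" ordering can be sketched as a small chain of handlers. The `binder` type and its `handled` return value are hypothetical simplifications; the real code works with Extender objects and plugin Status values.

```go
package main

import (
	"errors"
	"fmt"
)

// binder is a hypothetical handler. It reports whether it took
// responsibility for the binding and, if so, the resulting error.
type binder func(pod, node string) (handled bool, err error)

// bind tries the Extenders first. The first binder that handles the Pod ends
// the chain, so plugins only run when no Extender took the Pod.
func bind(pod, node string, extenders, plugins []binder) (string, error) {
	for _, b := range extenders {
		if handled, err := b(pod, node); handled {
			return "extender", err
		}
	}
	for _, b := range plugins {
		if handled, err := b(pod, node); handled {
			return "plugin", err
		}
	}
	return "", errors.New("no binder handled the pod")
}

func main() {
	skip := func(pod, node string) (bool, error) { return false, nil }
	ok := func(pod, node string) (bool, error) { return true, nil }

	fmt.Println(bind("web", "node-a", []binder{skip}, []binder{ok})) // plugin <nil>
	fmt.Println(bind("web", "node-a", []binder{ok}, []binder{ok}))   // extender <nil>
}
```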
Finally, the PostBind plugins are called, which completes the whole sequence.
prof.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
Note that if an error occurs in the Volume operations or plugin calls that follow the Reserve plugin call, the reservation has to be cancelled. For this reason, each error handling path calls the Reserve plugins again to undo the reservation.
prof.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
Preemption
So far we have covered the flow when at least one Node is found in the initial filtering stage. In real operation, however, a candidate Node does not always exist.
Let's now look at what happens when no Node that meets the conditions is found at all. In this case Schedule returns a FitError, which starts the Preemption procedure.
scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, prof, state, pod)
if err != nil {
    nominatedNode := ""
    if fitError, ok := err.(*core.FitError); ok {
        if !prof.HasPostFilterPlugins() {
            klog.V(3).Infof("No PostFilter plugins are registered, so no preemption will be performed.")
        } else {
            // Run PostFilter plugins to try to make the pod schedulable in a future scheduling cycle.
            result, status := prof.RunPostFilterPlugins(ctx, state, pod, fitError.FilteredNodesStatuses)
            if status.IsSuccess() && result != nil {
                nominatedNode = result.NominatedNodeName
            }
        }
        // (snip)
    }
    // (snip)
}
Here, nominatedNode records the name of the Node on which room was made by evicting Pods through Preemption. However, the Pod is not guaranteed to end up on this nominatedNode. When it goes through the scheduling cycle again, the space may already have been taken by another Pod.
The logic of Preemption is handled by the PostFilter plugins.
result, status := prof.RunPostFilterPlugins(ctx, state, pod, fitError.FilteredNodesStatuses)
By default, DefaultPreemption is the only plugin that implements PostFilter. Its source code is in pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go.
The body of the processing is the preempt method, which performs the following steps.
- Check whether the Pod is eligible to preempt in the first place
- List the Pods that could be removed
- Ask the Extenders whether any Node should be excluded
- Select the Node where the impact is smallest
- Actually delete the Pods
The core of the logic is selectVictimsOnNode, which does the work of listing the Pods that could be removed. Roughly, it proceeds in the following order.
- First, on each Node, consider the state in which every Pod with a lower Priority than the Pod in question has been removed
- From there, try putting Pods back one at a time, giving precedence to Pods whose removal would affect a PodDisruptionBudget
- Once no more Pods can be put back, the Pods that are still removed are the minimum set of victims needed to place the new Pod
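The "remove everything with lower Priority, then put back as many as possible" idea can be sketched for a single Node as follows. This is a simplification under stated assumptions: CPU is the only resource, PodDisruptionBudget handling is left out, and the `pod` type and `selectVictims` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical stand-in with a single resource dimension.
type pod struct {
	name     string
	priority int32
	cpu      int
}

// selectVictims returns the Pods that must be removed from one Node so that
// the preemptor fits, and false if it cannot fit even after removing every
// lower-priority Pod.
func selectVictims(capacity int, running []pod, preemptor pod) ([]pod, bool) {
	used := 0
	var candidates []pod
	for _, p := range running {
		if p.priority < preemptor.priority {
			candidates = append(candidates, p) // tentatively removed
		} else {
			used += p.cpu
		}
	}
	if used+preemptor.cpu > capacity {
		return nil, false
	}
	// Try to put back higher-priority candidates first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].priority > candidates[j].priority
	})
	var victims []pod
	for _, p := range candidates {
		if used+p.cpu+preemptor.cpu <= capacity {
			used += p.cpu // reprieved: the preemptor still fits
		} else {
			victims = append(victims, p)
		}
	}
	return victims, true
}

func main() {
	running := []pod{{"a", 10, 2}, {"b", 1, 1}, {"c", 2, 1}}
	victims, ok := selectVictims(5, running, pod{"new", 5, 2})
	fmt.Println(victims, ok) // only "b" has to go
}
```

In this example "c" can be put back because the new Pod still fits alongside it, so only "b" is left as a victim.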
Summary
In Kubernetes, the procedure that selects the Node on which a Pod is placed is called scheduling. This article followed the source code of kube-scheduler, the default scheduler, to explain the internal algorithm by which a Node is selected. A notable feature of recent kube-scheduler versions is that the Scheduling Framework makes the logic within that algorithm replaceable.
Although this article did not cover them in depth, the scheduling queue and the cache are two mechanisms that strongly affect scheduler performance in operation. I would like to explain these topics as well if the opportunity arises. But that is another story.