Skip to content

DRA kubelet: refactoring #132096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 13, 2025
Merged

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 4, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

  • Refactoring (first commit)
  • Renaming of "plugin name" to "driver name" (second commit)

The rest of the system logs information using "driverName" as key in structured logging. The kubelet should do the same. This also removes the implicit assumption that "plugin name" == "driver name" from all code (one less thing readers must remember).

Which issue(s) this PR fixes:

Related-to: #132058

Special notes for your reviewer:

This is in preparation of #132058.

Does this PR introduce a user-facing change?

DRA kubelet: logging now uses `driverName` like the rest of the Kubernetes components, instead of `pluginName`.

/assign @klueska

Replaced the Manager interface with a simple struct. There was no need for the
interface.

Replaced the global instance of the DRA plugins store with a fresh instanced
owned by the manager. This makes unit testing a bit easier (no need to restore
state, would enable parallelizing long-running tests).

Simplified the plugin.PluginsStore type to just "plugin.Store" because it
stuttered. The Plugin type is kept because having one struct named after its
package is one common exception from the "don't stutter" guideline.
For the sake of clarify, the package gets imported as "draplugin" (=
<parent dir>/<package>).

Removed unused NewManager "node" parameter and replaced direct construction of
cache and manager with calls to NewManager because that is how the manager
should get constructed (less code, too).

Fixed incorrect description of Manager: the plugin store is the entity which
manages drivers, not the manager. The manager is focused on the DRA logic
around ResourceClaims.
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 4, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jun 4, 2025
@k8s-ci-robot k8s-ci-robot removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 4, 2025
@pohly pohly changed the title Dra kubelet refactoring DRA kubelet: refactoring Jun 4, 2025
@pohly
Copy link
Contributor Author

pohly commented Jun 4, 2025

/priority important-soon

This blocks #132058, which is needed for DRA GA.

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 4, 2025
continue
}
client, err := m.draPlugins.NewDRAPluginClient(driverName)
plugin, err := m.draPlugins.GetDRAPlugin(driverName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any documentation that explains that DRA plugin is a part of Kubelet and DRA driver is an external component to which DRA plugin is connecting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminology used elsewhere is that a DRA driver "provides (or deploys) a kubelet plugin".

I don't think we say that "DRA plugin is a part of Kubelet" or "DRA driver is an external component to which DRA plugin is connecting".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Kubelet plugin" sounds like an external entity to me. I think it should be clearly explained somewhere. CSI has similar terminology, do they have an explanation somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with / actually prefer the term plugin here. The driver is the whole thing -- this is the kubelet plugin part of that.

Copy link
Contributor Author

@pohly pohly Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this function does is "give me some plugin for this driver name, I don't care which one". In that sense, "driver name" is more appropriate than "plugin name" because the name is not identifying a specific plugin.

Update: this wasn't about the parameter. As @klueska said, what gets returned here is the local proxy for one plugin, which is represented by one *Plugin instance here.

@@ -35,8 +35,11 @@ import (
"k8s.io/kubernetes/pkg/kubelet/metrics"
)

// Plugin contains information about one registered plugin of a DRA driver.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds very confusing to me. It sounds like DRA driver doesn't register itself, but registers a plugin. Can you explain or refrase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A DRA driver is a higher-level concept. It does register one or more plugins. A driver is identified by its name. A plugin is identified by that same name and the endpoint.

I don't think we we should say that a driver registers itself, because what is "it" in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds logical to me, thanks for the explanations. It still contradicts with Device Plugins terminology and (may be) with the CSI one though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this explanation / comment matches my mental model and sounds good to me.

if err != nil {
return fmt.Errorf("version check of plugin %s failed: %w", pluginName, err)
return fmt.Errorf("invalid supported gRPC versions of DRA driver plugin %s at endpoint %s: %w", driverName, endpoint, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it version or service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the version of the API which defines the service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use service in the error message to match the variable name or rename the API to validateSupportedVersions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message must match what users know about e.g. from the KEP, which is "gRPC versions". We cannot rename the method because it is part of the plugin registration handler interface.

func (h *RegistrationHandler) DeRegisterPlugin(pluginName, endpoint string) {
if p, last := h.draPlugins.remove(pluginName, endpoint); p != nil {
// This logger includes endpoint and pluginName.
func (h *RegistrationHandler) DeRegisterPlugin(driverName, endpoint string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this sounds confusing, at least to me. DeRegisterPlugin doesn't deregister DRA driver, but a plugin with the driver name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this particular parameter to "pluginName" and then map it to "driverName" elsewhere in the code, with the explanation that DRA drivers must register themselves with their driver name.

Would that be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it will be better. After your explanation above we need to decide if we want to stick with the concept of "driver registers one or more plugins" or we want to keep using the terms that were used before for other types of plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Registering multiple endpoints under the same plugin name (using terminology of the registration API here) is a new thing. We cannot rely on other types of plugins for guidance and have to come up with some consistent terminology ourselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me now, thanks for the explanations!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:+1

@pohly
Copy link
Contributor Author

pohly commented Jun 4, 2025

The rest of the system logs information using "driverName" as key in structured logging.

Which system? CSI and Device plugins use "plugin" term quite widely:

Of course the term is used. The question is what it stands for and whether it makes sense here for DRA.

All other DRA components and APIs use "driver name". Even the kubelet metric uses "driver name" as label.

CSI also uses "driver name". They had a separate "plugin name" parameter for CSI registrar and discarded it because it made no sense to use anything other than the "driver name".


// Get lets you retrieve a DRA Plugin by name.
// This method is protected by a mutex.
func (s *pluginsStore) get(pluginName string) *Plugin {
func (s *Store) get(pluginName string) *Plugin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to also panic on line instances := s.store[pluginName]


// GetContainerClaimInfos gets Container ClaimInfo objects
GetContainerClaimInfos(pod *v1.Pod, container *v1.Container) ([]*ClaimInfo, error)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added to simplify mocking in the unit tests, but seems to not used. I agree, it can be removed, unless we come up with unit tests that want to mock Manager, e.g. if they don't want to run gRPC server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@bart0sh
Copy link
Contributor

bart0sh commented Jun 5, 2025

@pohly

CSI also uses "driver name".

Is this term is the same for CSI and DRA? Technically all 3 types of kubelet plugins (or drivers?) are similar - they can register more than one plugin with the same name, but diffferent sockets, they have common registration mechaninsm etc.
If CSI and Device Plugins use different terms (plugin and driver) and DRA wants to use either of them it's ok. If DRA wants to use something different, I'm not sure we want to add to the mess we already have in the area.

@pohly
Copy link
Contributor Author

pohly commented Jun 5, 2025

If CSI and Device Plugins use different terms (plugin and driver) and DRA wants to use either of them it's ok. If DRA wants to use something different, I'm not sure we want to add to the mess we already have in the area.

The terminology used in this PR is consistent with CSI. A "CSI driver" and "DRA driver" are more than just a kubelet plugin.

@bart0sh
Copy link
Contributor

bart0sh commented Jun 5, 2025

The terminology used in this PR is consistent with CSI. A "CSI driver" and "DRA driver" are more than just a kubelet plugin.

Technically Device Plugin is not different from CSI or DRA. It just doesn't use "driver" term.

I see then where it all came from. Device Plugin terminology is used in the DRA Kubelet code, but the rest of DRA(control plane) used CSI terminology. From this perspective, I agree, renaming plugin to driver makes sense. However, this concept of Driver/Plugin/DRA Manager should be documented somewhere. Plugin term is also overloaded. Generally it means gRPC server serving DRA service, but the term is used also for structures inside Kubelet responcible for the client operations (establishing connectons to the gRPC server, registering external plugins, etc.

Comment on lines -33 to -35
// draPlugins map keeps track of all registered DRA plugins on the node
// and their corresponding sockets.
var draPlugins = &pluginsStore{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now embedded in the Manager type -- great -- I have never liked that this was a global variable.

// If a kubeClient is provided, then it synchronizes ResourceSlices
// with the resource information provided by plugins.
func NewRegistrationHandler(kubeClient kubernetes.Interface, getNode func() (*v1.Node, error), wipingDelay time.Duration) *RegistrationHandler {
func NewRegistrationHandler(draPlugins *Store, kubeClient kubernetes.Interface, getNode func() (*v1.Node, error), wipingDelay time.Duration) *RegistrationHandler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels awkward to have this take an already initialized (but empty) *Store object which it then caches locally and populates. Does it make sense to have this return (*RegistrationHandler, *Store) so that the component that actually populates this object owns this object and the one that consumes it is the one that caches a reference to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I see it, RegistrationHandler is one layer above the central store. It would be possible to use a store without a registration handler, or with some other way of registering endpoints. With that in mind, constructing the store separately seems consistent.

If this feels like overkill, then let's eliminate the RegistrationHandler. The store can implement cache.PluginHandler directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not particularly care about RegistrationHandler being separate. I kept it because it was that way before and only clarified the layering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the store implementing cache.PluginHandler directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is easier to implement on top of b92cf11 because that commit already moves some of the code from the RegistrationHandler into the Store.

I've pulled that commit into this PR and completed the removal of RegistrationHandler. Definitely the right choice, the end result is less complex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska

I like the idea of the store implementing cache.PluginHandler directly

Do you also like an idea of the store implementing plugin connection management?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well at this point, I'd just argue that maybe store isn't the right name for this abstraction, but rather something like PluginManager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds okay to me. You can never have too many managers... okay, perhaps you can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the rename in a separate commit because then the file renaming is handled better in a diff because the files are similar enough: dec98e8

Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but overall looks great. Thanks for the cleanup!

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 5, 2025
@pohly pohly force-pushed the dra-kubelet-refactoring branch from dec98e8 to 7f975b8 Compare June 6, 2025 13:14
pohly added 3 commits June 6, 2025 18:24
The rest of the system logs information using "driverName" as key in structured
logging. The kubelet should do the same.

This also gets clarified in the code, together with using consistent a
consistent name for a Plugin pointer: "plugin" instead of "client" or
"instance".

The New in NewDRAPluginClient made no sense because it's not constructing
anything, and it returns a plugin, not a client -> GetDRAPlugin.
This moves wiping into the plugins store. The advantages are:
- Only a single mutex is needed in the plugin store.
- The code which decides about queuing and canceling wiping
  has access to all relevant information while that information
  is protected against concurrent modifications.
- It prepares for making that code more complex ("connection
  monitoring").

In retrospect it is not clear whether the RegistrationHandler did the right
thing in all cases when endpoints got registered and removed concurrently.

The RegistrationHandler became a thin, stateless layer on top of the store.
Because it's not really needed anymore, the required methods for the plugin
manager now get provided directly by the store.

The disadvantage is the slightly more complex initialization, but that was also
a problem before which just hadn't been solved: wiping ran without the context
of the manager. Now it does.
After merging with RegistrationHandler, the store abstraction is more than a
dump map. It implements additional logic, so renaming it seems warranted.
To avoid confusion with other "plugin managers", the DRA prefix is used.
The plugin type gets updated accordingly.

This is done in a separate commit for ease of review. The "store" field is
kept because that really is just a dump lookup structure.
@pohly pohly force-pushed the dra-kubelet-refactoring branch from 7f975b8 to 7b1f499 Compare June 6, 2025 16:25
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 6, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some concerns with the approach taken here to introduce an overall DRAPluginMananger abstraction as a replacement for the separate PluginStore and PluginRegistration abstractions: #132058 (comment)

However, I find merging them into a single entity both logical and elegant, since it now puts all Plugin management in a single place and removes the need for cross-component locking to ensure specific operations happen atomically.

As such, I am inclined to approve this PR as-is.

We can always revisit this and separate them again in the future if we find a good reason to do so (e.g. testing becomes too convoluted).

@klueska
Copy link
Contributor

klueska commented Jun 13, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 6f423bb28875bdfd9fd3f41321b3a97e6e56ed26

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 54291a5 into kubernetes:master Jun 13, 2025
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 13, 2025
@github-project-automation github-project-automation bot moved this from Triage to Done in SIG Node CI/Test Board Jun 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

4 participants