-
Notifications
You must be signed in to change notification settings - Fork 40.7k
DRA kubelet: refactoring #132096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA kubelet: refactoring #132096
Conversation
Replaced the Manager interface with a simple struct. There was no need for the interface. Replaced the global instance of the DRA plugins store with a fresh instanced owned by the manager. This makes unit testing a bit easier (no need to restore state, would enable parallelizing long-running tests). Simplified the plugin.PluginsStore type to just "plugin.Store" because it stuttered. The Plugin type is kept because having one struct named after its package is one common exception from the "don't stutter" guideline. For the sake of clarify, the package gets imported as "draplugin" (= <parent dir>/<package>). Removed unused NewManager "node" parameter and replaced direct construction of cache and manager with calls to NewManager because that is how the manager should get constructed (less code, too). Fixed incorrect description of Manager: the plugin store is the entity which manages drivers, not the manager. The manager is focused on the DRA logic around ResourceClaims.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/priority important-soon This blocks #132058, which is needed for DRA GA. |
pkg/kubelet/cm/dra/manager.go
Outdated
continue | ||
} | ||
client, err := m.draPlugins.NewDRAPluginClient(driverName) | ||
plugin, err := m.draPlugins.GetDRAPlugin(driverName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any documentation that explains that DRA plugin is a part of Kubelet and DRA driver is an external component to which DRA plugin is connecting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The terminology used elsewhere is that a DRA driver "provides (or deploys) a kubelet plugin".
I don't think we say that "DRA plugin is a part of Kubelet" or "DRA driver is an external component to which DRA plugin is connecting".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Kubelet plugin" sounds like an external entity to me. I think it should be clearly explained somewhere. CSI has similar terminology, do they have an explanation somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with / actually prefer the term plugin here. The driver is the whole thing -- this is the kubelet plugin part of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this function does is "give me some plugin for this driver name, I don't care which one". In that sense, "driver name" is more appropriate than "plugin name" because the name is not identifying a specific plugin.
Update: this wasn't about the parameter. As @klueska said, what gets returned here is the local proxy for one plugin, which is represented by one *Plugin
instance here.
pkg/kubelet/cm/dra/plugin/plugin.go
Outdated
@@ -35,8 +35,11 @@ import ( | |||
"k8s.io/kubernetes/pkg/kubelet/metrics" | |||
) | |||
|
|||
// Plugin contains information about one registered plugin of a DRA driver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds very confusing to me. It sounds like DRA driver doesn't register itself, but registers a plugin. Can you explain or refrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A DRA driver is a higher-level concept. It does register one or more plugins. A driver is identified by its name. A plugin is identified by that same name and the endpoint.
I don't think we we should say that a driver registers itself, because what is "it" in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds logical to me, thanks for the explanations. It still contradicts with Device Plugins terminology and (may be) with the CSI one though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this explanation / comment matches my mental model and sounds good to me.
if err != nil { | ||
return fmt.Errorf("version check of plugin %s failed: %w", pluginName, err) | ||
return fmt.Errorf("invalid supported gRPC versions of DRA driver plugin %s at endpoint %s: %w", driverName, endpoint, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it version or service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the version of the API which defines the service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use service
in the error message to match the variable name or rename the API to validateSupportedVersions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message must match what users know about e.g. from the KEP, which is "gRPC versions". We cannot rename the method because it is part of the plugin registration handler interface.
func (h *RegistrationHandler) DeRegisterPlugin(pluginName, endpoint string) { | ||
if p, last := h.draPlugins.remove(pluginName, endpoint); p != nil { | ||
// This logger includes endpoint and pluginName. | ||
func (h *RegistrationHandler) DeRegisterPlugin(driverName, endpoint string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this sounds confusing, at least to me. DeRegisterPlugin doesn't deregister DRA driver, but a plugin with the driver name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change this particular parameter to "pluginName" and then map it to "driverName" elsewhere in the code, with the explanation that DRA drivers must register themselves with their driver name.
Would that be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it will be better. After your explanation above we need to decide if we want to stick with the concept of "driver registers one or more plugins" or we want to keep using the terms that were used before for other types of plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Registering multiple endpoints under the same plugin name (using terminology of the registration API here) is a new thing. We cannot rely on other types of plugins for guidance and have to come up with some consistent terminology ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me now, thanks for the explanations!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:+1
Of course the term is used. The question is what it stands for and whether it makes sense here for DRA. All other DRA components and APIs use "driver name". Even the kubelet metric uses "driver name" as label. CSI also uses "driver name". They had a separate "plugin name" parameter for CSI registrar and discarded it because it made no sense to use anything other than the "driver name". |
|
||
// Get lets you retrieve a DRA Plugin by name. | ||
// This method is protected by a mutex. | ||
func (s *pluginsStore) get(pluginName string) *Plugin { | ||
func (s *Store) get(pluginName string) *Plugin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to also panic on line instances := s.store[pluginName]
|
||
// GetContainerClaimInfos gets Container ClaimInfo objects | ||
GetContainerClaimInfos(pod *v1.Pod, container *v1.Container) ([]*ClaimInfo, error) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added to simplify mocking in the unit tests, but seems to not used. I agree, it can be removed, unless we come up with unit tests that want to mock Manager, e.g. if they don't want to run gRPC server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Is this term is the same for CSI and DRA? Technically all 3 types of kubelet plugins (or drivers?) are similar - they can register more than one plugin with the same name, but diffferent sockets, they have common registration mechaninsm etc. |
The terminology used in this PR is consistent with CSI. A "CSI driver" and "DRA driver" are more than just a kubelet plugin. |
Technically Device Plugin is not different from CSI or DRA. It just doesn't use "driver" term. I see then where it all came from. Device Plugin terminology is used in the DRA Kubelet code, but the rest of DRA(control plane) used CSI terminology. From this perspective, I agree, renaming |
// draPlugins map keeps track of all registered DRA plugins on the node | ||
// and their corresponding sockets. | ||
var draPlugins = &pluginsStore{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now embedded in the Manager
type -- great -- I have never liked that this was a global variable.
// If a kubeClient is provided, then it synchronizes ResourceSlices | ||
// with the resource information provided by plugins. | ||
func NewRegistrationHandler(kubeClient kubernetes.Interface, getNode func() (*v1.Node, error), wipingDelay time.Duration) *RegistrationHandler { | ||
func NewRegistrationHandler(draPlugins *Store, kubeClient kubernetes.Interface, getNode func() (*v1.Node, error), wipingDelay time.Duration) *RegistrationHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels awkward to have this take an already initialized (but empty) *Store
object which it then caches locally and populates. Does it make sense to have this return (*RegistrationHandler, *Store)
so that the component that actually populates this object owns this object and the one that consumes it is the one that caches a reference to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it, RegistrationHandler is one layer above the central store. It would be possible to use a store without a registration handler, or with some other way of registering endpoints. With that in mind, constructing the store separately seems consistent.
If this feels like overkill, then let's eliminate the RegistrationHandler. The store can implement cache.PluginHandler
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not particularly care about RegistrationHandler being separate. I kept it because it was that way before and only clarified the layering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of the store implementing cache.PluginHandler directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is easier to implement on top of b92cf11 because that commit already moves some of the code from the RegistrationHandler into the Store.
I've pulled that commit into this PR and completed the removal of RegistrationHandler. Definitely the right choice, the end result is less complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of the store implementing cache.PluginHandler directly
Do you also like an idea of the store implementing plugin connection management?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well at this point, I'd just argue that maybe store
isn't the right name for this abstraction, but rather something like PluginManager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds okay to me. You can never have too many managers... okay, perhaps you can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the rename in a separate commit because then the file renaming is handled better in a diff because the files are similar enough: dec98e8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, but overall looks great. Thanks for the cleanup!
dec98e8
to
7f975b8
Compare
The rest of the system logs information using "driverName" as key in structured logging. The kubelet should do the same. This also gets clarified in the code, together with using consistent a consistent name for a Plugin pointer: "plugin" instead of "client" or "instance". The New in NewDRAPluginClient made no sense because it's not constructing anything, and it returns a plugin, not a client -> GetDRAPlugin.
This moves wiping into the plugins store. The advantages are: - Only a single mutex is needed in the plugin store. - The code which decides about queuing and canceling wiping has access to all relevant information while that information is protected against concurrent modifications. - It prepares for making that code more complex ("connection monitoring"). In retrospect it is not clear whether the RegistrationHandler did the right thing in all cases when endpoints got registered and removed concurrently. The RegistrationHandler became a thin, stateless layer on top of the store. Because it's not really needed anymore, the required methods for the plugin manager now get provided directly by the store. The disadvantage is the slightly more complex initialization, but that was also a problem before which just hadn't been solved: wiping ran without the context of the manager. Now it does.
After merging with RegistrationHandler, the store abstraction is more than a dump map. It implements additional logic, so renaming it seems warranted. To avoid confusion with other "plugin managers", the DRA prefix is used. The plugin type gets updated accordingly. This is done in a separate commit for ease of review. The "store" field is kept because that really is just a dump lookup structure.
7f975b8
to
7b1f499
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still some concerns with the approach taken here to introduce an overall DRAPluginMananger
abstraction as a replacement for the separate PluginStore
and PluginRegistration
abstractions: #132058 (comment)
However, I find merging them into a single entity both logical and elegant, since it now puts all Plugin management in a single place and removes the need for cross-component locking to ensure specific operations happen atomically.
As such, I am inclined to approve this PR as-is.
We can always revisit this and separate them again in the future if we find a good reason to do so (e.g. testing becomes too convoluted).
/lgtm |
LGTM label has been added. Git tree hash: 6f423bb28875bdfd9fd3f41321b3a97e6e56ed26
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: klueska, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The rest of the system logs information using "driverName" as key in structured logging. The kubelet should do the same. This also removes the implicit assumption that "plugin name" == "driver name" from all code (one less thing readers must remember).
Which issue(s) this PR fixes:
Related-to: #132058
Special notes for your reviewer:
This is in preparation of #132058.
Does this PR introduce a user-facing change?
/assign @klueska