-
Notifications
You must be signed in to change notification settings - Fork 40.7k
DRA: Allow subrequests to set count to zero with AllocationMode Exact #131206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What happens if the 0 count request is in the middle of the list? |
My expectation is that it will always be picked if the subrequests before it were not allocatable. We need testcases for this, if we want to allow it. |
for i, toleration := range request.Tolerations { | ||
allErrs = append(allErrs, validateDeviceToleration(toleration, fldPath.Child("tolerations").Index(i))...) | ||
} | ||
return allErrs | ||
} | ||
|
||
func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path) field.ErrorList { | ||
func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path, subRequest bool) field.ErrorList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than take a bool, would it be more expressive to take a minCount int
here, and then define two consts:
const (
deviceSubRequestExactCountMin = 0
exactDeviceRequestExactCountMin = 1
)
and then pass those consts in in the two functions above
func validateDeviceSubRequest(subRequest resource.DeviceSubRequest, fldPath *field.Path, stored bool) field.ErrorList {
...
allErrs = append(allErrs, validateDeviceAllocationMode(subRequest.AllocationMode, subRequest.Count, fldPath.Child("allocationMode"), fldPath.Child("count"), deviceSubRequestExactCountMin)...)
...
func validateExactDeviceRequest(request resource.ExactDeviceRequest, fldPath *field.Path, stored bool) field.ErrorList {
...
allErrs = append(allErrs, validateDeviceAllocationMode(request.AllocationMode, request.Count, fldPath.Child("allocationMode"), fldPath.Child("count"), exactDeviceRequestExactCountMin)...)
and then finally we can simplify this func
func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path, minCount int) field.ErrorList {
...
case resource.DeviceAllocationModeExactCount:
if count <= minCount {
allErrs = append(allErrs, field.Invalid(countFldPath, count, fmt.Sprintf("must be greater than %d", minCount)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR, so I don't think this suggestion is applicable anymore. It still uses boolean params though, so I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my suggestion holds. I would prefer this:
if count <= minCount {
allErrs = append(allErrs, field.Invalid(countFldPath, count, fmt.Sprintf("must be greater than %d", minCount)))
}
to this:
if allowZeroCount {
if count < 0 {
allErrs = append(allErrs, field.Invalid(countFldPath, count, "must be zero or greater"))
mortent marked this conversation as resolved.
}
} else {
if count <= 0 {
allErrs = append(allErrs, field.Invalid(countFldPath, count, "must be greater than zero"))
}
}
(requires some more work to get there obviously, but hopefully this is clear enough for you to make a determination if you agree)
75ec989
to
6cff267
Compare
@johnbelamaric @pohly I've updated this PR to only allow zero count for the last entry in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach makes sense to me, just some nits on the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code already worked... 😄
Please replace "Fixes: #131194" with "Related-to", then let's close the issue once the KEP and user docs are also updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
API review and implementation.
LGTM label has been added. Git tree hash: 37673ac3234febbc3caa3777fb388d5ef9ad7741
|
allErrs = append(allErrs, validateSetWithIndex(request.FirstAvailable, resource.FirstAvailableDeviceRequestMaxSize, | ||
func(subRequest resource.DeviceSubRequest, fldPath *field.Path, index int) field.ErrorList { | ||
lastInList := index == len(request.FirstAvailable)-1 | ||
return validateDeviceSubRequest(subRequest, fldPath, stored, lastInList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth calling out that this change of validation is okay because it only affects an alpha-level field.
/assign @liggitt |
@tallclair will take a first pass on the API review bit |
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The defaulting code will need to change. I guess you can change it to not default to 1 if it's the last in the list. Is that the desired behavior? Since it's alpha I think the defaults can still be changed.
// Count is used only when the count mode is "ExactCount". Must be greater than zero. | ||
// Count is used only when the count mode is "ExactCount". Must be greater than zero | ||
// unless the subrequest is the last entry in the FirstAvailable list. For the last entry, | ||
// a count of zero is allowed. | ||
// If AllocationMode is ExactCount and this field is not specified, the default is one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless it's the last entry? This isn't a pointer type, so I don't think you can tell if it's unspecified or zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what was our reasoning behind using ExactCount: 0
vs a new allocation mode, e.g. AllocationMode: None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was discussed a bit in #131194, but essentially we thought that the use-case for setting ExactCount: 0
to be somewhat narrow so adding a new enum for that didn't seem necessary.
But I think the comment from @tallclair here refers to the inaccurate comment on line 810, which seems to suggest that the defaulting logic only applies if the Count
field is not specified. This is not true, since we can distinguish between the field set to 0 and not being set at all. So our defaulting logic kicks in if the field is set to 0. This is not really specific to this PR, but I've updated the comment to be more accurate.
// unless the subrequest is the last entry in the FirstAvailable list. For the last entry, | ||
// a count of zero is allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it valid to have a single entry with a zero count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This PR initially allowed this, but I don't think we should, so I've updated it to only allow Count: 0
for the last entry if there are more than one entry in the FirstAvailable
list.
The drawback here is that the rule for what is allowed becomes a bit more complicated. But we don't allow Count: 0
for regular requests (without subrequests), so it would be strange to allow this for a single subrequest, which is essentially identical to just having a normal request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should stay consistent.
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mortent, pohly The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
// two subrequests in the FirstAvailable list. | ||
// | ||
// If AllocationMode is ExactCount and Count is zero, it will be defaulted to one unless | ||
// the subrequests meets the criteria for having a zero count described above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be less confusing to have an allocation mode that means "run with no devices"? Something like:
allocationMode: [ExactCount | All | None]
count: only used for ExactCount mode, must be >= 1
and allocationMode: None
is only allowed in the last subrequest, and when there are 2+ subrequests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would avoid needing to do position-dependent defaulting of count
, which is really weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the other fields in DeviceSubRequest have meaning in this case? Or would it be better to add a field to the parent DeviceRequest
that means fall back to no devices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is essentially adding optionality into a specific device request at resourceClaim.spec.requests[*].firstAvailable[*]
Is a fallback down inside firstAvailable
the best way to model that, or an Optional *bool
indicator on the parent request? We have precedent for the latter inside various volume types in pods that try to mount a secret or configmap if it exists, but doesn't block if it doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. They do not. So, maybe instead, a separate field makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to say it, but probably we should abandon this PR. I think the "weirdness" introduced by allowing 0 count in some cases, without going the "fully optional" route adds complexity that doesn't satisfy a significant use case. If we had CPU management so we could say "give me a bunch of CPUs instead", then we would satisfy the actual fallback use case better, and we wouldn't need this PR for that.
We are working on a CPU driver, though there's still a lot of fuzziness as to exactly how that will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "weirdness" introduced by allowing 0 count in some cases, without going the "fully optional" route adds complexity
Which complexity? In the validation? I agree that the API is not exactly intuitive. Validation now needs to do some extra tracking of slice indexes and sizes, but that doesn't seem too bad to me.
The allocator doesn't get more complicated. It already handled a zero count because we had to solve the "we need to determine the number of devices per request during allocation" problem already because of the alternatives, and "zero devices" isn't special.
satisfy a significant use case
That could be true for now and I am fine with closing the PR because of this. Just beware that if someone then finds a use case, it will be harder to add after beta (validation!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my take-away from this discussion is that if we wanted to add this feature, we want to do it with the Optional *bool
field on the DeviceRequest
. It is more complicated to add to the allocator, but I think @liggitt is correct that it is equivalent to the Count: 0
solution. It is definitely more complicated to add to the allocator (I'm not sure if re-writing to FirstAvailable
is a good solution, but might be worth trying), but the performance impact to the allocator should be the same.
And I think adding an Optional *bool
field is possible after beta, although it does have the risk of introducing regressions in the allocator (ref the discussion on slack).
I do plan to try out the Optional *bool
approach, but I want to get the Mixins feature completed first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm going to take this out of the api-review board until this is ready for another pass. Relabel with api-review when this is active again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/remove-label api-review
@@ -87,6 +87,14 @@ func TestSetDefaultAllocationModeWithSubRequests(t *testing.T) { | |||
}, | |||
}, | |||
}, | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for adding a 2nd device request to these test scenarios? Is it to prove that the first subrequest in the set will behave the same whether or not there are more than one subrequests in the set overall?
E.g., both of the example claim scenarios look like this:
assert.Equal(t, defaultMode, output.Spec.Devices.Requests[0].FirstAvailable[0].AllocationMode)
assert.Equal(t, defaultCount, output.Spec.Devices.Requests[0].FirstAvailable[0].Count)
...
assert.Equal(t, defaultMode, output.Spec.Devices.Requests[1].FirstAvailable[0].AllocationMode)
assert.Equal(t, defaultCount, output.Spec.Devices.Requests[1].FirstAvailable[0].Count)
Are we wanting to prove that Spec.Devices.Requests[n].FirstAvailable[0].AllocationMode
and Spec.Devices.Requests[n].FirstAvailable[0].Count
are always equal?
@liggitt: Those labels are not set on the issue: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This allows users to specify
count: 0
in sub requests on ResourceClaims (and ResourceClaimTemplates). This is useful for supporting scenarios where the last entry inFirstAvailable
is to run the workload without any devices.Which issue(s) this PR fixes:
Related-to: #131194
Special notes for your reviewer:
This is a small enhancements of the Prioritized List feature added in 1.33
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/wg device-management
/assign @pohly
/assign @johnbelamaric