DRA: Allow subrequests to set count to zero with AllocationMode Exact #131206

mortent · 2025-04-08T14:01:19Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This allows users to specify count: 0 in sub requests on ResourceClaims (and ResourceClaimTemplates). This is useful for supporting scenarios where the last entry in FirstAvailable is to run the workload without any devices.

Which issue(s) this PR fixes:

Related-to: #131194

Special notes for your reviewer:

This is a small enhancements of the Prioritized List feature added in 1.33

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/wg device-management
/assign @pohly
/assign @johnbelamaric

k8s-ci-robot · 2025-04-08T14:01:33Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

johnbelamaric · 2025-04-09T10:08:25Z

What happens if the 0 count request is in the middle of the list?

pohly · 2025-04-09T13:07:11Z

My expectation is that it will always be picked if the subrequests before it were not allocatable. We need testcases for this, if we want to allow it.

pkg/apis/resource/validation/validation.go

jackfrancis · 2025-04-10T15:19:24Z

pkg/apis/resource/validation/validation.go

 	for i, toleration := range request.Tolerations {
 		allErrs = append(allErrs, validateDeviceToleration(toleration, fldPath.Child("tolerations").Index(i))...)
 	}
 	return allErrs
 }

-func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path) field.ErrorList {
+func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path, subRequest bool) field.ErrorList {


rather than take a bool, would it be more expressive to take a minCount int here, and then define two consts:

const ( deviceSubRequestExactCountMin = 0 exactDeviceRequestExactCountMin = 1 )

and then pass those consts in in the two functions above

func validateDeviceSubRequest(subRequest resource.DeviceSubRequest, fldPath *field.Path, stored bool) field.ErrorList { ... allErrs = append(allErrs, validateDeviceAllocationMode(subRequest.AllocationMode, subRequest.Count, fldPath.Child("allocationMode"), fldPath.Child("count"), deviceSubRequestExactCountMin)...) ... func validateExactDeviceRequest(request resource.ExactDeviceRequest, fldPath *field.Path, stored bool) field.ErrorList { ... allErrs = append(allErrs, validateDeviceAllocationMode(request.AllocationMode, request.Count, fldPath.Child("allocationMode"), fldPath.Child("count"), exactDeviceRequestExactCountMin)...)

and then finally we can simplify this func

func validateDeviceAllocationMode(deviceAllocationMode resource.DeviceAllocationMode, count int64, allocModeFldPath, countFldPath *field.Path, minCount int) field.ErrorList { ... case resource.DeviceAllocationModeExactCount: if count <= minCount { allErrs = append(allErrs, field.Invalid(countFldPath, count, fmt.Sprintf("must be greater than %d", minCount)))

I've updated the PR, so I don't think this suggestion is applicable anymore. It still uses boolean params though, so I'm open to suggestions.

I think my suggestion holds. I would prefer this:

if count <= minCount { allErrs = append(allErrs, field.Invalid(countFldPath, count, fmt.Sprintf("must be greater than %d", minCount))) }

to this:

if allowZeroCount { if count < 0 { allErrs = append(allErrs, field.Invalid(countFldPath, count, "must be zero or greater")) mortent marked this conversation as resolved. } } else { if count <= 0 { allErrs = append(allErrs, field.Invalid(countFldPath, count, "must be greater than zero")) } }

(requires some more work to get there obviously, but hopefully this is clear enough for you to make a determination if you agree)

mortent · 2025-04-24T03:03:48Z

@johnbelamaric @pohly I've updated this PR to only allow zero count for the last entry in FirstAvailable, ref the discussion on #131194

pohly

The approach makes sense to me, just some nits on the implementation.

pkg/apis/resource/validation/validation.go

pohly · 2025-04-24T09:01:25Z

staging/src/k8s.io/dynamic-resource-allocation/structured/allocator_test.go

The code already worked... 😄

pohly · 2025-04-24T09:03:56Z

Please replace "Fixes: #131194" with "Related-to", then let's close the issue once the KEP and user docs are also updated.

pohly

/lgtm
/approve

API review and implementation.

k8s-ci-robot · 2025-05-08T10:34:17Z

LGTM label has been added.

Git tree hash: 37673ac3234febbc3caa3777fb388d5ef9ad7741

pohly · 2025-05-08T10:51:18Z

pkg/apis/resource/validation/validation.go

+		allErrs = append(allErrs, validateSetWithIndex(request.FirstAvailable, resource.FirstAvailableDeviceRequestMaxSize,
+			func(subRequest resource.DeviceSubRequest, fldPath *field.Path, index int) field.ErrorList {
+				lastInList := index == len(request.FirstAvailable)-1
+				return validateDeviceSubRequest(subRequest, fldPath, stored, lastInList)


It's worth calling out that this change of validation is okay because it only affects an alpha-level field.

mortent · 2025-05-08T21:59:10Z

/assign @liggitt

liggitt · 2025-05-08T22:23:40Z

@tallclair will take a first pass on the API review bit

tallclair · 2025-05-08T22:25:30Z

/assign

tallclair

The defaulting code will need to change. I guess you can change it to not default to 1 if it's the last in the list. Is that the desired behavior? Since it's alpha I think the defaults can still be changed.

staging/src/k8s.io/api/resource/v1beta2/types.go

tallclair · 2025-05-08T23:14:38Z

staging/src/k8s.io/api/resource/v1beta2/types.go

-	// Count is used only when the count mode is "ExactCount". Must be greater than zero.
+	// Count is used only when the count mode is "ExactCount". Must be greater than zero
+	// unless the subrequest is the last entry in the FirstAvailable list. For the last entry,
+	// a count of zero is allowed.
 	// If AllocationMode is ExactCount and this field is not specified, the default is one.


Unless it's the last entry? This isn't a pointer type, so I don't think you can tell if it's unspecified or zero?

what was our reasoning behind using ExactCount: 0 vs a new allocation mode, e.g. AllocationMode: None?

This was discussed a bit in #131194, but essentially we thought that the use-case for setting ExactCount: 0 to be somewhat narrow so adding a new enum for that didn't seem necessary.

But I think the comment from @tallclair here refers to the inaccurate comment on line 810, which seems to suggest that the defaulting logic only applies if the Count field is not specified. This is not true, since we can distinguish between the field set to 0 and not being set at all. So our defaulting logic kicks in if the field is set to 0. This is not really specific to this PR, but I've updated the comment to be more accurate.

tallclair · 2025-05-08T23:20:42Z

staging/src/k8s.io/api/resource/v1beta2/types.go

+	// unless the subrequest is the last entry in the FirstAvailable list. For the last entry,
+	// a count of zero is allowed.


Is it valid to have a single entry with a zero count?

Good point. This PR initially allowed this, but I don't think we should, so I've updated it to only allow Count: 0 for the last entry if there are more than one entry in the FirstAvailable list.

The drawback here is that the rule for what is allowed becomes a bit more complicated. But we don't allow Count: 0 for regular requests (without subrequests), so it would be strange to allow this for a single subrequest, which is essentially identical to just having a normal request.

@pohly

I agree that we should stay consistent.

pkg/apis/resource/validation/validation_resourceclaim_test.go

k8s-ci-robot · 2025-05-09T17:42:27Z

New changes are detected. LGTM label has been removed.

k8s-ci-robot · 2025-05-09T17:42:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, pohly
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

api/OWNERS
pkg/apis/OWNERS
pkg/generated/openapi/OWNERS
staging/src/k8s.io/api/OWNERS
~~staging/src/k8s.io/dynamic-resource-allocation/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

liggitt · 2025-05-09T20:19:19Z

staging/src/k8s.io/api/resource/v1beta1/types.go

+	// two subrequests in the FirstAvailable list.
+	//
+	// If AllocationMode is ExactCount and Count is zero, it will be defaulted to one unless
+	// the subrequests meets the criteria for having a zero count described above.


Would it be less confusing to have an allocation mode that means "run with no devices"? Something like:

allocationMode: [ExactCount | All | None]
count: only used for ExactCount mode, must be >= 1

and allocationMode: None is only allowed in the last subrequest, and when there are 2+ subrequests

That would avoid needing to do position-dependent defaulting of count, which is really weird

Do the other fields in DeviceSubRequest have meaning in this case? Or would it be better to add a field to the parent DeviceRequest that means fall back to no devices?

Yeah, this is essentially adding optionality into a specific device request at resourceClaim.spec.requests[*].firstAvailable[*]

Is a fallback down inside firstAvailable the best way to model that, or an Optional *bool indicator on the parent request? We have precedent for the latter inside various volume types in pods that try to mount a secret or configmap if it exists, but doesn't block if it doesn't exist.

Good question. They do not. So, maybe instead, a separate field makes sense.

I hate to say it, but probably we should abandon this PR. I think the "weirdness" introduced by allowing 0 count in some cases, without going the "fully optional" route adds complexity that doesn't satisfy a significant use case. If we had CPU management so we could say "give me a bunch of CPUs instead", then we would satisfy the actual fallback use case better, and we wouldn't need this PR for that.

We are working on a CPU driver, though there's still a lot of fuzziness as to exactly how that will work.

I think the "weirdness" introduced by allowing 0 count in some cases, without going the "fully optional" route adds complexity

Which complexity? In the validation? I agree that the API is not exactly intuitive. Validation now needs to do some extra tracking of slice indexes and sizes, but that doesn't seem too bad to me.

The allocator doesn't get more complicated. It already handled a zero count because we had to solve the "we need to determine the number of devices per request during allocation" problem already because of the alternatives, and "zero devices" isn't special.

satisfy a significant use case

That could be true for now and I am fine with closing the PR because of this. Just beware that if someone then finds a use case, it will be harder to add after beta (validation!).

So my take-away from this discussion is that if we wanted to add this feature, we want to do it with the Optional *bool field on the DeviceRequest. It is more complicated to add to the allocator, but I think @liggitt is correct that it is equivalent to the Count: 0 solution. It is definitely more complicated to add to the allocator (I'm not sure if re-writing to FirstAvailable is a good solution, but might be worth trying), but the performance impact to the allocator should be the same.

And I think adding an Optional *bool field is possible after beta, although it does have the risk of introducing regressions in the allocator (ref the discussion on slack).

I do plan to try out the Optional *bool approach, but I want to get the Mixins feature completed first.

Ok, I'm going to take this out of the api-review board until this is ready for another pass. Relabel with api-review when this is active again

/remove-label api-review

jackfrancis · 2025-05-16T18:09:55Z

pkg/apis/resource/v1alpha3/defaults_test.go

@@ -87,6 +87,14 @@ func TestSetDefaultAllocationModeWithSubRequests(t *testing.T) {
 							},
 						},
 					},
+					{


What is the reason for adding a 2nd device request to these test scenarios? Is it to prove that the first subrequest in the set will behave the same whether or not there are more than one subrequests in the set overall?

E.g., both of the example claim scenarios look like this:

assert.Equal(t, defaultMode, output.Spec.Devices.Requests[0].FirstAvailable[0].AllocationMode) assert.Equal(t, defaultCount, output.Spec.Devices.Requests[0].FirstAvailable[0].Count) ... assert.Equal(t, defaultMode, output.Spec.Devices.Requests[1].FirstAvailable[0].AllocationMode) assert.Equal(t, defaultCount, output.Spec.Devices.Requests[1].FirstAvailable[0].Count)

Are we wanting to prove that Spec.Devices.Requests[n].FirstAvailable[0].AllocationMode and Spec.Devices.Requests[n].FirstAvailable[0].Count are always equal?

k8s-ci-robot · 2025-05-22T16:28:31Z

@liggitt: Those labels are not set on the issue: api-review

In response to this:

/remove-label api-review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-06-12T21:27:32Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot assigned johnbelamaric Apr 8, 2025

k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 8, 2025

k8s-ci-robot assigned pohly Apr 8, 2025

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Apr 8, 2025

github-project-automation bot added this to Dynamic Resource Allocation Apr 8, 2025

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Apr 8, 2025

k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 8, 2025

github-project-automation bot added this to SIG Node: code and documentation PRs Apr 8, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Apr 8, 2025

k8s-ci-robot requested review from klueska and pohly April 8, 2025 14:01

mortent mentioned this pull request Apr 8, 2025

DRA Prioritized List: allow alternative with "no devices" #131194

Open

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Apr 9, 2025

bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Apr 9, 2025

jackfrancis reviewed Apr 10, 2025

View reviewed changes

pkg/apis/resource/validation/validation.go Show resolved Hide resolved

jackfrancis reviewed Apr 10, 2025

View reviewed changes

mortent force-pushed the AllowZeroCountInSubRequests branch from 75ec989 to 6cff267 Compare April 24, 2025 02:55

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 24, 2025

pohly suggested changes Apr 24, 2025

View reviewed changes

mortent requested a review from pohly May 7, 2025 20:29

pohly approved these changes May 8, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 8, 2025

pohly reviewed May 8, 2025

View reviewed changes

k8s-ci-robot assigned liggitt May 8, 2025

liggitt added this to API Reviews May 8, 2025

liggitt moved this to Assigned in API Reviews May 8, 2025

k8s-ci-robot assigned tallclair May 8, 2025

tallclair reviewed May 8, 2025

View reviewed changes

mortent added 3 commits May 9, 2025 16:52

Fix defaulting

4a70513

Run make update

b76321e

Update defaulting tests and validation

61affde

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2025

k8s-ci-robot requested review from johnbelamaric, liggitt, pohly and tallclair May 9, 2025 17:42

liggitt reviewed May 9, 2025

View reviewed changes

liggitt moved this from Assigned to Changes requested in API Reviews May 14, 2025

jackfrancis reviewed May 16, 2025

View reviewed changes

liggitt removed this from API Reviews May 22, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2025

		// unless the subrequest is the last entry in the FirstAvailable list. For the last entry,
		// a count of zero is allowed.

DRA: Allow subrequests to set count to zero with AllocationMode Exact #131206

Are you sure you want to change the base?

DRA: Allow subrequests to set count to zero with AllocationMode Exact #131206

Conversation

mortent commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

k8s-ci-robot commented Apr 8, 2025

Uh oh!

johnbelamaric commented Apr 9, 2025

Uh oh!

pohly commented Apr 9, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mortent commented Apr 24, 2025

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pohly commented Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented May 8, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mortent commented May 8, 2025

Uh oh!

liggitt commented May 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tallclair commented May 8, 2025

Uh oh!

tallclair left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

k8s-ci-robot commented May 9, 2025

Uh oh!

k8s-ci-robot commented May 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

mortent commented Apr 8, 2025 •

edited

Loading

pohly commented Apr 24, 2025 •

edited

Loading

liggitt commented May 8, 2025 •

edited

Loading