For 1.30:
- decide on required kernel/`nft` versions (@danwinship, update client/kernel version requirements for nftables kube-proxy #124152)
  - discussion in the first half of kube-proxy: change implementation of LoadBalancerSourceRanges for wider kernel support #122296, plus some comments in this issue, and then figure out / document nftables version requirements #122743
- add some useful metrics (@aojea / @npinaeva ?)
  - we don't know exactly what these will be: discussion in the KEP
  - clean up existing metrics stuff a bit at the same time? (@danwinship, kube-proxy metrics cleanup (and stuff) #124557)
- add performance job (@aojea scalability job for nftables test-infra#32431)
- change `--nodeport-addresses` behavior to default to "primary node IP(s) only" rather than "all node IPs". (@nayihz, change --nodeport-addresses behavior to default to primary node ip only #122724)
- `reject` connections on invalid ports of service IPs (@aroradaman, proxy/nftables: reject packets destined for invalid ports of service ips #122692)
  - Discussion in the KEP
  - kube-proxy could watch `ServiceCIDR` objects to learn the full service CIDR(s) and reject connections on all service IPs, not just currently-in-use ones.
  - May involve some rewriting of the `@no-endpoint-services`/`@no-endpoint-nodeports` handling, in which case note that we're currently checking no-endpoints nodeports from more places than we need to be (Document the nftables kube-proxy packet flow #122687).
  - Resolve the `UNRESOLVED` section of the KEP after this is implemented.
- add periodic and presubmit e2e tests (add periodic job with nftables proxy test-infra#31525)
- move `danwinship/knftables` to `kubernetes-sigs/knftables` (and eventually declare a v0.1.0 API) (@danwinship, REQUEST: Migrate danwinship/knftables to kubernetes-sigs/knftables org#4673, Update knftables, with new sigs.k8s.io module name #122920)
  - add basic CI (unit tests, gofmt, ...?) to `kubernetes-sigs/knftables` (Add scripts for CI kubernetes-sigs/knftables#3)
- drop the `ct state invalid drop` rule (@aroradaman, pkg/proxy/nftables: drop conntrack state invalid rule #122663)
  - This wasn't discussed in the KEP, but it's fallout from other changes that were happening at the same time; we have a better way of dealing with this bug now (`--conntrack-tcp-be-liberal`) so we should remove the ugly hack from the nftables proxy and push people to use that instead if they need it.
- ensure unit test parity with iptables. (I think all that's left here is adding unit tests for the packet tracing code in `helpers_test.go`.) (@npinaeva, Add ParseDump function to allow using Fake.Dump() output as a test setup. kubernetes-sigs/knftables#2, Split regex for map and set elements to enable elements with colon. kubernetes-sigs/knftables#6, Ensure nftables unit test parity with iptables #123389)
- decide whether to change NodePorts vs LoadBalancerSourceRanges behavior, and if so, do it. (NO)
- decide if we are going to change anything about session affinity. (NO)
  - Discussion in the KEP. I'm not sure if anyone has suggested any particular benefits of the ipvs-like behavior vs the iptables-like behavior.
  - other discussion about changes to affinity behavior: test: demote service ClientIP affinity timeout tests from conformance #112806
  - also relevant on the LoadBalancer side (loadbalancer tests should not assume particular cloud providers do/don't support particular features #123714); some LB implementations can't implement the current semantics.
  - Resolve the `UNRESOLVED` section of the KEP after a decision is made.
- decide if we are using comments well in the ruleset. (I GUESS SO?)
  - We don't need many of the comments the iptables ruleset uses because our chain names are self-documenting.
  - In general I tried to put comments on objects (chains, sets, maps) rather than on individual rules, but there is some inconsistency. (But also, note that you won't see the chain/set/map comments with kernels < 5.10.)
  - Individual rules and set/map elements should have comments in some cases. E.g., elements of `@no-endpoint-services` and `@no-endpoint-nodeports` have comments so that you can see which IP/port goes with which service. But elements of `@service-ips` and `@service-nodeports` don't because you can already figure that out from the names of the chains that they jump to.
Things that depend on us having a perf job (and good metrics) first:
- add iptables-style partial syncing (@npinaeva, [kube-proxy: nftables] Implement partial sync. #126013)
- optimize memory allocation:
  - reuse the buffers used for building the `nft -f -` input, like iptables does. This could be done in a few ways. (Have `Proxier` keep buffers around like iptables does and pass a buffer to `nft.Run()`; have `knftables.realNFTables` keep buffers around itself and pass them to the `Transaction`; have `knftables.realNFTables` keep `Transaction`s around and reuse them (with the transaction storing its buffer); ...)
  - maybe reuse the `knftables.Transaction.operation` arrays somehow?
  - consider optimizing `knftables.Rule` generation wrt `knftables.Concat`. Some discussion here.
- rewrite masquerading to use an nftables set rather than using the mark, as discussed in the KEP
- try out the alternative hairpin rule suggested in the KEP.
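The buffer-reuse item under "optimize memory allocation" could look roughly like the sketch below. This is only an illustration of the general technique (keep one `bytes.Buffer` alive across syncs and `Reset()` it), not the knftables or kube-proxy code: `inputBuilder` and `build` are hypothetical names, and the `nft` commands are placeholder input.

```go
package main

import (
	"bytes"
	"fmt"
)

// inputBuilder is a hypothetical helper (not part of kube-proxy or
// knftables). It owns a single buffer that is reused for every sync.
type inputBuilder struct {
	buf bytes.Buffer
}

// build renders one command per line into the reused buffer. The returned
// slice aliases the buffer, so it is only valid until the next build call.
func (b *inputBuilder) build(commands []string) []byte {
	b.buf.Reset() // empties the buffer but keeps its allocated storage
	for _, c := range commands {
		b.buf.WriteString(c)
		b.buf.WriteByte('\n')
	}
	return b.buf.Bytes()
}

func main() {
	var b inputBuilder

	// Placeholder commands standing in for a real transaction.
	first := string(b.build([]string{
		"add table ip kube-proxy",
		"add chain ip kube-proxy services",
	}))
	capAfterFirst := b.buf.Cap()

	// The second, smaller sync fits in the storage allocated by the first.
	second := string(b.build([]string{"flush chain ip kube-proxy services"}))

	fmt.Print(first, second)
	fmt.Println("storage reused:", b.buf.Cap() == capAfterFirst)
}
```

Which object ends up owning the buffer (the `Proxier`, `realNFTables`, or the `Transaction`) is exactly the open question in the list above; the sketch only shows that whoever owns it must not hand out the bytes beyond the next sync.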
Additional metrics for iptables mode to help users figure out if they'd have trouble migrating:
- Metric to indicate if you are using localhost nodeports. This could be done by making sure there is always a separate rule to catch 127.0.0.1:nodeport rather than letting it be caught by a generic rule, and then checking the counters on those rules and exporting some metric with the result. (@aroradaman, Kube-Proxy: Track packets accepted on localhost nodeports #125015)
- More generally, metric to indicate if you are using non-primary-node-IP nodeports. (Maybe the metric could just indicate every IP/interface that you're making use of nodeports on?)
- Metric to indicate if you are relying on the `--ctstate INVALID -j DROP` rule (and should be using `--conntrack-tcp-be-liberal` instead) (@aroradaman, Metric to track conntrack state invalid packets dropped by iptables #122812)
- Possibly other metrics if we change other behavior
- (We could also add metrics to ipvs mode to help ipvs users, but that's not a blocker for eventually changing the default like the iptables metrics are.)
For 1.31/beta:
- Generally better documentation (mostly in `content/en/docs/reference/networking/virtual-ips.md` in k/website)
  - Document all changes from iptables, including anything above not mentioned in the KEP.
  - Document different interaction with firewalls. (Discussion in KEP)
  - Update nftables kube-proxy docs for 1.31 beta website#46541
- periodic e2e test of migration/rollback between iptables/ipvs and nftables
- consider the suggestion of making it possible to run multiple instances of kube-proxy (@uablrek, Multiple kube-proxy instances #122814) (NO)
  - Discussion in the KEP PR: KEP-3866: kube-proxy nftables mode enhancements#3824 (comment), KEP-3866: kube-proxy nftables mode enhancements#3824 (comment)
  - Resolve the `UNRESOLVED` section of the KEP after a decision is made.
- blog post (nftables kube-proxy blog post for 1.31 website#46969)
For GA:
- Document the "API guarantees" for components interoperating with nftables kube-proxy.
  - Discussion in KEP
  - Resolve the `UNRESOLVED` section of the KEP when this is implemented.
/sig network
/priority important-soon
/triage accepted
cc @aojea @uablrek @aroradaman @tnqn