From: David S. Miller Date: Mon, 1 Jun 2020 18:42:01 +0000 (-0700) Subject: Merge branch 'fix-indirect-flow_block-infrastructure' X-Git-Tag: v5.8-rc1~165^2~30 X-Git-Url: https://www.infradead.org/git/?a=commitdiff_plain;h=2aec17f1997b514bc84951f99f7dae9b4b6d8b26;p=users%2Fgriffoul%2Flinux.git Merge branch 'fix-indirect-flow_block-infrastructure' Pablo Neira Ayuso says: ==================== the indirect flow_block infrastructure, revisited This series fixes b5140a36da78 ("netfilter: flowtable: add indr block setup support") that adds support for the indirect block for the flowtable. This patch crashes the kernel with the TC CT action. [ 630.908086] BUG: kernel NULL pointer dereference, address: 00000000000000f0 [ 630.908233] #PF: error_code(0x0000) - not-present page [ 630.908304] PGD 800000104addd067 P4D 800000104addd067 PUD 104311d067 PMD 0 [ 630.908380] Oops: 0000 [#1] SMP PTI [ 630.908615] RIP: 0010:nf_flow_table_indr_block_cb+0xc0/0x190 [nf_flow_table] [ 630.908690] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 75 a0 4c 89 65 a8 4d 89 ee 49 89 dd 4c 89 fe 48 c7 c7 b7 64 36 a0 31 c0 e8 ce ed d8 e0 <49> 8b b7 f0 00 00 00 48 c7 c7 c8 64 36 a0 31 c0 e8 b9 ed d8 e0 49[ 630.908790] RSP: 0018:ffffc9000895f8c0 EFLAGS: 00010246 [...] [ 630.910774] Call Trace: [ 630.911192] ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core] [ 630.911621] ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core] [ 630.912040] ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core] [ 630.912443] flow_block_cmd+0x51/0x80 [ 630.912844] __flow_indr_block_cb_register+0x26c/0x510 [ 630.913265] mlx5e_nic_rep_netdevice_event+0x9e/0x110 [mlx5_core] [ 630.913665] notifier_call_chain+0x53/0xa0 [ 630.914063] raw_notifier_call_chain+0x16/0x20 [ 630.914466] call_netdevice_notifiers_info+0x39/0x90 [ 630.914859] register_netdevice+0x484/0x550 [ 630.915256] __ip_tunnel_create+0x12b/0x1f0 [ip_tunnel] [ 630.915661] ip_tunnel_init_net+0x116/0x180 [ip_tunnel] [ 630.916062] ipgre_tap_init_net+0x22/0x30 [ip_gre] [ 630.916458] ops_init+0x44/0x110 [ 630.916851] register_pernet_operations+0x112/0x200 A workaround patch to cure this crash has been proposed. However, there is another problem: The indirect flow_block still does not work for the new TC CT action. The problem is that the existing flow_indr_block_entry callback assumes you can look up for the flowtable from the netdevice to get the flow_block. This flow_block allows you to offload the flows via TC_SETUP_CLSFLOWER. Unfortunately, it is not possible to get the flow_block from the TC CT flowtables because they are _not_ bound to any specific netdevice. = What is the indirect flow_block infrastructure? The indirect flow_block infrastructure allows drivers to offload tc/netfilter rules that belong to software tunnel netdevices, e.g. vxlan. This indirect flow_block infrastructure relates tunnel netdevices with drivers because there is no obvious way to relate these two things from the control plane. = How does the indirect flow_block work before this patchset? Front-ends register the indirect block callback through flow_indr_add_block_cb() if they support for offloading tunnel netdevices. == Setting up an indirect block 1) Drivers track tunnel netdevices via NETDEV_{REGISTER,UNREGISTER} events. If there is a new tunnel netdevice that the driver can offload, then the driver invokes __flow_indr_block_cb_register() with the new tunnel netdevice and the driver callback. The __flow_indr_block_cb_register() call iterates over the list of the front-end callbacks. 2) The front-end callback sets up the flow_block_offload structure and it invokes the driver callback to set up the flow_block. 3) The driver callback now registers the flow_block structure and it returns the flow_block back to the front-end. 4) The front-end gets the flow_block object and it is now ready to offload rules for this tunnel netdevice. A simplified callgraph is represented below. Front-end Driver NETDEV_REGISTER | __flow_indr_block_cb_register(netdev, cb_priv, driver_cb) | [1] .--------------frontend_indr_block_cb(cb_priv, driver_cb) | . setup_flow_block_offload(bo) | [2] driver_cb(bo, cb_priv) -----------. | \/ set up flow_blocks [3] | add rules to flow_block <---------- TC_SETUP_CLSFLOWER [4] == Releasing the indirect flow_block There are two possibilities, either tunnel netdevice is removed or a netdevice (port representor) is removed. === Tunnel netdevice is removed Driver waits for the NETDEV_UNREGISTER event that announces the tunnel netdevice removal. Then, it calls __flow_indr_block_cb_unregister() to remove the flow_block and rules. Callgraph is very similar to the one described above. === Netdevice is removed (port representor) Driver calls __flow_indr_block_cb_unregister() to remove the existing netfilter/tc rule that belong to the tunnel netdevice. = How does the indirect flow_block work after this patchset? Drivers register the indirect flow_block setup callback through flow_indr_dev_register() if they support for offloading tunnel netdevices. == Setting up an indirect flow_block 1) Frontends check if dev->netdev_ops->ndo_setup_tc is unset. If so, frontends call flow_indr_dev_setup_offload(). This call invokes the drivers' indirect flow_block setup callback. 2) The indirect flow_block setup callback sets up a flow_block structure which relates the tunnel netdevice and the driver. 3) The front-end uses flow_block and offload the rules. Note that the operational to set up (non-indirect) flow_block is very similar. == Releasing the indirect flow_block === Tunnel netdevice is removed This calls flow_indr_dev_setup_offload() to set down the flow_block and remove the offloaded rules. This alternate path is exercised if dev->netdev_ops->ndo_setup_tc is unset. === Netdevice is removed (port representor) If a netdevice is removed, then it might need to to clean up the offloaded tc/netfilter rules that belongs to the tunnel netdevice: 1) The driver invokes flow_indr_dev_unregister() when a netdevice is removed. 2) This call iterates over the existing indirect flow_blocks and it invokes the cleanup callback to let the front-end remove the tc/netfilter rules. The cleanup callback already provides the flow_block that the front-end needs to clean up. Front-end Driver | flow_indr_dev_unregister(...) | iterate over list of indirect flow_block and invoke cleanup callback | .----------------------------- | . frontend_flow_block_cleanup(flow_block) . | \/ remove rules to flow_block TC_SETUP_CLSFLOWER = About this patchset This patchset aims to address the existing TC CT problem while simplifying the indirect flow_block infrastructure. Saving 300 LoC in the flow_offload core and the drivers. The operational gets aligned with the (non-indirect) flow_blocks logic. Patchset is composed of: Patch #1 add nf_flow_table_gc_cleanup() which is required by the netfilter's flowtable new indirect flow_block approach. Patch #2 adds the flow_block_indr object which is actually part of of the flow_block object. This stores the indirect flow_block metadata such as the tunnel netdevice owner and the cleanup callback (in case the tunnel netdevice goes away). This patch adds flow_indr_dev_{un}register() to allow drivers to offer netdevice tunnel hardware offload to the front-ends. Then, front-ends call flow_indr_dev_setup_offload() to invoke the drivers to set up the (indirect) flow_block. Patch #3 add the tcf_block_offload_init() helper function, this is a preparation patch to adapt the tc front-end to use this new indirect flow_block infrastructure. Patch #4 updates the tc and netfilter front-ends to use the new indirect flow_block infrastructure. Patch #5 updates the mlx5 driver to use the new indirect flow_block infrastructure. Patch #6 updates the nfp driver to use the new indirect flow_block infrastructure. Patch #7 updates the bnxt driver to use the new indirect flow_block infrastructure. Patch #8 removes the indirect flow_block infrastructure version 1, now that frontends and drivers have been translated to version 2 (coming in this patchset). ==================== Signed-off-by: David S. Miller --- 2aec17f1997b514bc84951f99f7dae9b4b6d8b26