www.infradead.org Git - users/dwmw2/linux.git/commit
habanalabs: increase timeout during reset
author Oded Gabbay <oded.gabbay@gmail.com>
Fri, 27 Mar 2020 13:38:37 +0000 (16:38 +0300)
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wed, 24 Jun 2020 15:48:47 +0000 (17:48 +0200)
commit 044aaaa8b1b15adb397ce423a6d97920a46b3893
tree 44a72063383add07c4da59e036edb70a82e6cc31
parent afaff825e3a436f9d1e3986530133b1c91b54cd1

[ Upstream commit 7a65ee046b2238e053f6ebb610e1a082cfc49490 ]

During training, the DL framework (e.g. TensorFlow) performs hundreds
of thousands of memory allocations and mappings. If the driver needs
to perform a hard reset during training, it kills the application and
unmaps all of those memory allocations. Unfortunately, because of the large
number of mappings, the driver cannot finish unmapping within the current
timeout (5 seconds). Therefore, increase the timeout to 30 seconds
to avoid a situation where the driver resets the device with active mappings,
which can sometimes cause a kernel bug.

BTW, this doesn't mean we will always spend the full 30 seconds, because the
reset thread checks once per second whether the unmap operation is done.

Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/misc/habanalabs/habanalabs.h
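
The wait described in the commit message (check once per second, give up after the timeout) can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions, not the driver's code: every name below (`OLD_RESET_TIMEOUT_SEC`, `NEW_RESET_TIMEOUT_SEC`, `fake_workload`, `wait_for_unmap`, `simulate`) is hypothetical, and the one-second sleep is replaced by a counter so the example finishes instantly. Only the values 5 and 30 and the one-second check interval come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the values 5 and 30 come from the commit message. */
#define OLD_RESET_TIMEOUT_SEC 5
#define NEW_RESET_TIMEOUT_SEC 30

/* Hypothetical stand-in for a process whose mappings are still being torn down. */
struct fake_workload {
    int seconds_left; /* simulated time still needed to unmap everything */
};

/* Simulated "is the unmap operation done?" check. */
bool fake_unmap_done(const struct fake_workload *w)
{
    return w->seconds_left <= 0;
}

/* Simulated one-second sleep: one second of unmap progress is made. */
void fake_sleep_one_sec(struct fake_workload *w)
{
    w->seconds_left--;
}

/*
 * Check once per (simulated) second until the unmap is done or the timeout
 * expires. Returns the number of seconds waited, or -1 on timeout, which
 * corresponds to the case where the device would be reset with active mappings.
 */
int wait_for_unmap(struct fake_workload *w, int timeout_sec)
{
    int waited = 0;

    assert(timeout_sec > 0);

    while (!fake_unmap_done(w)) {
        if (waited >= timeout_sec)
            return -1;
        fake_sleep_one_sec(w);
        waited++;
    }
    return waited;
}

/* Convenience wrapper: simulate a workload that needs work_sec seconds to unmap. */
int simulate(int work_sec, int timeout_sec)
{
    struct fake_workload w = { .seconds_left = work_sec };

    return wait_for_unmap(&w, timeout_sec);
}
```

In this simulation, a workload that needs 12 seconds to unmap times out under the 5-second limit but completes after 12 seconds under the 30-second limit, without waiting out the remaining 18 seconds.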