October 22nd, 2025 17:36
G16-7630 RTX 4060 Win11&Ubuntu 22.04LTS - GPU crash during Model training
I am experiencing a sudden and severe issue where the GPU crashes under load. The problem started recently; the same workloads used to run fine. It happens consistently under both Windows and Ubuntu, which I suppose points to a hardware failure.
System Configuration
Model: G16 7630
OS: Dual-Boot: Windows 11 & Ubuntu 22.04 LTS (Kernel 6.5)
GPU: NVIDIA GeForce RTX 4060
NVIDIA Driver: Tried both 535 and 570 - No difference
Failure Description
The GPU goes to 87-93°C followed by a crash.
Trigger: GPU-intensive tasks (gaming in Windows with an external monitor, small model training in Linux).
GPU Temperature: Consistently reaches 90-93°C before crashing.
The "External Monitor" Symptom:
WITH External Monitor: The entire laptop screen goes black and unresponsive. Requires forced shutdown.
WITHOUT External Monitor (Using Laptop Screen Only in Ubuntu): The training process freezes, but the laptop itself remains usable for web browsing etc.
My model is small, GPU memory usage is normal (no overflow), and the same batch size worked perfectly until yesterday. However, when I halve the batch size, training works fine. It also works fine when the laptop is not plugged into the charger (running on battery).
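Since the crash depends on batch size and on whether the charger is connected, it might help to record temperature and power draw right up to the moment of failure. Below is a minimal sketch of how that could be done by polling `nvidia-smi`; it is not something I have validated on this machine. The names `parse_sample`, `read_sample`, `log_forever` and the file name `gpu_log.csv` are made up for illustration, and on some laptop GPUs `power.draw` may be reported as `[N/A]`.

```python
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu=...`.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw",
                "clocks.sm", "utilization.gpu"]


def parse_sample(line):
    """Split one CSV row printed by nvidia-smi into a dict keyed by field."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"unexpected nvidia-smi row: {line!r}")
    return dict(zip(QUERY_FIELDS, values))


def read_sample():
    """Run nvidia-smi once and return the row for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one sample per interval until nvidia-smi fails or is interrupted."""
    with open(path, "a") as f:
        while True:
            sample = read_sample()
            f.write(",".join(sample[k] for k in QUERY_FIELDS) + "\n")
            # Flush and fsync so the last rows survive a forced shutdown.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a second terminal while training runs should leave the last readings before the crash at the end of `gpu_log.csv`. Once the GPU drops off the bus, `nvidia-smi` is expected to fail and the loop stops with an exception.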
Does anyone know what the problem is?
The following are logs:
@robotAI:~$ sudo dmesg -T | grep -E "nvidia|gpu|NVRM|PCIe" | tail -30
[Thu Oct 23 00:42:20 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Thu Oct 23 00:42:20 2025] nvidia: module license taints kernel.
[Thu Oct 23 00:42:20 2025] audit: type=1400 audit(1761151341.580:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=752 comm="apparmor_parser"
[Thu Oct 23 00:42:20 2025] audit: type=1400 audit(1761151341.580:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=752 comm="apparmor_parser"
[Thu Oct 23 00:42:20 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[Thu Oct 23 00:42:21 2025] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[Thu Oct 23 00:42:21 2025] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[Thu Oct 23 00:42:21 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.274.02 Thu Sep 4 22:13:52 UTC 2025
[Thu Oct 23 00:42:21 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.274.02 Thu Sep 4 22:13:13 UTC 2025
[Thu Oct 23 00:42:21 2025] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[Thu Oct 23 00:42:22 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:22 2025] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
[Thu Oct 23 00:42:22 2025] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
[Thu Oct 23 00:42:22 2025] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[Thu Oct 23 00:42:22 2025] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[Thu Oct 23 00:42:22 2025] nvidia-uvm: Loaded the UVM driver, major device number 508.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:23 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:40 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:40 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:42:40 2025] nvidia-modeset: nvidia-modeset: ACPI reported no NVIDIA native backlight available; attempting to use ACPI backlight.
[Thu Oct 23 00:44:13 2025] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Thu Oct 23 00:44:13 2025] NVRM: GPU at PCI:0000:01:00: GPU-225664c6-617f-e3c5-06d8-64c4a60c494e
[Thu Oct 23 00:44:13 2025] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Thu Oct 23 00:44:13 2025] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[Thu Oct 23 00:44:18 2025] NVRM: Error in service of callback
@robotAI:~$ sudo dmesg -T | grep -E "reset|error|fail|timeout" | grep -i pci
[Thu Oct 23 00:42:18 2025] pci 10000:e0:1a.0: BAR 13: failed to assign [io size 0x1000]
[Thu Oct 23 00:42:18 2025] pci 10000:e0:1b.4: BAR 13: failed to assign [io size 0x1000]
[Thu Oct 23 00:44:12 2025] pcieport 0000:00:01.0: AER: Multiple Corrected error message received from 0000:00:01.0
[Thu Oct 23 00:44:13 2025] pcieport 0000:00:01.0: device [8086:a70d] error status/mask=00002001/00002000
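To make the relevant lines easier to spot across several crashes, the output above could be filtered down to just the NVRM Xid reports and the PCIe bus errors. This is a rough sketch only: the regular expressions are written against the exact line formats shown in these logs and may need adjusting for other driver or kernel versions. The names `XID_RE`, `AER_RE` and `summarize` are hypothetical.

```python
import re

# Matches lines such as:
# [Thu Oct 23 00:44:13 2025] NVRM: Xid (PCI:0000:01:00): 79, pid=..., GPU has fallen off the bus.
XID_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<detail>.*)$"
)

# Matches lines such as:
# [Thu Oct 23 00:44:13 2025] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
AER_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+pcieport (?P<port>[0-9a-fA-F:.]+): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+)"
)


def summarize(lines):
    """Return (kind, time, code_or_severity, detail) tuples for matching lines."""
    events = []
    for line in lines:
        line = line.strip()
        m = XID_RE.match(line)
        if m:
            events.append(("xid", m["time"], int(m["xid"]), m["detail"]))
            continue
        m = AER_RE.match(line)
        if m:
            events.append(("aer", m["time"], m["severity"], m["type"].strip()))
    return events
```

For example, after saving the log with `sudo dmesg -T > dmesg.txt`, calling `summarize(open("dmesg.txt"))` should return one entry for the corrected Physical Layer error on 0000:00:01.0 and one for the Xid 79 report.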


