- Stalls at the beginning of training: W&B’s multiprocessing can interfere with the multiprocessing from distributed training frameworks.
- Stalls at the end of training: The W&B process doesn’t detect when to exit.
Fix hangs at the start
If your run stalls as training begins, the cause is usually a conflict between W&B’s multiprocessing and the distributed training framework’s multiprocessing. To resolve this, enable W&B Service, which is the default in W&B SDK 0.13.0 and later. If you’re on an older version, upgrade your SDK with pip install --upgrade wandb, or apply the workaround for your version.
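Which fix applies depends on the installed SDK version, so it can help to check it first. The sketch below is illustrative only: `parse_version`, `start_hang_fix`, and `installed_wandb_version` are hypothetical helper names, not part of the W&B SDK, and the version cutoffs are the ones this section states (0.13.0 and 0.12.5).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.12.5' into (0, 12, 5); trailing suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def start_hang_fix(version_text: str) -> str:
    """Map an SDK version string to the start-of-training fix this page describes."""
    version = parse_version(version_text)
    if version >= (0, 13, 0):
        return "W&B Service is already the default"
    if version >= (0, 12, 5):
        return "enable W&B Service explicitly"
    return "set WANDB_START_METHOD"


def installed_wandb_version():
    """Return the installed wandb version string, or None if wandb is not installed."""
    try:
        return metadata.version("wandb")
    except metadata.PackageNotFoundError:
        return None


version = installed_wandb_version()
if version is not None:
    print(version, "->", start_hang_fix(version))
```

Comparing tuples of integers avoids the string-comparison pitfall where "0.12.10" would sort before "0.12.5".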
For SDK versions 0.12.5 through 0.12.x, enable W&B Service explicitly by calling wandb.require("service") at the start of your script.
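A minimal sketch of opting in to W&B Service on these versions might look like the following. The project name `my-project` and the structure of `main()` are assumptions for illustration; the only call specific to this fix is `wandb.require("service")`, made before the run is created.

```python
def main() -> None:
    # Imported inside main() so that importing this file has no side effects.
    import wandb

    # Opt in to W&B Service before any run is created
    # (SDK 0.12.5 through 0.12.x; 0.13.0 and later use it by default).
    wandb.require("service")

    wandb.init(project="my-project")  # hypothetical project name
    # ... the rest of the training script goes here ...
    wandb.finish()
```

Call `main()` from the script's entry point, for example under an `if __name__ == "__main__":` guard, so that worker processes spawned by the training framework do not rerun the setup on import.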
For SDK versions 0.12.4 and earlier, set the WANDB_START_METHOD environment variable to "thread" so that W&B uses multithreading instead of multiprocessing.
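The variable can be exported in the shell that launches training, or set from Python before W&B is initialized. The sketch below takes the second route. It assumes the value `thread`, which is the value documented for this workaround, and `use_thread_start_method` is a hypothetical helper name.

```python
import os


def use_thread_start_method() -> None:
    """Ask legacy W&B SDKs (0.12.4 and earlier) to start with threads."""
    # Set this before wandb.init() is called so the setting is picked up.
    os.environ["WANDB_START_METHOD"] = "thread"


use_thread_start_method()
# Import wandb and call wandb.init(...) after this point.
```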
Fix hangs at the end
If your run stalls after training completes, W&B hasn’t detected that the run is finished. Call wandb.finish() at the end of your training script to signal to W&B that the run is complete.
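One way to make sure the call is always reached is to put it in a `finally` block, as in the sketch below. The `try`/`finally` wrapper, the project name `my-project`, and the placeholder loss are assumptions added for illustration; the source only requires that `wandb.finish()` runs at the end of the script.

```python
def train(num_steps: int = 3) -> None:
    # Imported inside train() so that importing this file has no side effects.
    import wandb

    wandb.init(project="my-project")  # hypothetical project name
    try:
        for step in range(num_steps):
            loss = 1.0 / (step + 1)  # placeholder for a real training step
            wandb.log({"loss": loss})
    finally:
        # Runs even if training raises, so the W&B process is told to exit.
        wandb.finish()
```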