Reminder

System Info

llamafactory version: 0.9.2.dev0
Problem description

Training deepseek3 with version 0.9.2 of the library hangs at the position shown in the log below. After switching to a qwen 7b model, training hangs at the same position. Going back to the previous container environment with version 0.9.1, the same launch command and config file train normally. I currently suspect a version problem with the multi-node communication libraries in the environment, such as deepspeed. When training deepseek3 with the updated 0.9.2 release, is something wrong with my dependency versions?
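As a first step in comparing the two containers, here is a minimal sketch (using only the standard library's importlib.metadata; the package list is my guess at the relevant suspects) that dumps the versions to diff between the 0.9.1 and 0.9.2 environments:

```python
# Print versions of the packages most likely involved in the multi-node path,
# so the working (0.9.1) and hanging (0.9.2) containers can be diffed directly.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("llamafactory", "torch", "deepspeed", "transformers", "accelerate", "triton"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

The log from the hanging 0.9.2 run follows: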
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2025-02-13 12:14:38,487] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[INFO|2025-02-13 12:14:40] llamafactory.cli:157 >> Initializing distributed tasks at: 10.126.218.43:29500
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792]
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
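The run hangs right after this torch.distributed banner, before any training output appears. To separate a network/NCCL problem from a llamafactory 0.9.2 problem, a minimal cross-node connectivity check, as a sketch (assuming the same torchrun rendezvous settings as the training launch; this script is not part of LLaMA-Factory):

```python
# ddp_check.py -- launch on every node with the same torchrun rendezvous as training, e.g.:
#   torchrun --nnodes=2 --nproc-per-node=8 \
#     --master-addr=10.126.218.43 --master-port=29500 ddp_check.py
# If this all_reduce also hangs, the fault is in the NCCL/network setup,
# not in the 0.9.2 training code.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # rendezvous via torchrun env vars (env://)
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # first cross-node collective; the likely hang point
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok, x={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Running with NCCL_DEBUG=INFO set in the environment also makes NCCL print its transport selection, which helps pinpoint where the rendezvous stalls.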
Reproduction
Others
No response