#### Training parameters
```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
eval_dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft
report_to: tensorboard
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
do_eval: true
predict_with_generate: true
#val_size: 0.1
per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
```
The error message is as follows:

```
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-28 08:26:53,453 >> Num examples = 91
[INFO|trainer.py:3824] 2024-08-28 08:26:53,453 >> Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 28, in
[rank2]: main()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 19, in main
[rank2]: run_exp()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 107, in run_sft
[rank2]: metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3666, in evaluate
[rank2]: output = eval_loop(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank2]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 99, in prediction_step
[rank2]: loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank2]: generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
[rank2]: result = self._sample(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
[rank2]: outputs = self(**model_inputs, return_dict=True)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
[rank2]: outputs = self.model(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
[rank2]: layer_outputs = decoder_layer(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
[rank2]: hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
[rank2]: attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank2]: RuntimeError: The expanded size of the tensor (32) must match the existing size (31) at non-singleton dimension 3. Target sizes: [1, 32, 1, 32]. Tensor sizes: [1, 1, 1, 31]
```
Run command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 4 src/train.py examples/demo/llama3_full_sft_ds3.yaml
```
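For context, the failure happens inside `torch.nn.functional.scaled_dot_product_attention`: during incremental decoding the attention mask covers one position fewer (31) than the cached keys (32), so it cannot be broadcast. A minimal, self-contained sketch of the same shape mismatch (the shapes are hypothetical and unrelated to LLaMA-Factory internals):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes mirroring the error message: 32 heads, a single query token
# (incremental decoding), 32 cached key/value positions, but a mask covering only 31.
query = torch.randn(1, 32, 1, 64)                        # [batch, heads, q_len, head_dim]
key = torch.randn(1, 32, 32, 64)                         # [batch, heads, kv_len, head_dim]
value = torch.randn(1, 32, 32, 64)
attn_mask = torch.zeros(1, 1, 1, 31, dtype=torch.bool)   # one position short of kv_len

# The mask cannot be broadcast to [1, 32, 1, 32], so this raises a RuntimeError about
# mismatched sizes at dimension 3 (exact wording depends on the PyTorch backend),
# the same class of failure as in the traceback above.
out = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```

This only illustrates the shape mismatch itself; why the mask and the key/value cache fall out of sync is discussed in the comments below.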
I tried the qwen2 model as well and it reports the same error. The same models fine-tune normally with LoRA and QLoRA, and full fine-tuning of the Bloomz-series models also runs normally.
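For reference, a sketch of the change this observation implies, switching the same run to LoRA (only the method section is shown; the `lora_rank` and `lora_target` values are illustrative assumptions, not taken from the report):

```yaml
### method (LoRA fine-tuning, which the commenter reports runs normally)
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8        # illustrative value
lora_target: all    # illustrative value
```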
After testing, this error is caused by DeepSpeed ZeRO-3: running evaluation with predict_with_generate=True in that mode raises the error.
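If generation-based metrics are not strictly needed, a possible workaround (a sketch only, not an official fix) is to keep ZeRO-3 for training but evaluate with loss only, i.e. drop predict_with_generate from the eval section of the config above:

```yaml
### eval (hypothetical loss-only evaluation, avoiding generation under ZeRO-3)
do_eval: true
per_device_eval_batch_size: 1
# predict_with_generate: true   # disabled: generation during evaluation is what triggers the error with ZeRO-3
```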
DeepSpeed ZeRO-3 is not supported.
aa1afdc
Is it that the llama and qwen models themselves don't support it? Some models, like the Bloomz series, can run through successfully.
DeepSpeed ZeRO-2 doesn't seem to be supported either.