#### Training parameters
```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
eval_dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft
report_to: tensorboard
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
do_eval: true
predict_with_generate: true
#val_size: 0.1
per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
```
The error message is as follows:

```
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-28 08:26:53,453 >> Num examples = 91
[INFO|trainer.py:3824] 2024-08-28 08:26:53,453 >> Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 28, in
[rank2]: main()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 19, in main
[rank2]: run_exp()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 107, in run_sft
[rank2]: metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3666, in evaluate
[rank2]: output = eval_loop(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank2]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 99, in prediction_step
[rank2]: loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank2]: generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
[rank2]: result = self._sample(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
[rank2]: outputs = self(**model_inputs, return_dict=True)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
[rank2]: outputs = self.model(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
[rank2]: layer_outputs = decoder_layer(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
[rank2]: hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
[rank2]: attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank2]: RuntimeError: The expanded size of the tensor (32) must match the existing size (31) at non-singleton dimension 3. Target sizes: [1, 32, 1, 32]. Tensor sizes: [1, 1, 1, 31]
```
Run command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 4 src/train.py examples/demo/llama3_full_sft_ds3.yaml
```
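For context, the failure happens inside `torch.nn.functional.scaled_dot_product_attention`: during incremental decoding the attention mask covers one position fewer (31) than the cached keys (32), so it cannot be broadcast. A minimal, self-contained sketch of the same shape mismatch (the shapes are hypothetical and unrelated to LLaMA-Factory internals):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes mirroring the error message: 32 heads, a single query token
# (incremental decoding), 32 cached key/value positions, but a mask covering only 31.
query = torch.randn(1, 32, 1, 64)                        # [batch, heads, q_len, head_dim]
key = torch.randn(1, 32, 32, 64)                         # [batch, heads, kv_len, head_dim]
value = torch.randn(1, 32, 32, 64)
attn_mask = torch.zeros(1, 1, 1, 31, dtype=torch.bool)   # one position short of kv_len

# The mask cannot be broadcast to [1, 32, 1, 32], so this raises a RuntimeError about
# mismatched sizes at dimension 3 (exact wording depends on the PyTorch backend),
# the same class of failure as in the traceback above.
out = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```

This only illustrates the shape mismatch itself; why the mask and the key/value cache fall out of sync is discussed in the comments below.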
I tried the qwen2 model as well and it reports the same error. The same models fine-tune normally with LoRA and QLoRA, and full fine-tuning of the Bloomz-series models also runs normally.
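For reference, a sketch of the change this observation implies, switching the same run to LoRA (only the method section is shown; the `lora_rank` and `lora_target` values are illustrative assumptions, not taken from the report):

```yaml
### method (LoRA fine-tuning, which the commenter reports runs normally)
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8        # illustrative value
lora_target: all    # illustrative value
```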
After testing, this error is caused by DeepSpeed ZeRO-3: running evaluation with predict_with_generate=True in that mode raises the error.
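If generation-based metrics are not strictly needed, a possible workaround (a sketch only, not an official fix) is to keep ZeRO-3 for training but evaluate with loss only, i.e. drop predict_with_generate from the eval section of the config above:

```yaml
### eval (hypothetical loss-only evaluation, avoiding generation under ZeRO-3)
do_eval: true
per_device_eval_batch_size: 1
# predict_with_generate: true   # disabled: generation during evaluation is what triggers the error with ZeRO-3
```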
DeepSpeed ZeRO-3 is not supported.
aa1afdc
Is it that the llama and qwen models themselves don't support it? Some models, like the Bloomz series, can run through successfully.
DeepSpeed ZeRO-2 doesn't seem to be supported either.