Part of the Llama3 end-to-end testing on blackhole: #18135
We rely on DRAM-sharded matmuls for our 3 MLP matmuls.
Our codebase parametrizes most of our memory configurations, and for DRAM-sharded it expects 12 DRAM cores, but Blackhole appears to have only 8.
After changing this, I see the following error when creating a tensor with sharded specs:
E RuntimeError: TT_FATAL @ /localdev/mtairum/tt-metal/ttnn/cpp/ttnn/operations/data_movement/sharded/interleaved_to_sharded/device/interleaved_to_sharded_op.cpp:21: this->output_mem_config.buffer_type == BufferType::L1
How to reproduce
I found a unit test that's a good proxy for the issue above:
pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py::test_llama_mlp_width_sharded_to_interleaved_pcc_err
The dram_core_range_set there is set to 12 cores, so some changes will be needed. Below is a change to that test, setting the core range to 8 cores and the shard specs to shard over those 8 cores.
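For reference, the per-core shard width changes when moving from 12 to 8 DRAM cores. A minimal sketch of the arithmetic in plain Python (the 32x32 tile size and the pad-to-tile rule are assumptions based on how width-sharded layouts are usually padded, not taken from the test itself):

```python
import math

TILE = 32  # assumed tile edge length for sharded layouts


def dram_shard_width(total_width: int, num_dram_cores: int) -> int:
    """Per-core shard width, padded up to a whole tile."""
    per_core = math.ceil(total_width / num_dram_cores)
    return math.ceil(per_core / TILE) * TILE


# Llama-3.1-8B's MLP intermediate dim is 14336
print(dram_shard_width(14336, 12))  # 1216 per core on 12 DRAM cores
print(dram_shard_width(14336, 8))   # 1792 per core on 8 DRAM cores
```

Note that with 8 cores the width divides evenly into tiles (14336 / 8 = 1792 = 56 tiles), so no padding is needed there.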
Once the above passes, you can test the full MLP block used in Llama.
You will also need to change conftest to accept Blackhole; see this commit for an example: main...mtairum/llama3-blackhole
export HF_MODEL=meta-llama/Llama-3.1-8B-Instruct
#export HF_MODEL=meta-llama/Llama-3.2-1B-Instruct # Also try 1B/3B, as they have different shapes and should also be supported.
# Run the decode test
pytest models/demos/llama3/tests/test_llama_mlp.py -k 32-1
# Run the prefill test
pytest models/demos/llama3/tests/test_llama_mlp.py -k 32768-1
Full error traceback
_________________________________________________________________________ test_llama_mlp_width_sharded_to_interleaved_pcc_err[32] __________________________________________________________________________
tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py:2468: in test_llama_mlp_width_sharded_to_interleaved_pcc_err
tt_w1 = as_sharded_tensor(w1.t(), ttnn.bfloat8_b, dim=-1, mem_config=w1_w3_mem_config)
tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py:2460: in <lambda>
as_sharded_tensor = lambda w, type, dim, mem_config: ttnn.as_tensor(
ttnn/ttnn/decorators.py:333: in __call__
return self.function(*function_args, **function_kwargs)
ttnn/ttnn/operations/core.py:600: in as_tensor
return torch_to_ttnn(tensor, dtype, layout, device, memory_config, mesh_mapper)
ttnn/ttnn/operations/core.py:589: in torch_to_ttnn
tensor = ttnn.from_torch(
ttnn/ttnn/decorators.py:333: in __call__
return self.function(*function_args, **function_kwargs)
ttnn/ttnn/operations/core.py:237: in from_torch
tensor = ttnn.to_device(tensor, device, memory_config=memory_config, cq_id=cq_id)
ttnn/ttnn/decorators.py:333: in __call__
return self.function(*function_args, **function_kwargs)
E RuntimeError: TT_FATAL @ /localdev/mtairum/tt-metal/ttnn/cpp/ttnn/operations/data_movement/sharded/interleaved_to_sharded/device/interleaved_to_sharded_op.cpp:21: this->output_mem_config.buffer_type == BufferType::L1
E info:
E Error
E backtrace:
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xe96b09) [0x7ff698547b09]
E --- ttnn::operations::data_movement::InterleavedToShardedDeviceOperation::validate(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail23launch_on_worker_threadIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEENS3_3stl10StrongTypeIhNS_10QueueIdTagEEElNS5_15DeviceOperationISC_EENSD_13tensor_args_tESC_PNS4_2v07IDeviceEEEvT0_T1_RKT2_RKT3_RT4_RT5_+0x398) [0x7ff69a7a51e8]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail23launch_on_single_deviceIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEEEENT_21tensor_return_value_tENS3_3stl10StrongTypeIhNS_10QueueIdTagEEERKNSE_22operation_attributes_tERKNSE_13tensor_args_tE+0xe4) [0x7ff69a7a4d24]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f3b9f) [0x7ff69a7a4b9f]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail6invokeIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEEEENT_21tensor_return_value_tENS3_3stl10StrongTypeIhNS_10QueueIdTagEEERKNSE_22operation_attributes_tERKNSE_13tensor_args_tE+0x1f6) [0x7ff69a7a4496]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f2d6f) [0x7ff69a7a3d6f]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f2985) [0x7ff69a7a3985]
E --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&, tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>)
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf1a7e3) [0x7ff6985cb7e3]
E --- ttnn::operations::data_movement::InterleavedToShardedOperation::invoke(tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>, tt::tt_metal::Tensor const&, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::DataType> const&, std::__1::optional<bool> const&)
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf77124) [0x7ff698628124]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76eeb) [0x7ff698627eeb]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76e42) [0x7ff698627e42]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x3101b75) [0x7ff69a7b2b75]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x31022c7) [0x7ff69a7b32c7]
E --- /localdev/mtairum/tt-metal/build_Release_tracy/lib/libtt_metal.so(+0x19c02f) [0x7ff696d1f02f]
E --- void tt::tt_metal::operation::launch_op_func<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf7693a) [0x7ff69862793a]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76277) [0x7ff698627277]
E --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf75cf8) [0x7ff698626cf8]
E --- ttnn::operations::core::to_device(tt::tt_metal::Tensor const&, tt::tt_metal::v0::IDevice*, std::__1::optional<tt::tt_metal::MemoryConfig> const&, tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>)
Added changes to tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py::test_llama_mlp_width_sharded_to_interleaved_pcc_err to use the number of DRAM cores the target architecture has, and updated the shard specs to account for 8 DRAM cores instead of 12.
Once this test passes, it will need to be updated to derive the shard spec from the number of DRAM cores.
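Deriving the shard spec from the DRAM core count could be sketched roughly as below. This is a hypothetical, framework-agnostic sketch: the ttnn objects (CoreRangeSet, ShardSpec) are stood in for by plain tuples, the helper name is illustrative, and the 32x32 tile padding rule is an assumption:

```python
import math

TILE = 32  # assumed tile edge length


def dram_shard_spec(weight_shape, num_dram_cores):
    """Build a (core_grid, shard_shape) pair for a width-sharded weight.

    weight_shape: (K, N) of the (already transposed) weight; sharded along N.
    Returns a list of (x, y) core coordinates and the per-core shard shape.
    """
    k, n = weight_shape
    # one-row grid of DRAM cores: (0, 0) .. (num_dram_cores - 1, 0)
    core_grid = [(x, 0) for x in range(num_dram_cores)]
    shard_w = math.ceil(math.ceil(n / num_dram_cores) / TILE) * TILE
    return core_grid, (k, shard_w)


# e.g. a 4096 x 14336 w1/w3 weight on 12 (Wormhole) vs 8 (Blackhole) DRAM cores
print(dram_shard_spec((4096, 14336), 12)[1])  # (4096, 1216)
print(dram_shard_spec((4096, 14336), 8)[1])   # (4096, 1792)
```

In the real test, the core count would come from the device (rather than being hardcoded to 12), and the returned pieces would feed the ttnn CoreRangeSet and ShardSpec constructors.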