
DRAM width sharded tensors failing on blackhole #18144

Open
mtairum opened this issue Feb 21, 2025 · 3 comments
Labels: blackhole, bug, llama3, P0

Comments

mtairum (Contributor) commented Feb 21, 2025

Part of the Llama3 end-to-end testing on blackhole: #18135

We rely on DRAM-sharded matmuls for our 3 MLP matmuls.

Our codebase parametrizes most of our memory configurations; the DRAM-sharded configs expect 12 DRAM cores, but I believe BH only has 8.
After changing this, I see the following error when creating a tensor with the sharded specs:

E RuntimeError: TT_FATAL @ /localdev/mtairum/tt-metal/ttnn/cpp/ttnn/operations/data_movement/sharded/interleaved_to_sharded/device/interleaved_to_sharded_op.cpp:21: this->output_mem_config.buffer_type == BufferType::L1
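
As an aside, the DRAM core count can be queried from the device at runtime instead of being hardcoded. A minimal sketch, assuming a live ttnn device handle named device (variable names are illustrative):

    # Query the DRAM grid instead of hardcoding 12 (WH) vs. 8 (BH).
    dram_grid = device.dram_grid_size()          # a ttnn.CoreCoord
    num_dram_cores = dram_grid.x * dram_grid.y   # expected: 12 on WH, 8 on BH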

How to reproduce

I found we have a unit test that's a good proxy for the issue above:

pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py::test_llama_mlp_width_sharded_to_interleaved_pcc_err

The dram_core_range_set is set to 12 cores, so some changes will be needed here. Below is a modified version of the test above, with the core range reduced to 8 cores and the shard specs adjusted to shard over those 8 cores.

    # Span all DRAM cores on the target device (8 on BH) rather than the
    # previously hardcoded 12 (ttnn.CoreCoord(11, 0)).
    dram_core_range_set = ttnn.CoreRangeSet(
        {
            ttnn.CoreRange(
                ttnn.CoreCoord(0, 0),
                ttnn.CoreCoord(device.dram_grid_size().x - 1, device.dram_grid_size().y - 1),
            ),
        }
    )
    # Shard specs sized for 8 DRAM cores.
    w1_w3_mem_config = ttnn.MemoryConfig(
        ttnn.TensorMemoryLayout.WIDTH_SHARDED,
        ttnn.BufferType.DRAM,
        ttnn.ShardSpec(dram_core_range_set, (4096, 448), ttnn.ShardOrientation.ROW_MAJOR),
    )
    w2_mem_config = ttnn.MemoryConfig(
        ttnn.TensorMemoryLayout.WIDTH_SHARDED,
        ttnn.BufferType.DRAM,
        ttnn.ShardSpec(dram_core_range_set, (3584, 512), ttnn.ShardOrientation.ROW_MAJOR),
    )
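
Note on the shard shapes: with 8 DRAM cores, these per-core widths imply total widths of 8 * 448 = 3584 and 8 * 512 = 4096, which presumably match the widths of the transposed w1/w3 and w2 test weights respectively.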

Once the above passes, you can test the full MLP block used in Llama.
You will need to change conftest to accept blackhole; see the example in this commit: main...mtairum/llama3-blackhole

export HF_MODEL=meta-llama/Llama-3.1-8B-Instruct
#export HF_MODEL=meta-llama/Llama-3.2-1B-Instruct  # Also try 1B/3B, as they have different shapes and should also be supported.

# Run the decode test
pytest models/demos/llama3/tests/test_llama_mlp.py -k 32-1
# Run the prefill test
pytest models/demos/llama3/tests/test_llama_mlp.py -k 32768-1

Full error traceback

_________________________________________________________________________ test_llama_mlp_width_sharded_to_interleaved_pcc_err[32] __________________________________________________________________________
tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py:2468: in test_llama_mlp_width_sharded_to_interleaved_pcc_err
    tt_w1 = as_sharded_tensor(w1.t(), ttnn.bfloat8_b, dim=-1, mem_config=w1_w3_mem_config)
tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py:2460: in <lambda>
    as_sharded_tensor = lambda w, type, dim, mem_config: ttnn.as_tensor(
ttnn/ttnn/decorators.py:333: in __call__
    return self.function(*function_args, **function_kwargs)
ttnn/ttnn/operations/core.py:600: in as_tensor
    return torch_to_ttnn(tensor, dtype, layout, device, memory_config, mesh_mapper)
ttnn/ttnn/operations/core.py:589: in torch_to_ttnn
    tensor = ttnn.from_torch(
ttnn/ttnn/decorators.py:333: in __call__
    return self.function(*function_args, **function_kwargs)
ttnn/ttnn/operations/core.py:237: in from_torch
    tensor = ttnn.to_device(tensor, device, memory_config=memory_config, cq_id=cq_id)
ttnn/ttnn/decorators.py:333: in __call__
    return self.function(*function_args, **function_kwargs)
E   RuntimeError: TT_FATAL @ /localdev/mtairum/tt-metal/ttnn/cpp/ttnn/operations/data_movement/sharded/interleaved_to_sharded/device/interleaved_to_sharded_op.cpp:21: this->output_mem_config.buffer_type == BufferType::L1
E   info:
E   Error
E   backtrace:
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xe96b09) [0x7ff698547b09]
E    --- ttnn::operations::data_movement::InterleavedToShardedDeviceOperation::validate(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail23launch_on_worker_threadIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEENS3_3stl10StrongTypeIhNS_10QueueIdTagEEElNS5_15DeviceOperationISC_EENSD_13tensor_args_tESC_PNS4_2v07IDeviceEEEvT0_T1_RKT2_RKT3_RT4_RT5_+0x398) [0x7ff69a7a51e8]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail23launch_on_single_deviceIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEEEENT_21tensor_return_value_tENS3_3stl10StrongTypeIhNS_10QueueIdTagEEERKNSE_22operation_attributes_tERKNSE_13tensor_args_tE+0xe4) [0x7ff69a7a4d24]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f3b9f) [0x7ff69a7a4b9f]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn16device_operation6detail6invokeIN2tt8tt_metal9operation23OldInfraDeviceOperationINSt3__16vectorINS4_6TensorENS7_9allocatorIS9_EEEEEEEENT_21tensor_return_value_tENS3_3stl10StrongTypeIhNS_10QueueIdTagEEERKNSE_22operation_attributes_tERKNSE_13tensor_args_tE+0x1f6) [0x7ff69a7a4496]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f2d6f) [0x7ff69a7a3d6f]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x30f2985) [0x7ff69a7a3985]
E    --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&, tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>)
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf1a7e3) [0x7ff6985cb7e3]
E    --- ttnn::operations::data_movement::InterleavedToShardedOperation::invoke(tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>, tt::tt_metal::Tensor const&, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::DataType> const&, std::__1::optional<bool> const&)
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf77124) [0x7ff698628124]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76eeb) [0x7ff698627eeb]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76e42) [0x7ff698627e42]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x3101b75) [0x7ff69a7b2b75]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0x31022c7) [0x7ff69a7b32c7]
E    --- /localdev/mtairum/tt-metal/build_Release_tracy/lib/libtt_metal.so(+0x19c02f) [0x7ff696d1f02f]
E    --- void tt::tt_metal::operation::launch_op_func<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf7693a) [0x7ff69862793a]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf76277) [0x7ff698627277]
E    --- /localdev/mtairum/tt-metal/ttnn/ttnn/_ttnn.so(+0xf75cf8) [0x7ff698626cf8]
E    --- ttnn::operations::core::to_device(tt::tt_metal::Tensor const&, tt::tt_metal::v0::IDevice*, std::__1::optional<tt::tt_metal::MemoryConfig> const&, tt::stl::StrongType<unsigned char, ttnn::QueueIdTag>)    
bbradelTT (Contributor) commented Feb 21, 2025

The failure is in i2s (interleaved_to_sharded). Reassigning to @llongTT on @ntarafdar's team.

mtairum changed the title from "DRAM-sharded matmul failing on blackhole" to "Creating sharded tensor failing on blackhole" on Feb 21, 2025
llongTT (Contributor) commented Feb 21, 2025

Looking into it.

mtairum changed the title from "Creating sharded tensor failing on blackhole" to "DRAM width sharded tensors failing on blackhole" on Feb 21, 2025
mtairum (Contributor, Author) commented Feb 21, 2025

Added changes to tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py::test_llama_mlp_width_sharded_to_interleaved_pcc_err to use however many DRAM cores the target architecture has, and updated the shard specs to account for 8 DRAM cores instead of 12.

When this test is passing, we will need to update it to derive the shard spec from the number of DRAM cores, along the lines of the sketch below.
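
A minimal sketch of deriving the spec from the device, assuming ttnn is imported, device is a live device handle, and 32-wide tiles; the helper name and TILE constant are hypothetical, not part of the test:

    import math

    TILE = 32  # assumption: shard widths must be tile-aligned

    def dram_width_shard_spec(height, width, device):
        # Width-shard a (height, width) tensor across every DRAM core,
        # padding the per-core shard width up to a whole number of tiles.
        grid = device.dram_grid_size()
        num_cores = grid.x * grid.y
        shard_width = math.ceil(width / (num_cores * TILE)) * TILE
        dram_core_range_set = ttnn.CoreRangeSet(
            {
                ttnn.CoreRange(
                    ttnn.CoreCoord(0, 0),
                    ttnn.CoreCoord(grid.x - 1, grid.y - 1),
                ),
            }
        )
        return ttnn.ShardSpec(
            dram_core_range_set, (height, shard_width), ttnn.ShardOrientation.ROW_MAJOR
        )

On 8 cores this reproduces the shapes above: ceil(3584 / (8 * 32)) * 32 = 448 for w1/w3 and ceil(4096 / (8 * 32)) * 32 = 512 for w2.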
