Einsum fails on Triton with the ONNX Runtime backend
When I export this model to ONNX and serve it on NVIDIA Triton with the ONNX Runtime backend, I get the following error:
2025-02-11 15:44:11.410228203 [E:onnxruntime:log, cuda_call.cc:123 CudaCall] CUBLAS failure 7: CUBLAS_STATUS_INVALID_VALUE ; GPU=0 ; hostname=9b6c766f16ea ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/math/einsum_utils/einsum_auxiliary_ops.cc ; line=54 ; expr=cublasGemmStridedBatchedHelper( static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cublas_handle_, CUBLAS_OP_N, CUBLAS_OP_N, static_cast(N), static_cast(M), static_cast(K), &one, reinterpret_cast<const CudaT*>(input_2_data), static_cast(N), static_cast(right_stride), reinterpret_cast<const CudaT*>(input_1_data), static_cast(K), static_cast(left_stride), &zero, reinterpret_cast<CudaT*>(output_data), static_cast(N), static_cast(output_stride), static_cast(num_batches), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->GetDeviceProp(), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->UseTF32());
2025-02-11 15:44:11.432230090 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Einsum node. Name:'/Einsum' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cpu/math/einsum_utils/einsum_auxiliary_ops.cc:341 std::unique_ptr<onnxruntime::Tensor> onnxruntime::EinsumOp::MatMul(const onnxruntime::Tensor&, const gsl::span&, const onnxruntime::Tensor&, const gsl::span&, onnxruntime::AllocatorPtr, onnxruntime::concurrency::ThreadPool*, void*, DeviceHelpers::MatMul&) [with T = float; onnxruntime::AllocatorPtr = std::shared_ptr<onnxruntime::IAllocator>; DeviceHelpers::MatMul = std::function<onnxruntime::common::Status(const float*, const float*, float*, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, onnxruntime::concurrency::ThreadPool*, void*)>] Einsum op: Exception during MatMul operation: CUBLAS failure 7: CUBLAS_STATUS_INVALID_VALUE ; GPU=0 ; hostname=9b6c766f16ea ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/math/einsum_utils/einsum_auxiliary_ops.cc ; line=54 ; expr=cublasGemmStridedBatchedHelper( static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cublas_handle_, CUBLAS_OP_N, CUBLAS_OP_N, static_cast(N), static_cast(M), static_cast(K), &one, reinterpret_cast<const CudaT*>(input_2_data), static_cast(N), static_cast(right_stride), reinterpret_cast<const CudaT*>(input_1_data), static_cast(K), static_cast(left_stride), &zero, reinterpret_cast<CudaT*>(output_data), static_cast(N), static_cast(output_stride), static_cast(num_batches), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->GetDeviceProp(), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->UseTF32());
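For context, the failure comes from the CUDA Einsum kernel's batched GEMM call. One workaround I've seen suggested is to rewrite the einsum as an explicit matmul before export, so the ONNX graph contains a MatMul node instead of an Einsum node. A minimal sketch of the equivalence (the `"bij,bjk->bik"` equation is just an assumption for illustration; my model's actual einsum may differ):

```python
import numpy as np

# Hypothetical batched einsum "bij,bjk->bik" (the real model's equation
# isn't shown above). Expressing it as a plain batched matmul means the
# exported ONNX graph uses MatMul and never hits the CUDA Einsum kernel.
a = np.random.rand(2, 3, 4).astype(np.float32)
b = np.random.rand(2, 4, 5).astype(np.float32)

via_einsum = np.einsum("bij,bjk->bik", a, b)
via_matmul = a @ b  # batched matmul over the leading "b" dimension

assert via_matmul.shape == (2, 3, 5)
assert np.allclose(via_einsum, via_matmul)
```

In PyTorch the same idea would be replacing `torch.einsum("bij,bjk->bik", a, b)` with `a @ b` (or `torch.matmul`) before calling `torch.onnx.export`.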
Has anyone been able to serve an ONNX model containing an Einsum node on Triton?
Thanks