Concat operator fusion leads to inconsistent inference results between CPU and GPU. #24721
Description
System Information
OpenCV version: 4.8.0 / newest 4.x
Operating System / Platform: Ubuntu 20.04
CUDA: 11.8
Graphics card: NVIDIA GeForce RTX 3090
Detailed description
I found that the issue is caused by this line; when I commented it out, my program produced consistent results.
While finding this workaround, I observed the following. After fusion with the Concat operator, the results of the Mul and Sigmoid operators are stored in one contiguous block of memory. However, in some cases Mul cannot use the CUDA backend; for example, when Mul falls back to the default backend, its result is stored in host memory while the Sigmoid result is stored in device memory. Because Mul and Sigmoid indirectly call setHostDirty or setDeviceDirty after their computations complete, only one of these flags finally takes effect on the contiguous memory, so subsequent operations see either the Mul result or the Sigmoid result, but not both.
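The dirty-flag interaction described above can be sketched in plain Python. This is a toy model, not OpenCV's actual internals — the class and method names here are illustrative: two fused layers share one contiguous output buffer with separate host and device copies, each layer marks only its own copy dirty, and a final one-direction synchronization discards the other layer's result.

```python
import numpy as np

class FusedBuffer:
    """Toy model of a fused output tensor with a host and a device copy.
    Illustrative only; not OpenCV's real memory manager."""
    def __init__(self, size):
        self.host = np.zeros(size)
        self.device = np.zeros(size)
        self.host_dirty = False
        self.device_dirty = False

    def write_host(self, sl, values):
        # e.g. Mul running on the default (CPU) backend
        self.host[sl] = values
        self.host_dirty = True             # analogous to setHostDirty()

    def write_device(self, sl, values):
        # e.g. Sigmoid running on the CUDA backend
        self.device[sl] = values
        self.device_dirty = True           # analogous to setDeviceDirty()

    def sync(self):
        # Only ONE direction wins: whichever dirty flag is honored
        # overwrites the whole contiguous buffer, clobbering the half
        # that was written on the other side.
        if self.device_dirty:
            self.host[:] = self.device     # device -> host drops Mul's data
        elif self.host_dirty:
            self.device[:] = self.host

buf = FusedBuffer(8)
buf.write_host(slice(0, 4), 1.0)     # "Mul" result lands in host memory
buf.write_device(slice(4, 8), 2.0)   # "Sigmoid" result lands in device memory
buf.sync()
print(buf.host)  # Mul's half is gone: [0. 0. 0. 0. 2. 2. 2. 2.]
```

After `sync()`, the half written through the host copy has been silently overwritten with stale zeros, which mirrors why only one of the two fused operators' results survives.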
I think my solution is only temporary. Is there a way to fix it permanently?
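For the reproducer below, one way to confirm that only the Mul half of the concatenated output diverges is to compare the two channel halves separately. This is a hypothetical helper in pure NumPy (`split=4` matches the model's four Mul channels); the synthetic demo data stands in for the real CPU/CUDA outputs:

```python
import numpy as np

def diverging_halves(cpu_out, cuda_out, split=4, atol=1e-4):
    """Compare the Mul half (channels [:split]) and the Sigmoid half
    (channels [split:]) of a (1, 2*split, N) concatenated output."""
    mul_ok = bool(np.allclose(cpu_out[:, :split], cuda_out[:, :split], atol=atol))
    sig_ok = bool(np.allclose(cpu_out[:, split:], cuda_out[:, split:], atol=atol))
    return mul_ok, sig_ok

# Synthetic demo: the Sigmoid half matches, but the Mul half was
# overwritten with stale data on one backend.
cpu = np.zeros((1, 8, 10))
cuda = cpu.copy()
cuda[:, :4] += 1.0   # pretend the Mul result never reached the device
print(diverging_halves(cpu, cuda))  # (False, True)
```

A heavier-handed mitigation, untested here, would be `net.enableFusion(False)` on the `cv2.dnn.Net`, which should keep the Mul and Sigmoid outputs in separate buffers at the cost of losing fusion everywhere.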
Steps to reproduce
Export the ONNX model:

```python
import torch
import torch.nn as nn

class MulMat(nn.Module):
    def __init__(self):
        super(MulMat, self).__init__()
        self.yy = torch.randn(1, 8400)
        self.xx = torch.randn(1, 4, 8400)
        self.xxx = torch.randn(1, 4, 8400)

    def forward(self, x):
        x1 = x + self.xx
        x2 = x + self.xxx
        x3 = x2 * self.yy
        x4 = torch.sigmoid(x1)
        x5 = torch.cat((x3, x4), 1)
        return x5

m = MulMat()
torch.onnx.export(m,
                  (torch.randn(1, 4, 8400)),
                  'mulmat.onnx',
                  export_params=True,
                  opset_version=11,
                  input_names=['input0'],
                  output_names=['output0'],
                  )
```

Run OpenCV with the CPU and CUDA backends:
```python
import cv2.dnn
import numpy as np

def main(onnx_model):
    # Load the ONNX model on the default (CPU) backend
    model: cv2.dnn.Net = cv2.dnn.readNetFromONNX(onnx_model)

    # Load the same model on the CUDA backend
    model_cuda: cv2.dnn.Net = cv2.dnn.readNetFromONNX(onnx_model)
    model_cuda.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    model_cuda.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

    np.random.seed(13)
    blob = np.random.randint(0, 255, (1, 4, 8400), dtype=np.uint8) * 1.0 / 255
    # print(blob.shape)
    model.setInput(blob)
    model_cuda.setInput(blob)

    # Perform inference
    outputs = model.forward()
    outputs_cuda = model_cuda.forward()

    print("The results of CPU")
    print(outputs)
    print("\nThe results of CUDA")
    print(outputs_cuda)

    r = np.allclose(outputs[0], outputs_cuda[0], atol=1e-4)
    if r:
        print("CPU and CUDA results are the same")
    else:
        print("CPU and CUDA results are not the same")

if __name__ == '__main__':
    main('mulmat.onnx')
```

Issue submission checklist
- [x] I report the issue; it's not a question
- [x] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc. and have not found any solution
- [x] I updated to the latest OpenCV version and the issue is still there
- [x] There is reproducer code and related data files (videos, images, onnx, etc.)
