I'm building a simple Encoder-Decoder architecture for video with 3D convolutions and transposed convolutions. The aim is to make it fully convolutional so that it works with any video size.
The Encoder looks like this:
padding = int((kernel_size - 1)/2) #kernel_size = 5
self.network = nn.Sequential(
nn.Conv3d(3, 128, kernel_size=kernel_size, stride=1, padding=padding),
nn.BatchNorm3d(128, affine=False),
nn.ReLU(True),
nn.Conv3d(128, 64, kernel_size=kernel_size, stride=2, padding=padding),
nn.BatchNorm3d(64, affine=False),
nn.ReLU(True),
nn.Conv3d(64, 64, kernel_size=kernel_size, stride=2, padding=padding),
nn.BatchNorm3d(64, affine=False),
nn.ReLU(True),
nn.Conv3d(64, 24, kernel_size=kernel_size, stride=1, padding=padding),
nn.BatchNorm3d(24, affine=False),
nn.ReLU(True)
)
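For reference, each Conv3d layer shrinks a spatial dimension according to the standard convolution arithmetic. A quick sketch to sanity-check the shapes (`conv_out` is a helper of mine, not part of the model):

```python
def conv_out(n, k=5, s=2, p=2):
    # Output size of a convolution: floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# With kernel_size=5 and padding=2, the stride-2 layers halve even sizes exactly:
print(conv_out(112))  # 56
print(conv_out(16))   # 8
# Odd sizes get rounded down:
print(conv_out(342))  # 171
```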
And here is the Decoder:
padding = int((kernel_size - 1)/2) #kernel_size = 5
self.network = nn.Sequential(
nn.ConvTranspose3d(self.input_channels, 64, kernel_size=kernel_size, stride=1, padding=padding),
nn.BatchNorm3d(64, affine=False),
nn.ReLU(True),
nn.ConvTranspose3d(64, 64, kernel_size=kernel_size, stride=2, padding=padding, output_padding=1),
nn.BatchNorm3d(64, affine=False),
nn.ReLU(True),
nn.ConvTranspose3d(64, 128, kernel_size=kernel_size, stride=2, padding=padding, output_padding=1),
nn.BatchNorm3d(128, affine=False),
nn.ReLU(True),
nn.Conv3d(128, self.output_channels, kernel_size=kernel_size, stride=1, padding=padding),
)
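The transposed convolutions are meant to invert that downsampling. Their output size follows the transposed-convolution formula, where `output_padding` adds extra size on one side (again, `deconv_out` is just an illustrative helper):

```python
def deconv_out(n, k=5, s=2, p=2, op=1):
    # Output size of a transposed convolution:
    # (n - 1) * s - 2*p + k + output_padding
    return (n - 1) * s - 2 * p + k + op

# With output_padding=1, even sizes round-trip exactly:
print(deconv_out(28))  # 56
print(deconv_out(56))  # 112
```

Note that `output_padding=1` is only correct when the encoder input was even; for a dimension that was odd before the stride-2 convolution, it overshoots by one.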
I trained the model with an input size of (C x D x H x W) = (3 x 16 x 112 x 112). The output of each layer looks like this:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv3d-1 [-1, 128, 16, 112, 112] 48,128
BatchNorm3d-2 [-1, 128, 16, 112, 112] 0
ReLU-3 [-1, 128, 16, 112, 112] 0
Conv3d-4 [-1, 64, 8, 56, 56] 1,024,064
BatchNorm3d-5 [-1, 64, 8, 56, 56] 0
ReLU-6 [-1, 64, 8, 56, 56] 0
Conv3d-7 [-1, 64, 4, 28, 28] 512,064
BatchNorm3d-8 [-1, 64, 4, 28, 28] 0
ReLU-9 [-1, 64, 4, 28, 28] 0
Conv3d-10 [-1, 24, 4, 28, 28] 192,024
BatchNorm3d-11 [-1, 24, 4, 28, 28] 0
ReLU-12 [-1, 24, 4, 28, 28] 0
Encoder3D-13 [-1, 24, 4, 28, 28] 0
ConvTranspose3d-16 [-1, 64, 4, 28, 28] 192,064
BatchNorm3d-17 [-1, 64, 4, 28, 28] 0
ReLU-18 [-1, 64, 4, 28, 28] 0
ConvTranspose3d-19 [-1, 64, 8, 56, 56] 512,064
BatchNorm3d-20 [-1, 64, 8, 56, 56] 0
ReLU-21 [-1, 64, 8, 56, 56] 0
ConvTranspose3d-22 [-1, 128, 16, 112, 112] 1,024,128
BatchNorm3d-23 [-1, 128, 16, 112, 112] 0
ReLU-24 [-1, 128, 16, 112, 112] 0
Conv3d-25 [-1, 3, 16, 112, 112] 48,003
Decoder3D-26 [-1, 3, 16, 112, 112] 0
================================================================
After the model was trained, I tested it on a UCF-101 video sequence of size (Frames x Width x Height) = (16 x 342 x 256) with os.environ["CUDA_LAUNCH_BLOCKING"] = "1" and got this error:
Traceback (most recent call last):
File "reconstruct.py", line 80, in <module>
torchsummary.summary(model, dataloader.dataset[0].shape)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torchsummary/torchsummary.py", line 72, in summary
model(*x)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/namle/VCM/E2E/Model/model.py", line 57, in forward
x_hat = self.decoder(y)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/namle/VCM/E2E/AutoEncoder/model3d.py", line 84, in forward
x = self.network(x)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/namle/anaconda3/envs/condapy3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 921, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: an illegal memory access was encountered
However, if I run the code on the CPU instead of CUDA, it outputs a frame sequence of size (344 x 256), which is 2 pixels wider than the input.
Another way to make the code run is to change the output_padding of both transposed-convolution layers in the Decoder to 0 at inference time. Then I get a sequence of size (341 x 253).
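If I'm reading the convolution arithmetic right, both observed output sizes follow directly from the formulas (the helper names here are mine, just for illustration):

```python
def conv_out(n, k=5, s=2, p=2):
    # floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k=5, s=2, p=2, op=1):
    # (n - 1) * s - 2*p + k + output_padding
    return (n - 1) * s - 2 * p + k + op

# Width 342 through the two stride-2 encoder layers:
w = conv_out(conv_out(342))                    # 342 -> 171 -> 86
# Back up through the two stride-2 decoder layers with output_padding=1:
print(deconv_out(deconv_out(w)))               # 86 -> 172 -> 344 (2 pixels too wide)
# ...and with output_padding=0:
print(deconv_out(deconv_out(w, op=0), op=0))   # 86 -> 171 -> 341 (1 pixel short)
```

So neither fixed `output_padding` value can reconstruct an odd-sized dimension exactly; the information about whether the pre-stride size was odd or even is lost in the encoder.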
I hope this has something to do with the convolution arithmetic described in this paper https://arxiv.org/pdf/1603.07285.pdf rather than a bug. I'd appreciate it if you could point me in the right direction, so that my model can take in any video size and reconstruct it to the same size automatically.
cc @ezyang @gchanan @zou3519 @ngimel