deep-text-recognition-benchmark

Is the network suitable for long-text recognition?

Open WudiJoey opened this issue 4 years ago • 6 comments

Thanks for your work! I read your paper and noticed that input images are resized to [224, 224]. For long text lines, does this affect the accuracy? Looking forward to your reply!

WudiJoey avatar Jun 15 '21 12:06 WudiJoey

Adding: the width of a text image is often much greater than its height. Can the image information be preserved as much as possible if the image is resized to a square? Looking forward to your reply~

WudiJoey avatar Jun 16 '21 01:06 WudiJoey

Hi, the resized images (224x224) are still human-readable. The attention maps on the square images also appear to give proper weight to each character region. Beyond that, there is no empirical evidence on how the resizing affects accuracy. An alternative is to resize to (100, 32) and pad up to 224x224.
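For reference, the geometry of that alternative can be sketched in a few lines. This is only an illustration, not code from the repository: the helper names `pad_offset` and `fit_keep_aspect` are hypothetical, centring the padded image is an assumption (the reply does not say where the padding goes), and the 1000x32 text line is a made-up example.

```python
def pad_offset(inner_w, inner_h, canvas=224):
    """Top-left corner that centres an inner_w x inner_h image on a square canvas.

    Centring is an assumption; the padding could equally go on one side.
    """
    if inner_w > canvas or inner_h > canvas:
        raise ValueError("inner image does not fit on the canvas")
    return (canvas - inner_w) // 2, (canvas - inner_h) // 2


def fit_keep_aspect(src_w, src_h, canvas=224):
    """Largest size with the source aspect ratio that still fits the canvas."""
    scale = min(canvas / src_w, canvas / src_h)
    return max(1, round(src_w * scale)), max(1, round(src_h * scale))


# Fixed (100, 32) resize, then centred padding to 224x224.
print(pad_offset(100, 32))        # (62, 96)

# A hypothetical 1000x32 text line scaled to fit without distortion.
print(fit_keep_aspect(1000, 32))  # (224, 7)
```

The second call shows why long lines are awkward here: keeping the aspect ratio of a 1000x32 line leaves only about 7 pixels of height inside a 224-pixel-wide canvas.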

roatienza avatar Jun 16 '21 09:06 roatienza

Thanks for your reply~ I will try your work.

WudiJoey avatar Jun 16 '21 09:06 WudiJoey

I'm trying to resize a very long sentence. I resized the image to a fixed aspect ratio with a height of 32 and padded it to (224, 224); the result looks like the image below. @WudiJoey, have you ever tried training on long, wide images? Does it affect the accuracy even if the image is squeezed like this? [image: Screen Shot 2021-11-10 at 12 00 15]

luvwinnie avatar Nov 10 '21 03:11 luvwinnie

> I'm trying to resize a very long sentence. I resized the image to a fixed aspect ratio with a height of 32 and padded it to (224, 224); the result looks like the image below. @WudiJoey, have you ever tried training on long, wide images? Does it affect the accuracy even if the image is squeezed like this? [image: Screen Shot 2021-11-10 at 12 00 15]

I haven't tried your resize method, because I think the large blank area may introduce useless information. I just resize my images to a square directly, and it works. But I think there is a better way to process these long, wide images, such as cutting the image into segments and arranging them in rows.
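A rough sketch of how the cut-and-arrange idea could be planned, purely as an illustration and not something tested on this model. The function name `plan_rows`, the 32-pixel row height, and cutting at fixed column positions are all assumptions; fixed cuts can split a character in half, so a real pipeline would probably cut at gaps between characters.

```python
import math


def plan_rows(src_w, src_h, canvas=224, row_h=32):
    """Plan how to wrap one long text line into stacked rows on a square canvas.

    Returns (scaled_w, boxes). Each box is (crop_left, crop_right, paste_top),
    measured on the line after it has been scaled to a height of row_h.
    """
    scaled_w = max(1, round(src_w * row_h / src_h))
    n_rows = math.ceil(scaled_w / canvas)
    if n_rows * row_h > canvas:
        raise ValueError("text line too long for this canvas and row height")
    boxes = [
        (r * canvas, min((r + 1) * canvas, scaled_w), r * row_h)
        for r in range(n_rows)
    ]
    return scaled_w, boxes


# A hypothetical 1000x32 line wraps into 5 rows of at most 224 pixels each.
scaled_w, boxes = plan_rows(1000, 32)
for left, right, top in boxes:
    print(f"crop x=[{left}, {right}) -> paste at y={top}")
```

With these defaults the canvas holds 7 rows, so a line up to 1568 pixels wide (at height 32) fits without horizontal squeezing; anything longer raises an error instead of being silently squeezed.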

WudiJoey avatar Nov 15 '21 03:11 WudiJoey

Thank you for the reply! Cutting the image and arranging the pieces in rows seems like a very good approach; I would like to give it a try.

Hmm... however, the input size currently seems to be fixed by the base VisionTransformer. Maybe we should find a way to handle variable-size images, as convolutions do... Maybe the base Vision Transformer could be improved by using a more recent vision-transformer-based architecture.
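To make the fixed-input point concrete, here is a small arithmetic sketch. It assumes 16x16 patches, which is the common ViT default but is not stated in this thread; `patch_grid` is a hypothetical helper, not a function from the repository. A ViT splits the image into a grid of patches and learns one position embedding per patch, so the grid shape is tied to the input size used in training.

```python
def patch_grid(img_h, img_w, patch=16):
    """Number of patch rows and columns for an image split into square patches.

    patch=16 is an assumption (a common ViT default), not taken from the thread.
    """
    if img_h % patch or img_w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return img_h // patch, img_w // patch


rows, cols = patch_grid(224, 224)
print(rows, cols, rows * cols)  # 14 14 196
```

Any other input size produces a different grid, which is why the image has to be resized or padded to 224x224 first.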

luvwinnie avatar Nov 15 '21 03:11 luvwinnie