{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T22:40:08Z","timestamp":1776465608718,"version":"3.51.2"},"reference-count":31,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T00:00:00Z","timestamp":1717459200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Large vision-language models, such as Contrastive Vision-Language Pre-training (CLIP), pre-trained on large-scale image\u2013text datasets, have demonstrated robust zero-shot transfer capabilities across various downstream tasks. To further enhance the few-shot recognition performance of CLIP, Tip-Adapter augments the CLIP model with an adapter that incorporates a key-value cache model constructed from the few-shot training set. This approach enables training-free adaptation and has shown significant improvements in few-shot recognition, especially with additional fine-tuning. However, the size of the adapter increases in proportion to the number of training samples, making it difficult to deploy in practical applications. In this paper, we propose a novel CLIP adaptation method, named Proto-Adapter, which employs a single-layer adapter of constant size regardless of the amount of training data and even outperforms Tip-Adapter. Proto-Adapter constructs the adapter\u2019s weights based on prototype representations for each class. By aggregating the features of the training samples, it successfully reduces the size of the adapter without compromising performance. Moreover, the performance of the model can be further enhanced by fine-tuning the adapter\u2019s weights using a distance margin penalty, which imposes additional inter-class discrepancy to the output logits. We posit that this training scheme allows us to obtain a model with a discriminative decision boundary even when trained with a limited amount of data. We demonstrate the effectiveness of the proposed method through extensive experiments of few-shot classification on diverse datasets.<\/jats:p>","DOI":"10.3390\/s24113624","type":"journal-article","created":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T05:17:30Z","timestamp":1717478250000},"page":"3624","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Proto-Adapter: Efficient Training-Free CLIP-Adapter for Few-Shot Image Classification"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-2366-7601","authenticated-orcid":false,"given":"Naoki","family":"Kato","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Kanagawa, Japan"}]},{"given":"Yoshiki","family":"Nota","sequence":"additional","affiliation":[{"name":"Meidensha Corporation, Tokyo 141-6029, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7361-0027","authenticated-orcid":false,"given":"Yoshimitsu","family":"Aoki","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Kanagawa, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2024,6,4]]},"reference":[{"key":"ref_1","unstructured":"DeVries, T., and Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv.","DOI":"10.1007\/978-1-4899-7687-1_79"},{"key":"ref_3","unstructured":"Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014, January 8\u201313). How transferable are features in deep neural networks?. Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada."},{"key":"ref_4","unstructured":"Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C.F., and Huang, J.B. (2019). A closer look at few-shot classification. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., and Isola, P. (2020, January 23\u201328). Rethinking few-shot image classification: A good embedding is all you need?. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XIV 16.","DOI":"10.1007\/978-3-030-58568-6_16"},{"key":"ref_6","unstructured":"Finn, C., Abbeel, P., and Levine, S. (2017, January 6\u201311). Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia."},{"key":"ref_7","unstructured":"Snell, J., Swersky, K., and Zemel, R. (2017, January 4\u20139). Prototypical networks for few-shot learning. Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Lin, H., Han, G., Ma, J., Huang, S., Lin, X., and Chang, S.F. (2023, January 17\u201324). Supervised masked knowledge distillation for few-shot transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01882"},{"key":"ref_9","unstructured":"Vinyals, O., Blundell, C., Lillicrap, T., and Wierstra, D. (2016, January 5\u201310). Matching networks for one shot learning. Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Li, F.-F. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_11","unstructured":"Bertinetto, L., Henriques, J.F., Torr, P.H., and Vedaldi, A. (2018). Meta-learning with differentiable closed-form solvers. arXiv."},{"key":"ref_12","unstructured":"Krizhevsky, A., and Hinton, G. (2023, September 01). Learning Multiple Layers of Features from Tiny Images; Technical Report. Available online: http:\/\/www.cs.utoronto.ca\/~kriz\/learning-features-2009-TR.pdf."},{"key":"ref_13","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1007\/s11263-023-01891-x","article-title":"Clip-adapter: Better vision-language models with feature adapters","volume":"132","author":"Gao","year":"2023","journal-title":"Int. J. Comput. Vis."},{"key":"ref_15","unstructured":"Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019, January 9\u201315). Parameter-efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_17","unstructured":"Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., and Li, H. (2021). Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019, January 15\u201320). Arcface: Additive angular margin loss for deep face recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00482"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Krause, J., Stark, M., Deng, J., and Li, F.-F. (2013, January 2\u20138). 3d object representations for fine-grained categorization. Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia.","DOI":"10.1109\/ICCVW.2013.77"},{"key":"ref_20","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_21","unstructured":"Li, F.-F., Fergus, R., and Perona, P. (July, January 27). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Nilsback, M.E., and Zisserman, A. (2008, January 16\u201319). Automated flower classification over a large number of classes. Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India.","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., and Torralba, A. (2010, January 13\u201318). Sun database: Large-scale scene recognition from abbey to zoo. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5539970"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. (2014, January 23\u201328). Describing textures in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.461"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"2217","DOI":"10.1109\/JSTARS.2019.2918242","article-title":"Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification","volume":"12","author":"Helber","year":"2019","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_26","unstructured":"Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Parkhi, O.M., Vedaldi, A., Zisserman, A., and Jawahar, C. (2012, January 16\u201321). Cats and dogs. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"ref_28","unstructured":"Bossard, L., Guillaumin, M., and Van Gool, L. (2014). Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6\u201312, 2014, Proceedings, Part VI 13, Springer."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_30","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_31","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/11\/3624\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:53:22Z","timestamp":1760108002000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/11\/3624"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,4]]},"references-count":31,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,6]]}},"alternative-id":["s24113624"],"URL":"https:\/\/doi.org\/10.3390\/s24113624","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,4]]}}}