{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T09:53:20Z","timestamp":1762509200853,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T00:00:00Z","timestamp":1737417600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Current applications of large language models (LLMs) in the field of code intelligence face issues related to low tokenization efficiency. This results in longer token sequences for input to source code types, which leads to the waste of contextual resources for large models. Additionally, the existing LLM tokenization technology struggles to ensure the contextual synonymity of variables. To address these problems, we propose a compiler-based compressed input sequence method. We focus on using the compiler\u2019s lexical analyzer for preliminary tokenization of the input statements, followed by tokenization and filtering through the large model\u2019s tokenizer. This approach results in shorter, semantically clearer, and higher-quality embedded token sequences. Then, using a contextual dictionary, the reduced tokens can be restored to their original state in the output statements. The experimental results show that our compressed input sequence method can be run smoothly in code generation scenarios. Compared to the baseline model, the compiler-based tokenization method can reduce the input token count by 33.7%. This study provides new insights for the application of LLMs in the field of code intelligence.<\/jats:p>","DOI":"10.3390\/info16020073","type":"journal-article","created":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T05:47:42Z","timestamp":1737438462000},"page":"73","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Research on Compressed Input Sequences Based on Compiler Tokenization"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-0512-1036","authenticated-orcid":false,"given":"Zhe","family":"Li","sequence":"first","affiliation":[{"name":"School of Software, Beihang University, Beijing 100191, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1652-3434","authenticated-orcid":false,"given":"Xinxi","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Software, Beihang University, Beijing 100191, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,1,21]]},"reference":[{"unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv.","key":"ref_1"},{"unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and Bhosale, S. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv.","key":"ref_2"},{"unstructured":"Wu, Y. (2016). Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. 
arXiv.","key":"ref_3"},{"key":"ref_4","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"doi-asserted-by":"crossref","unstructured":"Bostrom, K., and Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. arXiv.","key":"ref_5","DOI":"10.18653\/v1\/2020.findings-emnlp.414"},{"unstructured":"Brown, T.B. (2020). Language models are few-shot learners. arXiv.","key":"ref_6"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1162\/COLI_r_00312","article-title":"Neural network methods for natural language processing","volume":"44","author":"Liu","year":"2018","journal-title":"Comput. Linguist."},{"doi-asserted-by":"crossref","unstructured":"Sch\u00fctze, H., Manning, C.D., and Raghavan, P. (2008). Introduction to Information Retrieval, Cambridge University Press.","key":"ref_8","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1145\/2902362","article-title":"On the naturalness of software","volume":"59","author":"Hindle","year":"2016","journal-title":"Commun. ACM"},{"doi-asserted-by":"crossref","unstructured":"Xu, F.F., Alon, U., Neubig, G., and Hellendoorn, V.J. (2022, January 13). A systematic evaluation of large language models of code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA.","key":"ref_10","DOI":"10.1145\/3520312.3534862"},{"doi-asserted-by":"crossref","unstructured":"Wong, M.F., Guo, S., Hang, C.N., Ho, S.W., and Tan, C.W. (2023). Natural language generation and understanding of big code for AI-assisted programming: A review. Entropy, 25.","key":"ref_11","DOI":"10.3390\/e25060888"},{"doi-asserted-by":"crossref","unstructured":"Hellendoorn, V.J., and Devanbu, P. (2017, January 4\u20138). Are deep neural networks the best choice for modeling source code?. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany.","key":"ref_12","DOI":"10.1145\/3106237.3106290"},{"unstructured":"Dagan, G., Synnaeve, G., and Rozi\u00e8re, B. (2024). Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv.","key":"ref_13"},{"unstructured":"Karampatsis, R.M., and Sutton, C. (2019). Maybe deep neural networks are the best choice for modeling source code. arXiv.","key":"ref_14"},{"unstructured":"Feng, D., Zhang, Y., and Xu, Z. (2024). IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining. arXiv.","key":"ref_15"},{"doi-asserted-by":"crossref","unstructured":"Sachidananda, V., Kessler, J.S., and Lai, Y.A. (2021). Efficient domain adaptation of language models via adaptive tokenization. arXiv.","key":"ref_16","DOI":"10.18653\/v1\/2021.sustainlp-1.16"},{"doi-asserted-by":"crossref","unstructured":"Rabin, M.R.I., Hellendoorn, V.J., and Alipour, M.A. (2021, January 23\u201328). Understanding neural code intelligence through program simplification. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA.","key":"ref_17","DOI":"10.1145\/3468264.3468539"},{"doi-asserted-by":"crossref","unstructured":"Rabin, M.R.I., Hussain, A., and Alipour, M.A. (2022, January 13). Syntax-guided program reduction for understanding neural code intelligence models. 
Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, New York, NY, USA.","key":"ref_18","DOI":"10.1145\/3520312.3534869"},{"doi-asserted-by":"crossref","unstructured":"Svyatkovskiy, A., Lee, S., Hadjitofi, A., Riechert, M., Franco, J.V., and Allamanis, M. (2021, January 28). Fast and memory-efficient neural code completion. Proceedings of the 2021 IEEE\/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain.","key":"ref_19","DOI":"10.1109\/MSR52588.2021.00045"},{"unstructured":"Li, Y., Qi, S., Gao, C., Peng, Y., Lo, D., Xu, Z., and Lyu, M.R. (2022). A closer look into transformer-based code intelligence through code transformation: Challenges and opportunities. arXiv.","key":"ref_20"},{"unstructured":"Zheng, Y., Suneja, S., Zhuang, Y., Morari, A., and Laredo, J.A. (2022). Probing Model Signal Awareness. (App. 17\/315,701), U.S. Patent.","key":"ref_21"},{"doi-asserted-by":"crossref","unstructured":"Sennrich, R. (2015). Neural machine translation of rare words with subword units. arXiv.","key":"ref_22","DOI":"10.18653\/v1\/P16-1162"},{"doi-asserted-by":"crossref","unstructured":"Gilda, S. (2017, January 7). Source code classification using Neural Networks. Proceedings of the 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), NakhonSiThammarat, Thailand.","key":"ref_23","DOI":"10.1109\/JCSSE.2017.8025917"},{"unstructured":"Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D., Jiang, D., and Tang, D. (2021). Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv.","key":"ref_24"},{"doi-asserted-by":"crossref","unstructured":"Allamanis, M., and Sutton, C. (2013, January 18\u201319). Mining source code repositories at massive scale using language modeling. Proceedings of the 2013 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA.","key":"ref_25","DOI":"10.1109\/MSR.2013.6624029"},{"unstructured":"Karampatsis, R.M., Babii, H., Robbes, R., Sutton, C., and Janes, A. (July, January 27). Big code!= big vocabulary: Open-vocabulary models for source code. Proceedings of the ACM\/IEEE 42nd International Conference on Software Engineering, New York, NY, USA.","key":"ref_26"},{"unstructured":"Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Sauvestre, R., and Remez, T. (2023). Code llama: Open foundation models for code. arXiv.","key":"ref_27"},{"doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","key":"ref_28","DOI":"10.3115\/1073083.1073135"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1017\/S1351324910000306","article-title":"Steven Bird, Evan Klein and Edward Loper. Natural Language Processing with Python","volume":"Volume 17","author":"Xue","year":"2009","journal-title":"Natural Language Engineering"},{"key":"ref_30","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1965","journal-title":"Dokl. Akad. Nauk. 
SSSR"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/2\/73\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T10:32:51Z","timestamp":1759919571000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/2\/73"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,21]]},"references-count":30,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["info16020073"],"URL":"https:\/\/doi.org\/10.3390\/info16020073","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2025,1,21]]}}}
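
The abstract's method reduces to a compress-then-restore round trip: the compiler's lexer pre-tokenizes the source, identifiers are shortened through a contextual dictionary before the sequence reaches the LLM tokenizer, and the same dictionary maps the model's output back to the original names. The Python sketch below illustrates that idea only; the names compress and restore, the v0/v1 alias scheme, and the use of Python's built-in tokenize module as a stand-in for the compiler's lexical analyzer are illustrative assumptions, not details taken from the paper.

import io
import keyword
import re
import tokenize


def compress(source: str) -> tuple[str, dict[str, str]]:
    """Replace user identifiers with short aliases; return (compressed source, alias->name map)."""
    alias_of: dict[str, str] = {}  # original identifier -> alias
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Assign each newly seen identifier the next short alias (collisions
            # with pre-existing names such as "v0" are ignored in this sketch).
            out.append((tok.type, alias_of.setdefault(tok.string, f"v{len(alias_of)}")))
        else:
            out.append((tok.type, tok.string))
    # Invert the map so restore() can look originals up by alias.
    return tokenize.untokenize(out), {a: n for n, a in alias_of.items()}


def restore(generated: str, dictionary: dict[str, str]) -> str:
    """Map aliases in the model's output back to the original identifier names."""
    if not dictionary:
        return generated
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, dictionary)) + r")\b")
    return pattern.sub(lambda m: dictionary[m.group(0)], generated)


code = "def total_price(item_count, unit_price):\n    return item_count * unit_price\n"
compressed, table = compress(code)
print(compressed)                  # identifiers shrunk to v0, v1, v2
print(restore(compressed, table))  # original names recovered from the dictionary

Under a subword (BPE) tokenizer, long identifiers such as total_price typically split into several subtokens, so shortening them is where the paper's reported 33.7% input token reduction would plausibly come from; the sketch demonstrates only the lossless round trip, not the measured savings.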