With the recent release of #TinyLlama, SLMs have attracted a lot of attention. I re-released my previously trained SLM, LiteLlama, under the MIT license; it has 460M parameters and was trained on 1T tokens. I hope to contribute a bit to the community.
Xiaotian (Max) Han
Training LLM
- 🧑💻 Spent <2 hrs learning Typst (@typstapp), a LaTeX alternative, and moved my CV (thx @mengliu_1998 for the template) to it. Given LaTeX's steeper learning curve, I'm really impressed with this lightweight yet powerful new typesetting system and will use it for non-paper things. 💯✍️
- It will be a special experience to present our paper as an @TmlrOrg Track Oral Presentation this Thanksgiving night at @LogConference. Grateful to the organizers for scheduling this at such a special time. Join us if you're interested and available, and Happy Thanksgiving!
  Quoted post: 🤔 Sharing some old thoughts that graph convolution is equivalent to Mixup tx.ag/gcnmixup It was previously believed that the effectiveness of graph convolution stems from neighbor aggregation, which enhances or enriches the representation of the target node. However,
- 🔥 Thrilled to train a lite LLaMa_1 and a lite LLaMa_2, now available on Hugging Face! While they are not large models, they were trained across multiple nodes. First step into the world of LLM 🤗. View the training loss curve
- Super excited and deeply grateful to receive the #NeurIPS2023 Scholar Award, which supports my attending for the first time ever. Thanks to the great organizers, who are not average at all. See you all in New Orleans!
  Quoted post: Good day! Average researcher here going through thousands of NeurIPS financial aid applications trying to roll out all decisions by Monday. (Some of you should have already received decisions; others: thanks for your patience!)
- 📢 Looking for easy-to-use fairness baselines? Curious about utility-fairness trade-off control? Unsure about training endpoints? Check out our new benchmark paper for answers!👇 Code: github.com/ahxt/fair_fair… Paper: arxiv.org/abs/2306.09468 #AI #MachineLearning #Fairness
- Our SelfExtend (#ICML2024) was highlighted in a Google I/O session at youtu.be/TV7qCk1dBWA?t=… to demonstrate the long-context ability of Gemma. SelfExtend is already a go-to method to extend the context window not only for Gemma, but also for Llama, Mistral, and more.
  Quoted post: If you missed it, this I/O session on LLMs with Keras 3 is a great tutorial on LLM training and fine-tuning best practices youtu.be/TV7qCk1dBWA
- Thrilled to share that this paper has been accepted by #ICLR2024! It offers a range of user-friendly fairness methods, metrics, and datasets. Please try them out! We hope this project can facilitate fairness research and welcome contributions of new fairness algorithms!
  Quoted post: 📢 Looking for easy-to-use fairness baselines? Curious about utility-fairness trade-off control? Unsure about training endpoints? Check out our new benchmark paper for answers!👇 Code: github.com/ahxt/fair_fair… Paper: arxiv.org/abs/2306.09468 #AI #MachineLearning #Fairness
- 🚀 New Research: Thinking Preference Optimization! We boost LLM reasoning by using long CoT as preferred examples & short CoT as rejected in DPO training. ✨ Key insight: Careful curation of long/short CoT pairs enhances reasoning ability. tx.ag/ttpo
- Excited to attend #NeurIPS2023 from Dec 9-15! Can't wait to reconnect and meet new minds. 🎓 I am on the academic job market for 2023-2024 and am keen on discussing opportunities! ahxt.github.io #academicjobs #openrank #tenuretrack
- We introduce Thinking Preference Optimization (ThinkPO)—a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO leverages existing short CoT responses as rejected answers and long CoT responses as chosen
- SelfExtend, without further training, upgrades Mistral-inst-v0.1 to match the performance level of its successor, v0.2, on QA tasks. Therefore, is the value of SelfExtend at least equivalent to the training cost of Mistral-inst-v0.2?
- Replying to @cwolferesearch: Thanks for sharing!!! Thanks!! Please see our repo for the simple implementation github.com/datamllab/Long…
- Curious whether LLM architectures improve over time? 🤔 We conducted a preliminary experiment comparing training loss curves for different architectures. To ensure a (relatively) fair comparison, we use 1) (almost) the same parameter count, 2) the same training data, 3) the same training










