When it comes to LLMs, 2023 was the year of Open Source AI.
At the end of 2022, the quality delta between best open (Bloom) and closed (GPT-3.5) LLM, as measured by MMLU scores, was 90%. At the end of 2023, this delta between GPT-4 and Mixtral-MoE-7B stands at 13%.
Co-founder, CEO @togethercompute
San Francisco, CA
Joined April 2008
- The era of sub-quadratic LLMs is about to begin. At @togethercompute we've been building next gen models with large space state architectures and training them on very long sequences and the results from the recent builds are... incredible. Will share more as we get closer to
- Together.ai API now offers a 32K context model, built with FlashAttention-2 for $0.20 per 1 M tokens. 300x cheaper than closest commercial model at 32K context (GPT-4). Smaller, but for many long context tasks like RAG, it’s excellent. And you can fine tune it.
- Now hearing fairly regularly how well RedPajama-INCITE-7B performs across enterprise use cases. Several companies have replaced OpenAI with it, and we will soon announce a new partner who is deploying solutions in regulated industries based on the model. huggingface.co/togethercomput…
- We just got 1024 A100s up and running at @togethercompute!! We are offering short-term dedicated access to AI startups anywhere from 16-128 GPUs. Clusters come pre-configured with distributed training software. Available immediately (while supplies last) 🚀🚀🚀
- The @togethercompute inference team achieved another performance milestone. Now serving 140 TPS on 671B param R1 model, ~3x faster than Azure, ~5.5x faster than DeepSeek API on @nvidia GPUs. APIs @ api.together.xyz and chat + web search @ chat.together.ai
- .@togethercompute is building 2 gigawatts of AI factories (~100,000 GPUs) in the EU over the next 4 years with the first phase live in H2 '2025. AI compute is at <1% saturation relative to our 2035 forecast and we are starting early to build a large-scale sustainable AI cloud
- OpenAI API compatibility shipped for 100+ models on @togethercompute API. Replace GPT calls with Mixtral or Llama-70B, get faster responses and for less $$ 🚀🚀🚀Transitioning from OpenAI to Mixtral? Simply add your TOGETHER_API_KEY, change the base URL to api.together.xyz, and swap the model name. Oh, and Mixtral Instruct v0.1 is now live on Together API 🙌
- The RedPajama-V2 dataset has been downloaded 1.2M times in the last month on @huggingface. It’s a great metric of the level of agency in core AI development today, and how vast the open source (and custom) AI surface is going to be.
- Llama-3 is Linux.
- The first @togethercompute GB200 cluster CDUs imbibing coolant in prep to go live next week! Each rack here is 1.4 exaflops of inference performance!
- Rolling out a new inference stack for DeepSeek R1 @togethercompute that gets up to 110 t/s on the 671B parameter model!
- Wow @anyscalecompute is benchmark washing their API’s terrible performance. All you need is curl and time. Same request @togethercompute 3x faster for Llama2 70B model — 72 t/s vs 23 t/s (7.04s vs 21.87s) And this model is under heavy load! Our dedicated instances are📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market. Our goal with this leaderboard is to equip users and developers with a clear understanding of the
- We released Turbo and Lite versions of Llama-3 today that incorporate our latest research in optimization and quantization. Lite models are 6x cheaper than GPT-4o mini, possibly the most cost efficient inference in the world right now. Turbo models provide bestReplying to @togethercomputeTogether Lite endpoints provide the lowest cost for Llama 3, making high-quality AI models more affordable than ever, with Llama 3 8B Lite priced at $0.10 per million tokens, 6x lower cost than GPT-4o-mini. Together Lite leverages a number of optimizations including INT4











