feat: add a 1brc demo using Proton to the examples #658
jovezhong merged 24 commits into timeplus-io:develop
|
@jovezhong this is now ready for review, thanks. |
|
/tip $25 |
|
Thank you, @ayewo, for submitting the PR. We will share our review comments within the next 3 working days. In the meantime, I have initiated the reward process to show appreciation for your effort, regardless of whether the PR is merged. |
Thanks, I appreciate that, @jovezhong. The tip hasn’t been awarded yet. I think this is because it is your first tip, so you’ll need to do a one-time setup in Algora. The @algora-pbc bot already shared the link for the one-time setup in a comment above: https://console.algora.io/org/timeplus/bounties?status=open but the messaging does not make it clear that there’s one more step for you before the tip is paid. |
|
Yes, I tried that link and realized I need to invite our COO to join the Algora organization to set up the payment method with a corporate card. Since it's the weekend, please allow for a delay of a couple of days. Once the PR is merged, you will get the other reward. Maybe the two will be paid together. Feel free to DM me on the community Slack regarding the Algora workflow. I will share the PR review comments in 3 business days. |
Comparing ClickHouse with Proton
Below are the execution times for ClickHouse and Proton when I ran the
Extra Optimization Step
One optimization technique used by the author of the DuckDB SQL solution to make the query finish faster was to convert the
Converting the CSV to Parquet
brew install domoritz/homebrew-tap/csv2parquet
cat <<EOF > schema.json
{
  "fields": [
    {
      "name": "city",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "temperature",
      "data_type": "Float64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
EOF
time csv2parquet --header true --delimiter ';' --schema-file schema.json measurements.txt measurements.parquet
Comparing ClickHouse with Proton Again
After performing the conversion, I was hoping I could further reduce the execution time for both databases. These are the execution times on my machine (I tested multiple times):
It seems Proton is not as competitive with ClickHouse when the dataset is in the Parquet format. I did a quick search comparing how the ClickHouse and Proton repos handle the Parquet format, and it seems that several optimizations that have landed in ClickHouse have yet to find their way into the Proton repo. (Maybe those optimizations are in the private fork of Proton used in Timeplus Cloud?) |
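For anyone who wants to repeat the "tested multiple times" approach, here is a minimal Python sketch of a timing harness. It is only an illustration: the `time_runs` helper and the stand-in workload are hypothetical and not part of this PR. In practice the callable would invoke the ClickHouse or Proton client with the actual query.

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run fn `repeats` times and summarise the wall-clock durations in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "best": min(durations),
        "median": statistics.median(durations),
        "worst": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload (hypothetical); replace with a call that runs the real query.
    summary = time_runs(lambda: sum(range(100_000)), repeats=3)
    print({name: round(seconds, 4) for name, seconds in summary.items()})
```

Reporting the best and median of several runs, rather than a single run, reduces the influence of file-system caching and other background noise on the comparison.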
|
🎉🎈 @ayewo has been awarded $25! 🎈🎊 |
jovezhong left a comment
Thank you @ayewo. I just shared my initial comments. Looks great, and I'm happy to see the nice numbers. Please use Timeplus Proton instead of Proton. I would also suggest running a test with the data stored in Timeplus. Since we use a columnar format, the aggregation should run faster than the others. In the meantime, I will also reproduce your test/demo.
Any plans to publish this on a blog platform? We will be happy to post this on timeplus.com/blog for sure, but if you have an audience on other platforms, feel free to post it there once the blog is finalized.
A video would be nice to have.
Thanks for going the extra mile to test Parquet. You might be right. If we port the latest Parquet optimizations from the ClickHouse community to Timeplus Proton, the numbers could be better. I know the Timeplus Proton engineers are working on some enhancements for Avro and nested data at the moment; Parquet enhancements will be planned after that. On the other hand, Proton or ClickHouse is more than a federated search like Trino (no storage, query on the fly). I think if you load the 1 billion rows into Proton table storage, it will perform even better than Parquet, because of database indexes, SIMD, etc. Not sure whether the players in the 1brc challenge commonly use database indexes. |
I just learned of Trino from your comment, nice. Well, the time it takes to load 1 billion rows into Proton's storage will often be quite significant. By the time it has finished ingesting a 13GB file, the load time will have negated any potential speedup from running the query against an internal table (with indexes). One of the folks who used Oracle did try using an index, but the gains were marginal at best. |
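The trade-off described here can be put into numbers with a small Python sketch. Everything in it is an assumption for illustration: the `break_even_queries` helper is hypothetical and the timings in the example are made up, not measurements from this PR. Loading first only pays off when `load_s + n * internal_query_s < n * external_query_s`, that is, when `n > load_s / (external_query_s - internal_query_s)`.

```python
import math


def break_even_queries(load_s: float, external_query_s: float, internal_query_s: float) -> float:
    """Return the smallest number of queries for which loading first is faster overall.

    Returns math.inf when the internal query is not faster than the external one,
    because the load time can then never be recovered.
    """
    saving_per_query = external_query_s - internal_query_s
    if saving_per_query <= 0:
        return math.inf
    # Smallest integer n with n > load_s / saving_per_query.
    return math.floor(load_s / saving_per_query) + 1


if __name__ == "__main__":
    # Made-up timings in seconds, purely for illustration.
    print(break_even_queries(load_s=300, external_query_s=30, internal_query_s=5))
```

With these made-up timings the load only pays for itself from the 13th query onward, which is why a challenge that runs the query once tends to favour querying the file directly.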
Thanks for the detailed feedback.
I could re-publish it on my blog but I imagine you'll publish it first so your blog can have the canonical link, correct?
I'm thinking of titling it: "1brc Database Shootout: Postgres vs DuckDB vs ClickHouse vs Timeplus Proton" where the execution times of the above-mentioned databases are compared on an EC2 instance. Specifically it will be on a |
|
Hi @ayewo, thanks for the update. I think this PR is ready to be merged. I will ping you in our community Slack about the award details. Once the PR is merged, I will work with our designer and website editor to publish this on timeplus.com/blog; then you can post the same content on your website with the canonical link pointing to the post on timeplus.com. If you can work on a video, you will probably get the 3rd reward. |
|
/tip $150 |
|
Hi @ayewo, as we discussed on Slack, I just initiated the 2nd reward for you. The program output (the content in the pre tags) is not eligible for the word count, since it is generated. The eligible word count is between 1031 and 1227, depending on whether SQL or footnote links are counted. You are paid for the 1001-1500 word count tier. If you publish a video about this, you will be eligible for another award. Thank you for the great content; I look forward to more awesome learning & sharing from you. I will merge the PR now and work on publishing the blog on timeplus.com/blog |
|
Hi @jovezhong, it was awesome working with you and your other teammate on this. And thanks for being one of the few projects on Algora that pays promptly 👍. |
|
🎉🎈 @ayewo has been awarded $150! 🎈🎊 |
PR checklist:
proton: starts/ends for new code in existing community code base? N/A
This PR adds an example of using Proton to take part in the 1brc (One Billion Row Challenge) so I can /claim #527.
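For readers unfamiliar with the challenge, the aggregation it asks for (minimum, mean and maximum temperature per city) can be sketched in a few lines of Python. This is a reference sketch on a tiny in-memory sample, not the Proton SQL in this PR: the `aggregate` helper is hypothetical, the input shape `city;temperature` follows the `schema.json` used for the Parquet conversion above, and the rounding required by the official challenge output is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(lines: Iterable[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {city: (min, mean, max)} for lines shaped like 'city;temperature'."""
    # Per-city running state: [min, sum, count, max].
    stats = defaultdict(lambda: [float("inf"), 0.0, 0, float("-inf")])
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ';' so a ';' inside a city name would not break parsing.
        city, _, value = line.rpartition(";")
        temperature = float(value)
        entry = stats[city]
        entry[0] = min(entry[0], temperature)
        entry[1] += temperature
        entry[2] += 1
        entry[3] = max(entry[3], temperature)
    # Cities are emitted in alphabetical order.
    return {
        city: (lo, total / count, hi)
        for city, (lo, total, count, hi) in sorted(stats.items())
    }


if __name__ == "__main__":
    sample = ["Lagos;30.0", "Oslo;-2.5", "Lagos;26.0"]  # made-up sample rows
    for city, (lo, mean, hi) in aggregate(sample).items():
        print(f"{city}={lo}/{mean}/{hi}")
```

A pure-Python loop like this is far too slow for the full 1 billion rows; it is only meant to make explicit what the SQL solutions compute.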