Skip to content

feat: add a 1brc demo using Proton to the examples#658

Merged
jovezhong merged 24 commits intotimeplus-io:developfrom
ayewo:feat/1brc-demo
Apr 16, 2024
Merged

feat: add a 1brc demo using Proton to the examples#658
jovezhong merged 24 commits intotimeplus-io:developfrom
ayewo:feat/1brc-demo

Conversation

@ayewo
Copy link
Copy Markdown
Contributor

@ayewo ayewo commented Apr 12, 2024

PR checklist:

  • Did you run ClangFormat ? N/A
  • Did you separate headers to a different section in existing community code base ? N/A
  • Did you surround proton: starts/ends for new code in existing community code base ? N/A

This PR adds an example of using Proton to take part in the 1brc (One Billion Row Challenge) so I can /claim #527.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 12, 2024

CLA assistant check
All committers have signed the CLA.

@yokofly yokofly requested a review from jovezhong April 13, 2024 02:16
@ayewo ayewo changed the title [WIP] feat: add a demo1brc using Proton to the examples feat: add a 1brc demo using Proton to the examples Apr 13, 2024
@ayewo ayewo marked this pull request as draft April 13, 2024 09:22
@ayewo ayewo marked this pull request as ready for review April 13, 2024 22:25
@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 13, 2024

@jovezhong this is now ready for review, thanks.

@jovezhong
Copy link
Copy Markdown

/tip $25

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Apr 14, 2024

@jovezhong
Copy link
Copy Markdown

Thank you, @ayewo , for submitting the PR. We will share you review comments in next 3 working days. In the meanwhile, I have initiated the reward process to appreciate your effort regardless the result of PR merge.

@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 14, 2024

In the meanwhile, I have initiated the reward process to appreciate your effort regardless the result of PR merge.

Thanks, I appreciate that, @jovezhong.

The tip hasn’t been awarded yet. I think this is because since this is your first tip, you’ll need to do a one-time setup in Algora.

The @algora-pbc bot already shared the link for the one-time setup in a comment above: https://console.algora.io/org/timeplus/bounties?status=open but the messaging is not clear that there’s one more step for you before the tip is paid.

@jovezhong
Copy link
Copy Markdown

yes, I tried that link and realized I need to invite our COO to join the algora organization to setup the payment method with corporate card. Since it's weekend, please allow a couple days delay. Once the PR is merged, you will get the other reward. Maybe these two will be paid together. Feel free to DM me on community slack regarding algora workflow.

I will share the PR review comments in 3 business days.

@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 15, 2024

Comparing ClickHouse with Proton

Below are the execution times for ClickHouse and Proton when I ran the 1brc demo on my machine:

ClickHouse Proton
47.053 seconds 39.038 seconds

Extra Optimization Step

One optimization technique used by the author of the DuckDB SQL solution to make the query finish faster was to convert the 1brc CSV into a columnar-format like Parquet.

Converting the CSV to Parquet

  1. Install csv2parquet:
brew install domoritz/homebrew-tap/csv2parquet
  1. Create a schema.json that will guide csv2parquet on how to do the conversion from CSV-to-Parquet:
cat <<EOF > schema.json
{
  "fields": [
    {
      "name": "city",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "temperature",
      "data_type": "Float64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
EOF
  1. Start the conversion process using the schema.json file:
time csv2parquet --header true --delimiter ';' --schema-file schema.json measurements.txt measurements.parquet

Comparing ClickHouse with Proton Again

After performing the conversion, I was hoping I could further reduce the execution for both databases. These are the execution times on my machine (I tested multiple times):

Format ClickHouse Proton
CSV 47.053 seconds 39.038 seconds
Parquet 20.431 seconds 96.62 seconds

It seems Proton is not as competitive against ClickHouse when the dataset is in the Parquet format.

I did a quick cursory search by comparing the ClickHouse and Proton repos with respect to their handling of the Parquet format and it seems that several optimizations that have landed in ClickHouse have yet to find their way into the Proton repo. (Maybe those optimizations are in the private fork of Proton used in Timeplus Cloud?)

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Apr 15, 2024

🎉🎈 @ayewo has been awarded $25! 🎈🎊

Copy link
Copy Markdown

@jovezhong jovezhong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ayewo. I just shared my initial comments. Looks great and happy to see the nice number. Please use Timeplus Proton instead of Proton. I would also suggest running a test with data stored in Timeplus. Since we use columnar format, the aggregation should run faster than others. In the meanwhile, I will also reproduce your test/demo.

Any plan to publish this in any blog systems? We will be happy to post this on timeplus.com/blog for sure, but if you have audience in other platforms, feel free to post it there, once the blog is finalized.

A video will be nice to have.

@jovezhong
Copy link
Copy Markdown

Comparing ClickHouse with Proton Again

Format ClickHouse Proton
CSV 47.053 seconds 39.038 seconds
Parquet 20.431 seconds 96.62 seconds
It seems Proton is not as competitive against ClickHouse when the dataset is in the Parquet format.

I did a quick cursory search by comparing the ClickHouse and Proton repos with respect to their handling of the Parquet format and it seems that several optimizations that have landed in ClickHouse have yet to find their way into the Proton repo. (Maybe those optimizations are in the private fork of Proton used in Timeplus Cloud?)

Thanks for going to the extra mile to test Parquet. You might be right. If we port the latest optimizations about Parquet from ClickHouse community to Timeplus Proton, the numbers could be better. I know the Timeplus Proton engineers are working on some enhancements for Avro and nested data at this moment, Parquet enhancement will be planned after that.

On the other hand, Proton or ClickHouse is more than a federated search like Trino (no storage, query on the fly). I think if you load the 1 billion rows to Proton table storage, it performs even better than Parquet, because of database index, SIMD, etc. Not sure whether the players in 1brc challenge commonly use database index.

@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 15, 2024

On the other hand, Proton or ClickHouse is more than a federated search like Trino (no storage, query on the fly). I think if you load the 1 billion rows to Proton table storage, it performs even better than Parquet, because of database index, SIMD, etc. Not sure whether the players in 1brc challenge commonly use database index.

I just learned of Trino from your comment, nice.

Well, the time it takes to load 1 billion rows into Proton's storage will often be quite significant. By the time it is done ingesting a 13GB file, the load time will end up negating any potential speedup from running the query from an internal table (with indexes).

One of folks that used Oracle did try using an index, but the gains were marginal at best.

@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 15, 2024

Thank you @ayewo. I just shared my initial comments. Looks great and happy to see the nice number.

Thanks for the detailed feedback.

Any plan to publish this in any blog systems? We will be happy to post this on timeplus.com/blog for sure, but if you have audience in other platforms, feel free to post it there, once the blog is finalized.

I could re-publish it on my blog but I imagine you'll publish it first so your blog can have the canonical link, correct?

A video will be nice to have.

I'm thinking of titling it: "1brc Database Shootout: Postgres vs DuckDB vs ClickHouse vs Timeplus Proton" where the execution times of the above mentioned databases is compared on an EC2 instance.

Specifically it will be on a c5a.12xlarge which was also used in this blog post.

@jovezhong
Copy link
Copy Markdown

Hi @ayewo , thanks for the update. I think this PR is ready to be merged. I will ping you in our community slack for the award details.

Once the PR is merged. I will work with our designer and website editor to publish this to timeplus.com/blog, then you can post the same content in your website with the canonical link to the post on timeplus.com.

If you can work on a video, you probably will get the 3rd reward.

@jovezhong
Copy link
Copy Markdown

/tip $150

@jovezhong
Copy link
Copy Markdown

Hi @ayewo , as we discussed on Slack, I just initiated the 2nd reward to you. The program output (those content in the pre tag) are not eligible for word count, since they are generated. The eligible content word count is between 1031 to 1227, depending on whether SQL or footnote links are counted. You are paid for the 1001-1500 word count tier. If you publish a video about this, you will be eligible for another award.

Thank you for the great content and look forwards to more awesome learning&sharing from you.

I will merge the PR now and work on the blog publishing on timeplus.com/blog

@jovezhong jovezhong merged commit a3c4090 into timeplus-io:develop Apr 16, 2024
@ayewo
Copy link
Copy Markdown
Contributor Author

ayewo commented Apr 16, 2024

Hi @jovezhong

It was awesome working with you and your other team mate on this. And thanks for being one of the few projects on Algora that pay promptly 👍.

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Apr 16, 2024

@algora-pbc
Copy link
Copy Markdown

algora-pbc bot commented Apr 16, 2024

🎉🎈 @ayewo has been awarded $150! 🎈🎊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants