Skip to content

Implementation of a table engine to consume application log files in ClickHouse#25969

Merged
kssenii merged 114 commits intoClickHouse:masterfrom
ucasfl:filelog-engine
Oct 24, 2021
Merged

Implementation of a table engine to consume application log files in ClickHouse#25969
kssenii merged 114 commits intoClickHouse:masterfrom
ucasfl:filelog-engine

Conversation

@ucasfl
Copy link
Copy Markdown
Collaborator

@ucasfl ucasfl commented Jul 4, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implementation of a table engine to consume application log files in ClickHouse. closes #6953.

Detailed description / Documentation draft:

Use case:

CREATE TABLE log (...) ENGINE= FileLog('/path/to/file_or_directory', 'FormatName');

CREATE TABLE target (...) ENGINE = ...;

CREATE MATERIALIZED VIEW mv TO target AS SELECT .... FROM log;

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Jul 4, 2021
@ucasfl ucasfl force-pushed the filelog-engine branch 3 times, most recently from 216d2ee to 2599987 Compare July 4, 2021 07:27
@kssenii kssenii self-assigned this Jul 4, 2021
@kssenii
Copy link
Copy Markdown
Member

kssenii commented Aug 18, 2021

@ucasfl hi! Will you finish this PR, or should I finish?

@ucasfl
Copy link
Copy Markdown
Collaborator Author

ucasfl commented Aug 19, 2021

@ucasfl hi! Will you finish this PR, or should I finish?

Hi, maybe you can give me more review and suggestions on current codes, I will continue to finish it recently.

Copy link
Copy Markdown
Member

@kssenii kssenii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will also be very nice to add tests :)
As for your to-do list in pr description:
Multi-threads to parse records from files - will be very good.

@ucasfl
Copy link
Copy Markdown
Collaborator Author

ucasfl commented Sep 3, 2021

It will also be very nice to add tests :)
As for your to-do list in pr description:
Multi-threads to parse records from files - will be very good.

So, if we want to implement multi-threads to parse records from files, we should monitor directory in Storage and dynamically create ReadBuffer?

@kssenii
Copy link
Copy Markdown
Member

kssenii commented Sep 3, 2021

So, if we want to implement multi-threads to parse records from files, we should monitor directory in Storage and dynamically create ReadBuffer?

Yes, I was thinking it would be better to have a monitoring task in storage which will be started when we start a reading loop in threadFunc. It will be stopped when we reschedule and resumed (opening files once again) when main task is resumed.
Overall idea it can be similar to RabbitMQ or Kafka but probably in this case there is no need for a staticaly defined set of read buffers and they can be created dynamically once FileLogSources gets initialized.

@ucasfl ucasfl requested a review from kssenii October 21, 2021 15:34
@kssenii
Copy link
Copy Markdown
Member

kssenii commented Oct 23, 2021

Functional stateless tests (thread) -- failed because of Test internal error: ConnectionRefusedError -- not related.
Integration tests (thread) — Timeout, fail: 43 - not related to changes

@kssenii kssenii merged commit 7383bdd into ClickHouse:master Oct 24, 2021
@ucasfl
Copy link
Copy Markdown
Collaborator Author

ucasfl commented Oct 24, 2021

Thanks for carefully review! Learned a lot from it. @kssenii

@ucasfl ucasfl mentioned this pull request Oct 24, 2021
@alexey-milovidov
Copy link
Copy Markdown
Member

@ucasfl Congratulations! 🎉 ❤️

@sevirov
Copy link
Copy Markdown
Contributor

sevirov commented Oct 24, 2021

Internal documentation ticket: DOCSUP-17088

@@ -0,0 +1,15 @@
option (ENABLE_FILELOG "Enable FILELOG" ON)
Copy link
Copy Markdown
Member

@azat azat Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW ENABLE_FILELOG looks redundant, since usually find_foobar created only for some third-party libraries.
It can be enabled for linux only and that's it. @ucasfl thoughts?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree with it.


fd = inotify_init();
if (fd == -1)
throw Exception("Cannot initialize inotify", ErrorCodes::IO_SETUP_ERROR);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throwFromErrno ?

int wd = inotify_add_watch(fd, path.c_str(), mask);
if (wd == -1)
{
owner.onError(Exception(ErrorCodes::IO_SETUP_ERROR, "Watch directory {} failed", path));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throwFromErrno ?


for i in {1..20}
do
echo $i, $i >> ${user_files_path}/a.txt
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use either unique name, or at least include test name, this way it can be run in parallel.


for i in {1..20}
do
echo $i, $i >> ${user_files_path}/logs/a.txt
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And not only this test cannot be run in parallel but it also cannot be run in parallel with other tests from this PR, i.e. 02024_storage_filelog_mv.sh, since it uses the same name. Am I right?

azat added a commit to azat/ClickHouse that referenced this pull request Aug 6, 2022
This will allow to distinguish really corrupted data, since right now if
you will CREATE/DETACH/ATTACH such engine you will have the following
error [1]:

    2022.08.05 20:02:20.726398 [ 696405 ] {} <Error> StorageFileLog (file_log): Metadata files of table file_log are lost.

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/39926/72961328f68b1ec05300d6dc4411a87618a2f46b/stress_test__debug_.html

Likely that previously it was not created to avoid creating empty
directories, however this should be a problem I guess.

Refs: ClickHouse#25969 (@ucasfl)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFC. Allow to subscribe to File table.

8 participants