Intern Tasks 2022/2023 #42194
Description
This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous lists here:
2021/2022: #29601
2020/2021: #15065
2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156
The tasks should be:
- not too hard (doable within about a month), but usually not less than a week of work;
- not altering core components of the system;
- mostly isolated, not requiring full knowledge of the system;
- somewhat interesting to implement, or containing a point of research;
- not in the critical path of our roadmap (ok to be thrown away after a year);
- mostly for C++ developers, but there should also be tasks for frontend developers, or tools/research that only require Go/Python/whatever;
- some tasks should allow teamwork;
- covering various skills, e.g. system programming, algorithm knowledge, etc.
Implementation of "clickhouse-top" interactive command-line application
@a-kush and, independently, @AlexNsf
If you like terminal UIs.
A terminal application, similar to btop++ that presents a dashboard of the most important ClickHouse and system metrics for local or remote instances or clusters: currently running queries, resource usage, etc.
SSH protocol for ClickHouse
Booked
If you like networks and Linux.
With the help of libssh2, ClickHouse server will support ssh connections. For every client connecting to the ssh port, it will create a pseudoterminal (pty) and run clickhouse-client embedded into the server to provide an interactive experience. Users will only need an ssh client to get an interactive clickhouse-client experience.
Depends on and continues #41109.
Note: clickhouse-client inside clickhouse-server should be restricted in comparison to a standalone clickhouse-client.
Note: similar effect can be implemented with sshd, but running clickhouse-client as is will be insecure.
Websocket protocol for ClickHouse
Backend + Frontend.
The websocket protocol will be similar to the existing HTTP protocol, but will provide more granular and reactive communication and include the capabilities of the native protocol, such as progress, metrics, and logs packets, and streaming result consumption.
It should be used in the ClickHouse Play UI to show query progress, metrics, and logs.
Web app with a terminal emulator running clickhouse-client
ClickHouse server will create a pseudoterminal (pty) and run clickhouse-client embedded into the server and communicate with the client by sending terminal commands over websocket. The client will run a javascript terminal emulator library (there are a lot of them: https://xtermjs.org/).
Note: clickhouse-client inside clickhouse-server should be restricted in comparison to a standalone clickhouse-client.
Note: this task can be implemented together with the "Websocket protocol for ClickHouse" as it will use the websocket protocol.
Connection and DNS logger in ClickHouse
Depends on #41198
Add a handler to ClickHouse server that will accept connections and immediately close them. All the connection info will be written into a predefined table.
Add another handler to ClickHouse server that will accept DNS requests. All the info about the request will be written into a predefined table.
It can be used for passive collection of HTML preconnect and dns-prefetch hints:
https://developer.mozilla.org/en-US/docs/Web/HTML/Link_types/preconnect
And subsequently, this mechanism can be used as a rudimentary web analytics system.
Note: there is not much point in embedding this feature into ClickHouse, as you can simply use netcat + tcpdump and write data into ClickHouse as a script. Nevertheless, this task serves as a good exploration.
Dataset watermarking
Data Science, any programming language (it is even possible to write it in Rust).
You distribute a dataset to several parties but want to add random, undetectable modifications that make it possible to identify a party if the dataset is distributed further.
Add random modifications to the data to track the dataset source.
Encode a 64-bit identifier by pseudorandomly altering the data, so that it can be extracted back by XOR-ing with the original dataset, while keeping the following properties:
- spread the modifications uniformly across the dataset, so even if only part of it has been redistributed, the watermark can be still read;
- keep the ratio of modified bits of the dataset low;
- the modifications should be undetectable by third parties - only make low-entropy changes under some model of the data (e.g. a Markov model) - for example, prefer to replace the word "make" with "cake" rather than ")ake";
- the watermark should not repeat in the data - you can use streaming encryption for it;
- the watermark should be readable even if another watermark is applied or if the dataset has been slightly modified by a third party;
- the watermark should be readable even if the data was reordered, or if some columns were removed or new columns were added.
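The core encode/extract mechanism could be sketched as follows (a minimal Python illustration, assuming an integer column, key-derived row selection, and access to the original dataset at extraction time; all names are hypothetical and the robustness properties above are not addressed here):

```python
import hashlib
import random

def watermark_positions(key: str, n_rows: int, payload_bits: int):
    # Derive a deterministic permutation of row indices from the key,
    # so each payload bit gets a distinct row, spread uniformly.
    rng = random.Random(hashlib.sha256(key.encode()).digest())
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return idx[:payload_bits]

def embed(rows, payload: int, key: str, bits: int = 64):
    # Flip the least significant bit of a selected row wherever the
    # corresponding payload bit is 1.
    out = list(rows)
    for b, pos in enumerate(watermark_positions(key, len(rows), bits)):
        out[pos] ^= (payload >> b) & 1
    return out

def extract(marked, original, key: str, bits: int = 64):
    # XOR against the original dataset recovers exactly the flipped bits.
    payload = 0
    for b, pos in enumerate(watermark_positions(key, len(original), bits)):
        payload |= ((marked[pos] ^ original[pos]) & 1) << b
    return payload
```

Only 64 of the rows are modified, so the ratio of changed bits stays low; surviving reordering or column removal would require redundant, content-addressed placement instead of index-based placement.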
Integration of SQLSmith to ClickHouse
Not too hard, low amount of code.
SQLSmith is a good randomized query testing tool. Run it against ClickHouse and see what happens. Try to automate it for ClickHouse continuous integration.
Real-time resultset updates during query execution
Booked, @alexX512
Add an option to allow the query pipeline to return results on partially read data.
Extend the protocol to allow these "preview" results to be interactively displayed in clickhouse-client.
For example, for a query with GROUP BY, it will allow showing continuously accumulated data in the client.
The minimum goal of this task is to add an option to show the partial result on query cancellation.
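The behavior can be modeled as a streaming aggregation that periodically emits snapshots of its accumulated state (a hypothetical Python model for illustration only, not the actual pipeline code):

```python
from collections import Counter

def group_by_with_previews(rows, key, every=3):
    """Stream rows into a running GROUP BY ... count() and yield a
    preview of the accumulated state every `every` rows.
    The final yield is marked as the complete result."""
    acc = Counter()
    for i, row in enumerate(rows, 1):
        acc[key(row)] += 1
        if i % every == 0:
            yield dict(acc), False   # partial "preview" result
    yield dict(acc), True            # final result
```

In the real system the snapshots would travel as extra packet types in the native protocol, so clickhouse-client can redraw the accumulated GROUP BY state while the query is still running.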
Intelligent auto-complete based on the history from query_log
Booked
Backend + Data Science + optionally ML.
clickhouse-client will create a text model from the query_log and use it for completion. Recent queries have a higher weight. The text model can be built on top of the tokens after Lexer. It can be a mix of order-0..2 frequencies.
If you're lucky, you can use modern ML with deep learning (clickhouse-client will load pre-trained models).
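The order-0..2 frequency mix with recency weighting could look like this (an illustrative Python sketch over pre-lexed tokens; class and method names are hypothetical):

```python
from collections import Counter, defaultdict

class CompletionModel:
    """Token completion from query_log: order-0..2 token frequencies,
    exponentially decayed so recent queries carry a higher weight."""
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.counts = {0: Counter(),
                       1: defaultdict(Counter),
                       2: defaultdict(Counter)}

    def add_query(self, tokens):
        # Decay the old statistics, then add this query's n-grams.
        for order, table in self.counts.items():
            if order == 0:
                for t in table:
                    table[t] *= self.decay
            else:
                for ctx in table:
                    for t in table[ctx]:
                        table[ctx][t] *= self.decay
        for i, tok in enumerate(tokens):
            self.counts[0][tok] += 1
            if i >= 1:
                self.counts[1][(tokens[i-1],)][tok] += 1
            if i >= 2:
                self.counts[2][(tokens[i-2], tokens[i-1])][tok] += 1

    def suggest(self, context, k=3):
        # Back off from the longest matching context to plain unigrams.
        for order in (2, 1):
            ctx = tuple(context[-order:])
            if len(ctx) == order and ctx in self.counts[order]:
                return [t for t, _ in self.counts[order][ctx].most_common(k)]
        return [t for t, _ in self.counts[0].most_common(k)]
```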
Secondary indices for Memory tables
Available
Allow defining INDEXes of type hash for Memory tables. They will be represented simply as hash tables pointing to a block and a row number inside the block. These indices can speed up lookup queries.
Bloom-filter indices can be also implemented.
Query optimization with block-level hints
The query pipeline processes blocks of data. When a block of data is read from the source or when it is processed, we can attach some hints (knowledge) about the data in the form of expressions with some conditions, like "x <= 10". These hints can then be transformed, passed along the query pipeline, and used to optimize some calculations.
Unification of data caches and memory allocators
Booked, @NikitaEvs
Extra hard
Modern allocator (malloc/free) implementations do not return memory to the OS immediately on "free" but keep it inside their own caches (a global cache and per-thread caches), so it is quickly available for subsequent allocations.
If you use a cache for data in the application (for example a cache for query results), the cached entries are allocated in memory and the memory is freed when the cache entry is evicted. But it's easy to see that the data is still available in the allocator's cache even after it is evicted from the data cache. The data cache has freed the memory and no longer knows about the data, while the data still resides in memory. Keeping it in memory without it being accessible is pure waste.
Data caches in the application represent disposable memory - the memory that can be freed on demand. In case of memory shortage, the memory allocator can evict some entries from the cache and reuse their memory segments.
If there are many different caches in the application (query result cache, uncompressed cache, file page cache...) it is unclear how to configure them together - what maximum sizes to specify, and what should be the total size of the caches relative to the total memory amount on the server. A unified cost model can accommodate all the caches together.
The OS kernel has the page cache - for the data read from block devices. This cache is transparent for the application but is not controllable at all. It is not possible to specify relative query priorities in terms of cache usage, restrict the maximum amount of page cache per query, etc. Implementing a user-space page cache can solve this problem.
There is a minimum goal:
- allow data caches to occupy almost all free memory while allowing the general-purpose memory allocator to take memory from the cache.
And a maximum goal:
- implement userspace page cache and make almost all other caches in ClickHouse unified.
Is it a replacement for memory allocation functions?
Yes, there will be a custom malloc, and for not-too-small sizes it will be aware of the cache and cooperate with it.
The unified cost model is pretty easy to implement.
The cache will store arrays (spans of memory with data).
We can track:
- the size of the array;
- how much time did it take to do the calculations to get the data;
- exponentially smoothed access frequency (which is the mix of LRU and LFU and essentially a predictor of the next access time);
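Putting these three tracked quantities together, a per-entry score for the unified cost model could be sketched as (an illustrative Python sketch; the class and its parameters are hypothetical):

```python
class CacheEntry:
    """One cached span of memory, scored for the unified cost model."""
    def __init__(self, size: int, compute_seconds: float):
        self.size = size                        # bytes occupied
        self.compute_seconds = compute_seconds  # cost to recompute the data
        self.freq = 0.0          # exponentially smoothed access frequency
        self.last_access = 0.0

    def on_access(self, now: float, half_life: float = 60.0):
        # Decay the old frequency by the elapsed time, then count
        # this access - a mix of LRU (recency) and LFU (frequency).
        decay = 0.5 ** ((now - self.last_access) / half_life)
        self.freq = self.freq * decay + 1.0
        self.last_access = now

    def score(self) -> float:
        # Expected recomputation seconds saved per byte kept resident;
        # entries with the lowest score are evicted first when the
        # allocator needs to reclaim memory.
        return self.freq * self.compute_seconds / self.size
```

A single score of this shape lets heterogeneous caches (query results, uncompressed blocks, file pages) compete for the same memory budget instead of each having a hand-tuned maximum size.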
Heuristic schema inference for freeform log formats
@Posrabi
C++, Data Science, Very hard
Integration of osquery into ClickHouse
If you like Linux
Represent info about the OS as a database engine, table engines, and table functions.
This can be extended to get information from Kubernetes API and from the AWS metadata service.
✔️ Support for PRQL as a query language in ClickHouse
If you like programming languages.
After the brilliant work of Denis Bolonin, ClickHouse can statically link Rust libraries: https://clickhouse.com/docs/en/development/integrating_rust_libraries/
PRQL https://prql-lang.org/ is a declarative language similar to SQL, implemented as a Rust library. We can plug it into ClickHouse as a dialect.
Comparison of memory allocators in ClickHouse
Booked
Evaluate, compare and analyze various malloc libraries, including rpmalloc, mimalloc, and others: #34157
The ultimate goal of the task is to make a better memory allocator for ClickHouse.
Advanced code coverage collection on a per-test basis
If you like compilers.
The idea is to calculate code coverage with the detail of every single test. Then we will have the info - what tests covered the particular line of code, and what lines of code were covered by a particular test. With this info, we can add a TF*IDF-like metric for test relevance with respect to code modifications and run tests in order of importance for the changes.
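The TF*IDF-like ranking could be sketched as follows (an illustrative Python sketch; the input shape and function name are hypothetical): a changed line covered by few tests contributes more weight to those tests, analogous to a rare term in IDF.

```python
import math

def rank_tests(coverage: dict, changed_lines: set):
    """coverage: test name -> set of covered (file, line) identifiers.
    Rank tests by relevance to the changed lines: lines covered by
    fewer tests contribute a higher (IDF-like) weight."""
    n_tests = len(coverage)
    # "Document frequency": how many tests cover each line.
    df = {}
    for lines in coverage.values():
        for ln in lines:
            df[ln] = df.get(ln, 0) + 1
    scores = {}
    for test, lines in coverage.items():
        hit = lines & changed_lines
        scores[test] = sum(math.log(n_tests / df[ln]) + 1.0 for ln in hit)
    # Run the most relevant tests first.
    return sorted(scores, key=scores.get, reverse=True)
```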
For implementation, we will use the -fsanitize-coverage option of clang. It adds callbacks or counters to every basic block of the code. The information about the currently running test will be saved in a thread-local or a global variable. A custom format for intermediate coverage info should be added.
We will also implement a small UI to navigate this information.
Note: it is theoretically possible to reuse the available code coverage mechanism - simply by running all the tests sequentially and requesting a dump of the coverage info after every test. But it is extremely inefficient (proven to be impractical).
Fuzzy GROUP BY for data clustering
Algorithms.
Add a modifier WITH CLUSTER to GROUP BY element, defining a maximum distance between groups. Groups located close to each other will be merged.
Example:
SELECT user_id, min(event_time), max(event_time) FROM table GROUP BY user_id, event_time WITH CLUSTER 1800
This query will group together sessions of events for every user_id, where a "session" is a sequence of events with the distance between adjacent events no greater than 1800 seconds (half an hour).
It can be extended to the multidimensional case:
Example:
SELECT user_id, count() FROM table GROUP BY user_id, (x, y) WITH CLUSTER 10
This query will group together points (x, y) with L2-distance less than 10.
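For the one-dimensional case, the clustering behind WITH CLUSTER could be sketched as a greedy gap-based split (an illustrative Python sketch; the real implementation would run inside the aggregation pipeline):

```python
def cluster_1d(values, max_gap):
    """Sort the values and start a new group whenever the gap to the
    previous value exceeds max_gap - this merges all values that form
    a chain with pairwise-adjacent distance <= max_gap."""
    groups, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(v)
    if current:
        groups.append(current)
    return groups
```

With event timestamps and max_gap = 1800 this reproduces the sessionization example above; the multidimensional case no longer admits a simple sort, and needs union-find or DBSCAN-style neighbor merging instead.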
✔️ Direct import from files inside tar/zip/7z archives
Reading from files located inside archives with multiple files.
There are multiple places where we support reading from external files:
- SELECT ... FROM, INSERT INTO file/url/s3/hdfs
- INSERT ... FROM INFILE
And we already support:
- transparent reading of compressed files: gz, xz, bz2, lz4, zstd;
- globs for reading from multiple files: file{1..100}, file*... and parallel processing of them.
Now imagine files are located inside an archive with multiple files - like tar, zip, 7z - and we want to process a subset of files from the archive.
Example:
SELECT * FROM url('https://example.com/myfile{01..10}.tar.gz :: path/hello{1..9}.tsv')
Implementation will require:
- reading the archives and extracting ReadBuffers (byte streams) for a subset of files; if an archive contains an index, perform seeks for reading files; if not - implement streaming reads of files in the order of how they are placed in the archive.
Note: ClickHouse codebase already has support for reading files from zip archives.
Tamper-proof data storage with blockchain
Booked, @rubin-do
MergeTree tables in ClickHouse are represented by immutable checksummed data parts (the data parts are never modified, they can only be added and deleted). We can add a cryptographic hash to every data part (along with the existing checksum) and a cryptographic hash of all containing parts (to form a Merkle Tree).
It will give the user a way to validate that the data was not altered (by checking the hash).
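Folding the per-part hashes into a Merkle root could be sketched as (an illustrative Python sketch; function names are hypothetical and the real parts carry their existing checksums as leaves):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(part_hashes):
    """Fold a list of per-part hashes into a single root hash.
    Changing any part changes the root, so one value is enough to
    validate the whole table."""
    level = list(part_hashes)
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the odd leaf
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```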
We can also implement hash-cash (proof of work) for INSERTs. Every INSERT can be accompanied by some random number, and the ClickHouse server will only accept it if the hash of the existing data, the hash of the new data, and the random number, hashed together, gives a number divisible by some large number. So, clickhouse-server can be opened for INSERT queries to the public, but accept a limited amount of data to be inserted, with an automatically increasing price.
Although it looks mostly like a proof of concept, it has some practical applications like INSERTing data directly by JavaScript from third-party web pages.
As an alternative, the INSERT query can contain proof of payment in another blockchain (Ethereum) and the clickhouse-server can validate this proof as well as the authenticity of the insert request.
Cryptographic hashes for the data can be used to implement distributed content-addressable hosting of ClickHouse datasets.
Another application is reusing Git protocol and repositories for metadata replication - to store the data catalog with the set of data parts with hashes. This is also for proof of concept.
Please don't mint NFTs with ClickHouse.
Advanced compression methods in ClickHouse
Booked, @romanovsavelij
Modern general-purpose compression methods can be classified in the following categories in order of ratio/speed:
- LZ77 and similar: LZ4, Snappy: 2..5 GB/sec/core
- LZ77+Entropy: ZSTD, Brotli: 100..1000 MB/sec/core
- LZMA and BWT based: 10..100 MB/sec/core
- PPMD, PAQ: 1..10 MB/sec/core and less
- Neural network based: NNCP: extremely slow
ClickHouse uses only categories 1 and 2, as the others are too slow and impractical.
However, some modern implementations in category 3 have appeared, such as bsc and csc, and they can still have some applications - for example, reading external data over a very slow network connection (home internet). Let's add them to ClickHouse and see what happens.
See also https://encode.su/forums/2-Data-Compression
Note: there was also a task to add experimental libraries in category 1: LZSSE, Lizard, and density, but all of them appeared to be abandoned and have numerous bugs. If more interesting libraries are found, they can be integrated as well.
Remote filesystem over ClickHouse server
ClickHouse has support for virtual filesystems for data storage. This is used for integration with S3 and HDFS, to store datasets on remote web servers, and to implement data encryption.
Theoretically, ClickHouse can work over distributed filesystems (Ceph, GlusterFS, MooseFS, Lustre). Somehow it even works on NFS. But it is inefficient due to mismatching levels of abstraction (granularity of operations and error handling).
This task is to implement a handler to serve remote IO requests in ClickHouse and to allow plugging it in as a remote filesystem for ClickHouse. It is not a distributed filesystem, but more like simply plugging remote storage.
Integration of ClickHouse with Let's Encrypt
Let's Encrypt is the easiest way to get SSL certificates, and it can be easily automated with certbot.
Unfortunately, it is not so easy:
- port 80 should be opened on a server for certbot callback request;
- if multiple replicas are used under the same domain name, the keys and certificate have to be manually copied across them;
- this should be repeated every three months.
The idea is to add integration with Let's Encrypt directly into ClickHouse, so it will request the certificates instead of certbot.
The motivating example is the Caddy HTTP server. Unlike Nginx, it has built-in integration with Let's Encrypt, and this is considered a major advantage.
Size-capped Tables
✔️ Improvements for file table function and clickhouse-local
- Optimization of reading with mmap.
- An option to rename files after processing.
- File/S3/HDFS as database engine.
- Overlay database engine and representation of a directory as a database.
- Dashed settings in command-line arguments.
- Support for unicode dashes for command-line arguments.
Improvements for array and aggregate functions
- groupArrayIntersect
- groupConcat
- arrayFold
Improvements for import/export data formats
- MYD format
- XML as input format
- SQLite as a format
- AvroConfluent as output format
- JSON formats under a specified path
- HDF5
- sas7bdat
- mdb, dbf
Refreshable Materialized Views
Booked
Improvements for s3 integration
- ~/.aws in clickhouse-local.
- Get the default region from IMDS.
- Support for s3-style URLs.
- Support for PROFILE and ~/.aws/config.
Time series analysis with aggregate functions
- stationary tests
- shock events detection
- Holt-Winters forecast
JOIN optimization with in-memory compression
@Yur3k, @watemus
Enhancements for the URL storage engine
- Allow setting up retries and a fire-and-forget mode on INSERT.
- Support for directory listing by parsing Apache index pages or just a bunch of links from HTML.
✔️ Streaming import from a bunch of files in object storage
Alternative network protocols for replication
Booked @BoloniniD
ClickHouse is using physical replication. Replicas are exchanging files. The files are sent over a TCP connection.
The TCP protocol is designed to send a stream of bytes in sequential order. For networks with high packet drop, a high bandwidth-delay product, and the default TCP congestion control settings in Linux, the TCP protocol is inefficient at transferring files, especially if a single TCP connection is used. The effect is easily reproduced by having replicas on different geographical continents.
Try using multiple TCP connections to transfer different parts of a file, try other protocols such as enet, μTP, or UDT.
SFTP support in ClickHouse
Allow reading files over the SSH File Transfer Protocol with an sftp table function.
Text classification with ngram models
@AntonGorokhov and, independently, @0442A403
Backend, Algorithms, Data Science
The user can configure a set of text models, containing frequencies of ngrams or shingles in various text classes. ClickHouse server will load these models. A user will be able to classify text with SQL functions referencing these models like
SELECT ngramClassify('offensive', text)
A simple Bayes classifier can be used. The implementation should have an emphasis on performance rather than on precision.
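A simple character n-gram Bayes classifier could be sketched as (an illustrative Python sketch; class and method names are hypothetical, and a production version would precompute log-frequencies into flat arrays for performance):

```python
import math
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    # Character n-grams with their counts.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

class NgramClassifier:
    """Naive-Bayes scoring against per-class n-gram frequency models,
    in the spirit of ngramClassify('class', text)."""
    def __init__(self, n: int = 3, alpha: float = 1.0):
        self.n, self.alpha = n, alpha
        self.models = {}   # class -> Counter of n-gram frequencies
        self.totals = {}

    def train(self, cls: str, texts):
        model = Counter()
        for t in texts:
            model.update(ngrams(t, self.n))
        self.models[cls] = model
        self.totals[cls] = sum(model.values())

    def log_score(self, cls: str, text: str) -> float:
        # Log-likelihood with additive (Laplace) smoothing.
        model, total = self.models[cls], self.totals[cls]
        vocab = len(model) + 1
        return sum(c * math.log((model[g] + self.alpha)
                                / (total + self.alpha * vocab))
                   for g, c in ngrams(text, self.n).items())

    def classify(self, text: str) -> str:
        return max(self.models, key=lambda c: self.log_score(c, text))
```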
Large-scale internet technology survey
Data Science, Web
This task does not require C++ knowledge and is implemented mostly in SQL.
Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl), conduct research on technologies used on the web and prepare a dataset with extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/
Analysis of reused and adopted code on GitHub
Data Science
This task does not require C++ knowledge and is implemented in any programming language.
Given a dataset of almost all code on GitHub (50 TB), using code normalization and fingerprinting, make a database and a small service to easily find derived code across the repositories.
Code metrics service
Full Stack, Frontend
Make a web service to present various metrics of a Git repository, such as:
— files with minimum and maximum number of authors;
— the oldest lines of code in the repository;
— the files with the longest history;
— the favorite files of every author of the code;
— large files with a low number of authors;
— what weekdays are the best for writing code,
so it has less chance to be removed in the future;
— files sorted by code age;
— who has written the most code;
— who has written the most code, that has not been removed yet;
— code of what authors are rewritten the most frequently;
— find pairs of authors where one author is writing the code and another removing that code;
— median time for lines of code removal;
— what code was rewritten the most number of times;
— who likes to write comments or tests the most;
https://www.youtube.com/watch?v=y8HsXqQEyfE
https://presentations.clickhouse.com/matemarketing_2020/
Merge process visualization and optimization
Full Stack, Frontend
JavaScript SPA that loads data from system.part_log and represents the merging process as animated graphics (data parts as rectangles, area proportional to the size, tuneable play speed).
Calculate metrics of the efficiency of the merge algorithm. Implement a merge algorithm in JavaScript to model it and try to optimize the metrics.
Vector Search in ClickHouse
Booked
ML, Backend
Continuation of #35101
Descriptive analysis of datasets
Data Science, ML
Given an unknown dataset, try to present a few of the most descriptive and the most useful charts and statistics about this dataset with zero user input.
Integration with Tensorflow
Or another ML library.
Allow plugging in pre-trained models for inference in ClickHouse and evaluating them as SQL functions on numeric and text data.
Demonstrate at least a few interesting applications of the models.
The most straightforward path will be static linking of TensorFlow-lite.
Natural language to SQL queries
Booked, @FArthur-cmd
ML
Apply https://github.com/yandex-research/sparqling-queries to allow querying ClickHouse with natural language.
A function for cleaning up URLs
Easy
Functions for compatibility with MySQL dialect
Find functions available in MySQL but not present in ClickHouse, or present under a different name. Implement the missing functions or add compatibility aliases.
✔️ Integration with Apache Iceberg
Support for reading of external datasets in Apache Iceberg format, similar to #41054
Peephole optimizations for ClickHouse
Implement a bunch of peephole optimizations. Example:
- replace if(orgs.createdAt > toDateTime('2022-10-04 13:00:00'), true, false) IN (1) with orgs.createdAt > toDateTime('2022-10-04 13:00:00')
It will speed up some autogenerated queries by BI tools.
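The rewrite above could be sketched as a bottom-up pass over a toy expression tree (an illustrative Python sketch with a hypothetical tuple-based AST; the real implementation would work on ClickHouse's actual query AST):

```python
def peephole(node):
    """Bottom-up rewrite:  if(cond, true, false) IN (1)  ->  cond.
    Nodes are tuples like ('in', <expr>, <tuple>), leaves are
    ('lit', value) and ('col', name)."""
    if not isinstance(node, tuple):
        return node
    # Rewrite children first, so nested patterns are simplified too.
    node = tuple(peephole(child) for child in node)
    if (len(node) == 3 and node[0] == 'in'
            and isinstance(node[1], tuple) and node[1][0] == 'if'
            and node[1][2] == ('lit', True)
            and node[1][3] == ('lit', False)
            and node[2] == ('tuple', ('lit', 1))):
        return node[1][1]   # keep only the condition
    return node
```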
Cluster monitoring UI for ClickHouse
Frontend
A JavaScript SPA interacting with ClickHouse server implementing several views:
- a cluster overview: diagram of shards and replicas with some basic health information;
- an overview of resource consumption across the cluster;
- a process list of currently running queries with information about resource consumption;
- navigation in the query log with filtering, searching, and grouping;
- metrics overview of a single query or a group of queries from the query log or comparison of two;
- flame graph for a query or a group of queries;
Note: similar dashboard can be implemented in Grafana, but the specialized UI must have strictly better ergonomics and capabilities.
Automatic cluster configuration
Add arguments to clickhouse install to automatically configure a cluster with shards and replicas.
Example:
./clickhouse install --shard 1 --join 1.2.3.4
It will install ClickHouse with embedded Keeper, create the first replica or join a cluster, configure a Replicated database, replicated ACL, and replicated UDF storages.
As an extension of this task, ClickHouse can integrate with AWS API to automatically create a new VM, install itself, open the required ports and add it to the cluster.
API for query rewrite
Probabilistic filtering for subqueries
@Lloyd-Pottiger and, independently, @nadya002
The idea of this task is to provide a probabilistic alternative to the IN (subquery) operator using a bloom filter, a counting bloom filter (to check for elements that likely appeared multiple times), a cuckoo filter, a quotient filter, and a vacuum filter, and to compare all these algorithms.
The applications are cohort analysis and antifraud.
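The simplest of these structures, a plain bloom filter, could be sketched as (an illustrative Python sketch; class name and parameters are hypothetical): it never reports a false negative, so `x IN (subquery)` can be answered approximately with a tunable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives,
    a small tunable rate of false positives."""
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from one SHA-256 digest.
        h = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.n_hashes):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Counting, cuckoo, quotient, and vacuum filters trade space and lookup cost for extra abilities such as deletion and multiplicity estimates, which is exactly what the comparison in this task should quantify.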
Extensions for ORDER BY operator
- topological sorting: #34343
- natural version sorting;
Probabilistic data structures for approximate (range) filtering in ClickHouse queries.
Contact: @rschu1ze
Booked by: @nautaa
For example SuRF: Practical Range Query Filtering with Fast Succinct Tries (2018) and Proteus: A Self-Designing Range Filter (2022)
This intern/student shall implement SuRF, Proteus, or a related probabilistic range filter as a new index type in ClickHouse. Such filters are similar to Bloom filters but unlike Bloom filters, they support range predicates, i.e. <, >, <=, >= and BETWEEN (see the table at https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#functions-support).
Userspace Page Cache
Contact: @rschu1ze
Booked in a revised form by @NikitaEvs.
✔️ Entropy-learned Hashing
Contact: @rschu1ze
Try out Entropy-Learned Hashing: Constant Time Hashing with Controllable Uniformity (2022) in ClickHouse's hash aggregation.
EDIT: The initial idea was to make hashing in ClickHouse's aggregation operator less costly. It turned out that it is quite lightweight already. Entropy-based hashing could still be implemented as a SQL function in ClickHouse. However, the scope of this task would fit a medium-difficulty coursework project rather than a bachelor's thesis.
Booked by @Dmitry909
Timeseries Anomaly/Outlier Detection
Contact: @rschu1ze
Implement and evaluate one/multiple algorithms for time series anomaly detection in ClickHouse, see here.
Streaming Queries
@kate1mag @aaapetrenko and @nooblose
Secret task.
✔️ ClickHouse as a Python Module
Implemented here: https://github.com/chdb-io/chdb