boilingdata/data-taps-fluentbit-example

FluentBit | Web Analytics | PostgreSQL CDC | REST API | OpenSearch/ES | AWS Lambda Telemetry

FluentBit --> Data Tap --> S3 Parquet

Our building blocks:

  • collectd: "Systems statistics collection daemon"
  • fluentbit: "A super fast, lightweight, and highly scalable logging and metrics processor and forwarder. It is the preferred choice for cloud and containerized environments"
  • Data Taps: Managed, hyper-scale HTTP endpoint for posting newline-delimited JSON at any scale, with SQL transformation and landing on S3 in an optimal format.

Data Taps makes a perfect endpoint for collecting logs, metrics, events, etc. efficiently and at scale to S3 -- a single smallest-size AWS Lambda, so you don't have to worry about clusters or costs. Purely from a data storage cost perspective, Data Taps is at least 50-80x more cost-efficient than Elasticsearch with EBS volumes (assuming EBS is 100% utilised, which is never the case in a healthy system; closer to 50% is typical). Data Taps lands your data on S3 in compressed Parquet, the de facto format where you want it to end up anyway.

Building Blocks

1. collectd

The collectd daemon, configured with collectd.conf, serves as the input source for the Fluent Bit client below.

brew install collectd
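
For reference, here is a minimal collectd.conf sketch that loads a couple of statistics plugins and forwards metrics over the collectd network protocol to the Fluent Bit collectd input (the repository's collectd.conf is the authoritative version):

# Collect CPU and memory statistics
LoadPlugin cpu
LoadPlugin memory
# Forward them over the collectd binary network protocol
LoadPlugin network

<Plugin network>
  # Fluent Bit's collectd input listens on UDP 25826 by default
  Server "127.0.0.1" "25826"
</Plugin>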

2. FluentBit

Install FluentBit. The fluent-bit.yaml configuration file includes the collectd input and the Data Tap HTTP(S) output with an x-bd-authorization token header.

brew install fluent-bit
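
setup-config.sh (used below) renders the actual configuration; as a sketch of its shape in Fluent Bit's classic syntax, with the host, TypesDB path, and token as placeholders you would fill in:

# Sketch only -- the repository's fluent-bit.yaml is authoritative.
[INPUT]
    Name     collectd
    Listen   127.0.0.1
    Port     25826
    TypesDB  /opt/homebrew/opt/collectd/share/collectd/types.db

[OUTPUT]
    Name     http
    Match    *
    Host     YOUR_TAP_ID.lambda-url.YOUR_REGION.on.aws
    Port     443
    URI      /
    Format   json_lines
    Tls      On
    Header   x-bd-authorization YOUR_TAP_TOKEN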

3. Data Tap

A Data Tap is a single AWS Lambda function with a Function URL and a customized C++ runtime embedding DuckDB. It uses a streaming SQL clause to upload the HTTP-POSTed newline JSON data buffered in the Lambda to S3, hive-partitioned and as ZSTD-compressed Parquet. You can tune the SQL clause yourself for filtering, search, and aggregations. You can also set the thresholds for when the upload to S3 happens. A Data Tap already runs very efficiently on the smallest arm64 AWS Lambda, making it the simplest, fastest, and most cost-efficient solution for streaming data onto S3 at scale. You can run it in your own AWS Account or hosted by Boiling Cloud.
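
The exact SQL interface is defined by your Tap deployment (see the data-taps-template repository); purely as an illustration of the kind of streaming aggregation you could run over the buffered rows, in DuckDB dialect with a hypothetical relation name:

-- Illustrative only: "source" stands in for the buffered newline-JSON
-- rows, and the field names assume collectd records from Fluent Bit.
SELECT host, plugin, AVG(value) AS avg_value
FROM source
GROUP BY host, plugin;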

You need a BoilingData account and use it to create a Data Tap. The account is used to fetch authorization tokens that allow you to send data to a Data Tap (security access control). You can also share write access with other BoilingData users if you like (see the AUTHORIZED_USERS AWS Lambda environment variable), effectively creating Data Mesh architectures.
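
For example, write access could be shared by updating the Lambda environment with the standard AWS CLI; the function name and the value format (assumed here to be a comma-separated list of BoilingData usernames) are placeholders, not confirmed by this repository:

# Hypothetical sketch; note that update-function-configuration replaces
# the whole environment, so include any existing variables too.
aws lambda update-function-configuration \
  --function-name yourDataTapFunction \
  --environment "Variables={AUTHORIZED_USERS=friendUsername}"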

Prerequisites

  1. You need a Data Tap on your AWS Account. You can follow these instructions: https://github.com/boilingdata/data-taps-template/tree/main/aws_sam_template

  2. Using bdcli (see the previous step), export a fresh Tap token as the TAP_TOKEN environment variable and the Tap ingestion URL endpoint as the TAP_URL environment variable.

# 1. You will get the TAP URL from the Tap deployment you did in the first step
export TAP_URL='https://...'
# 2a. If you send to your own Data Tap (the sharing user is the same as your BoilingData username)
export TAP_TOKEN=`bdcli account tap-client-token --disable-spinner | jq -r .bdTapToken`
# 2b. If you send to somebody else's Data Tap, replace "boilingSharingUsername"
export TAP_TOKEN=`bdcli account tap-client-token --sharing-user boilingSharingUsername --disable-spinner | jq -r .bdTapToken`
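
Optionally, verify the URL and token before wiring up Fluent Bit. A minimal smoke test, assuming your Tap accepts a plain newline-JSON POST (the payload is an arbitrary example record):

curl -X POST "$TAP_URL" \
  -H "x-bd-authorization: $TAP_TOKEN" \
  -H "content-type: application/x-ndjson" \
  --data '{"hello":"world"}'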

Start Collecting Statistics

Start collectd. It requires root privileges to collect CPU statistics.

# 1. start collectd
cp collectd.conf /opt/homebrew/etc/collectd.conf
sudo /opt/homebrew/opt/collectd/sbin/collectd -f -C /opt/homebrew/etc/collectd.conf
# 2. start fluent-bit that gets the collectd statistics and sends to Data Tap
./setup-config.sh # sets up fluent-bit.conf
/opt/homebrew/bin/fluent-bit -c fluent-bit.conf

Checking Data

You can check the uploaded Parquet files in your S3 bucket, download them to your local laptop, and get a glimpse of them with e.g. DuckDB.

aws s3 sync s3://YOURBUCKET/datataps/ d/
duckdb -s "SELECT COUNT(*) FROM parquet_scan('./d/**/*.parquet');"
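
To see which columns landed before running aggregations, DuckDB's DESCRIBE works over the same glob:

duckdb -s "DESCRIBE SELECT * FROM parquet_scan('./d/**/*.parquet');"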

Alternatively, you can run the analytics on the cloud side with BoilingData. For example, a one-off SQL query with bdcli:

bdcli api query -s "SELECT COUNT(*) FROM parquet_scan('s3://YOURBUCKET/datataps/');"

About

Run collectd and FluentBit to send logs/metrics to Data Taps at scale.
