Our building blocks:
- collectd: "Systems statistics collection daemon"
- fluentbit: "A super fast, lightweight, and highly scalable logging and metrics processor and forwarder. It is the preferred choice for cloud and containerized environments"
- Data Taps: a managed, hyperscale HTTP URL for posting newline-delimited JSON at any scale, with SQL transformation and landing on S3 in an optimal format.
Data Taps makes a perfect endpoint for collecting logs, metrics, events, etc. efficiently and at scale to S3. It runs as a single AWS Lambda of the smallest size, so you don't have to worry about clusters or costs. Purely from a data storage cost perspective, Data Taps is at least 50-80x more cost efficient than Elasticsearch with EBS volumes (assuming EBS is 100% utilised, which is never the case in a healthy system; utilisation is rather closer to 50%). Data Taps brings your data to S3 in the de facto standard compressed Parquet format, where you want it to land in the end anyway.
Install the collectd daemon. Its collectd.conf makes it the input source for the Fluent Bit client below.
brew install collectd
Install FluentBit. The fluent-bit.yaml configuration file includes the collectd input and the Data Tap HTTP(S) output with the x-bd-authorization token.
brew install fluent-bit
A Data Tap is a single AWS Lambda function with a Function URL and a customized C++ runtime embedding DuckDB. It uses a streaming SQL clause to upload the newline-delimited JSON data that was HTTP POSTed and buffered in the Lambda to S3, hive partitioned and as ZSTD compressed Parquet. You can tune the SQL clause yourself for filtering, search, and aggregations. You can also set the thresholds for when the upload to S3 happens. A Data Tap already runs very efficiently on the smallest arm64 AWS Lambda, making it the simplest, fastest, and most cost efficient solution for streaming data onto S3 at scale. You can run it on your own AWS account or have it hosted by Boiling Cloud.
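To make the ingestion format concrete, here is a minimal sketch of posting newline-delimited JSON to a Tap by hand with curl, without Fluent Bit in between. It assumes `TAP_URL` and `TAP_TOKEN` are exported as described in the setup steps. The sample records and the helper names `make_ndjson` and `post_to_tap` are made up for illustration, and sending the bare token as the value of the `x-bd-authorization` header is an assumption based on the Fluent Bit output configuration mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit three sample records, one JSON object per line.
# printf reuses the format string for each pair of arguments.
make_ndjson() {
  printf '{"host":"demo-host","metric":"%s","value":%s}\n' \
    cpu_idle 97.5 \
    cpu_user 1.8 \
    cpu_system 0.7
}

# Hypothetical helper: POST stdin to the Tap, or skip when not configured.
post_to_tap() {
  # "jq -r" prints the literal string "null" when the requested field is
  # missing, so treat that value as an unset token as well.
  if [ -z "${TAP_URL:-}" ] || [ -z "${TAP_TOKEN:-}" ] || [ "${TAP_TOKEN}" = "null" ]; then
    echo "TAP_URL or TAP_TOKEN not set, skipping POST" >&2
    cat >/dev/null   # drain stdin so the producer finishes cleanly
    return 0
  fi
  # --data-binary keeps the newlines intact; plain -d strips them when
  # reading from a file or stdin, which would break newline-delimited JSON.
  # Assumption: the header value is the bare token.
  curl --silent --show-error --fail \
    -X POST \
    -H "x-bd-authorization: ${TAP_TOKEN}" \
    --data-binary @- \
    "${TAP_URL}"
}

make_ndjson | post_to_tap
```

Run without the two variables set, the script only prints a notice, which makes it safe to try before the Tap is deployed.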
You need a BoilingData account, which you use to create a Data Tap. The account is used to fetch authorization tokens that allow you to send data to a Data Tap (security access control). You can also share write access with other BoilingData users if you like (see the AUTHORIZED_USERS AWS Lambda environment variable), effectively creating Data Mesh architectures.
- You need a Data Tap on your AWS account. You can follow these instructions: https://github.com/boilingdata/data-taps-template/tree/main/aws_sam_template
- Export a fresh Tap token as the TAP_TOKEN environment variable and the Tap ingestion URL endpoint as the TAP_URL environment variable by using bdcli (see the previous step).
# 1. You will get the TAP URL from the Tap deployment you did in the first step
export TAP_URL='https://...'
# 2a. If you send to your own Data Tap (the sharing user is the same as your BoilingData username)
export TAP_TOKEN=$(bdcli account tap-client-token --disable-spinner | jq -r .bdTapToken)
# 2b. If you send to somebody else's Data Tap, replace "boilingSharingUsername"
export TAP_TOKEN=$(bdcli account tap-client-token --sharing-user boilingSharingUsername --disable-spinner | jq -r .bdTapToken)
Start collectd. It requires root privileges to collect CPU statistics.
# 1. start collectd
cp collectd.conf /opt/homebrew/etc/collectd.conf
sudo /opt/homebrew/opt/collectd/sbin/collectd -f -C /opt/homebrew/etc/collectd.conf
# 2. start fluent-bit, which reads the collectd statistics and sends them to the Data Tap
./setup-config.sh # generates fluent-bit.conf
/opt/homebrew/bin/fluent-bit -c fluent-bit.conf
You can check the uploaded Parquet files in your S3 bucket, download them to your laptop, and get a glimpse into them with e.g. DuckDB.
aws s3 sync s3://YOURBUCKET/datataps/ d/
duckdb -s "SELECT COUNT(*) FROM parquet_scan('./d/**/*.parquet');"
Alternatively, you can run the analytics on the cloud side with BoilingData. For example, a one-off SQL query with bdcli:
bdcli api query -s "SELECT COUNT(*) FROM parquet_scan('s3://YOURBUCKET/datataps/');"
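For the local route, the check can be wrapped in a small helper that first confirms the sync actually produced Parquet files before handing them to DuckDB. This is only a sketch: the function names `count_parquet_files` and `inspect_taps_dir` are hypothetical, the default directory `d` is the sync target used above, and the DuckDB query runs only when the `duckdb` CLI is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count Parquet files under a directory (default: d).
count_parquet_files() {
  find "${1:-d}" -type f -name '*.parquet' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: report the file count and, if possible, the row count.
inspect_taps_dir() {
  local dir="${1:-d}"
  local n
  n=$(count_parquet_files "$dir")
  echo "Found ${n} Parquet file(s) under ${dir}"
  # Only query when there is something to read and duckdb is installed.
  if [ "$n" -gt 0 ] && command -v duckdb >/dev/null 2>&1; then
    duckdb -s "SELECT COUNT(*) AS row_count FROM parquet_scan('${dir}/**/*.parquet');"
  fi
}

inspect_taps_dir d
```

A count of zero right after starting Fluent Bit may simply mean the Tap has not yet reached its upload thresholds.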