Hacker News Dataset #29693
- Download data from the official API:
https://github.com/HackerNews/API
seq 0 2990 | xargs -P100 -I{} bash -c '
BEGIN=$(({} * 10000));
END=$((({} + 1) * 10000 - 1));
echo $BEGIN $END;
curl -sS --retry 100 "https://hacker-news.firebaseio.com/v0/item/[${BEGIN}-${END}].json" | pv > "hn{}.json"'
The download takes about a day. The resulting files total 12.8 GB.
Alternatively, you can download prepared files from http://files.pushshift.io/hackernews/
However, this source is abandoned and no longer updated.
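To make the batching in the download command easier to follow, here is a minimal sketch of the range arithmetic it uses: each batch covers 10000 consecutive item ids. The function name `batch_range` is hypothetical and only for illustration; it performs no network requests.

```shell
# Hypothetical helper: print the first and last item id of batch N,
# using the same arithmetic as the download command (10000 ids per batch).
batch_range() {
  local n=$1
  local begin=$((n * 10000))
  local end=$(((n + 1) * 10000 - 1))
  echo "$begin $end"
}

batch_range 0      # prints: 0 9999
batch_range 2990   # prints: 29900000 29909999
```

With `seq 0 2990`, the last batch therefore ends at id 29909999, so the upper bound needs to be raised as new items are published.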
- Clean up the download:
for i in *.json; do echo "$i"; sed 's/{/\n{/g' "$i" | grep -v -P '^null$' > "${i}.tmp" && mv "${i}.tmp" "${i}"; done
find . -size 40000c | xargs rm
grep -l -o -F '}null' *.json | xargs sed -i -r 's/}(null)+/}/g'
- Create table:
CREATE TABLE hackernews
(
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = MergeTree ORDER BY id
- Insert data:
clickhouse-client --query "INSERT INTO hackernews FROM INFILE '*.json' FORMAT JSONEachRow" --progress
24 seconds, 1 202 257 rows/sec.
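As a sanity check, the number of objects in the cleaned files can be compared with the number of rows in the table. The sketch below counts lines that start with `{`; the function name `count_items` is hypothetical, and it assumes the cleanup step has already left one object per line.

```shell
# Hypothetical helper: count JSON objects (lines starting with "{")
# across all .json files in the given directory.
count_items() {
  cat "$1"/*.json | grep -c '^{'
}

# Usage, run from the directory that holds the cleaned files:
#   count_items .
# Compare the result with the table's row count:
#   clickhouse-client --query "SELECT count() FROM hackernews"
```

If the two numbers differ, some lines were probably skipped or split incorrectly during cleanup.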