Hacker News Dataset #29693

@alexey-milovidov

  1. Download data from the official API:

https://github.com/HackerNews/API

seq 0 2990 | xargs -P100 -I{} bash -c '
    BEGIN=$(({} * 10000));
    END=$((({} + 1) * 10000 - 1));
    echo $BEGIN $END;
    curl -sS --retry 100 "https://hacker-news.firebaseio.com/v0/item/[${BEGIN}-${END}].json" | pv > "hn{}.json"'

The download takes about a day. The resulting files total 12.8 GB.

As an alternative, you can download prepared files from http://files.pushshift.io/hackernews/
However, this source is abandoned and is no longer updated.
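To make the batching in the download command concrete, here is a small sketch of how a batch number maps to an item ID range and a request URL. The helper names `batch_range` and `batch_url` are hypothetical and not part of the original command; the batch size of 10000 and the URL pattern are taken from it. The bracketed range in the URL relies on curl's URL globbing, which expands it into one request per ID.

```shell
# Hypothetical helpers (names are illustrative) mirroring the arithmetic
# used in the download command.
BATCH_SIZE=10000

# Print "BEGIN END" for batch number $1.
batch_range() {
    echo "$(( $1 * BATCH_SIZE )) $(( ($1 + 1) * BATCH_SIZE - 1 ))"
}

# Print the globbed URL that curl would expand for batch number $1.
batch_url() {
    range=$(batch_range "$1")
    echo "https://hacker-news.firebaseio.com/v0/item/[${range% *}-${range#* }].json"
}

batch_range 0      # first batch
batch_range 2990   # last batch produced by "seq 0 2990"
batch_url 1
```

With `seq 0 2990`, the last batch ends at ID 29909999, so the upper bound in `seq` needs to grow as new items are posted.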

  2. Clean up the download:
for i in *.json; do echo $i; sed 's/{/\n{/g' $i | grep -v -P '^null$' > ${i}.tmp && mv ${i}.tmp ${i}; done
find . -size 40000c | xargs rm
grep -l -o -F '}null' *.json | xargs sed -i -r 's/}(null)+/}/g'
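The effect of the cleanup can be checked on a tiny sample before running it on the real files. The sketch below applies the same `sed` and `grep` expressions to standard input; the helper name `clean_items` and the sample records are made up for illustration, and GNU `sed` and a `grep` with `-P` support are assumed.

```shell
# Illustrative helper (hypothetical name): the same sed/grep expressions as in
# the cleanup step, applied to stdin instead of editing files in place.
clean_items() {
    sed 's/{/\n{/g' | grep -v -P '^null$' | sed -r 's/}(null)+/}/g'
}

# Made-up sample imitating concatenated API responses, where missing items
# are returned as "null" with no separator between responses.
printf '%s' 'null{"id":1,"type":"story"}{"id":2,"type":"comment"}nullnull' | clean_items
```

Each surviving object ends up on its own line with the stray `null` tokens removed, which is the shape that `JSONEachRow` expects.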
  3. Create table:
CREATE TABLE hackernews
(
    id UInt32,
    deleted UInt8,
    type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    by LowCardinality(String),
    time DateTime,
    text String,
    dead UInt8,
    parent UInt32,
    poll UInt32,
    kids Array(UInt32),
    url String,
    score Int32,
    title String,
    parts Array(UInt32),
    descendants Int32
)
ENGINE = MergeTree ORDER BY id
  4. Insert data:
clickhouse-client --query "INSERT INTO hackernews FROM INFILE '*.json' FORMAT JSONEachRow" --progress

The insert takes 24 seconds at 1 202 257 rows/sec.
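As a rough sanity check, the reported time and throughput can be multiplied to estimate the number of rows inserted and compared with the number of item IDs requested in the download step. This is only an estimate, since the 24-second figure is rounded; the variable names are illustrative.

```shell
# Rough estimate only: the elapsed time quoted above is rounded.
seconds=24
rows_per_sec=1202257
estimated_rows=$(( seconds * rows_per_sec ))

# IDs requested by "seq 0 2990" with 10000 IDs per batch.
requested_ids=$(( (2990 + 1) * 10000 ))

echo "estimated rows inserted: $estimated_rows"
echo "item IDs requested:      $requested_ids"
```

The estimate comes out somewhat below the number of requested IDs, which is consistent with some IDs returning `null` and being dropped during cleanup.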

  5. The data is available in Playground: https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUIHRvWWVhcih0aW1lKSBBUyBkLCBjb3VudCgpIEFTIGMsIGJhcihjLCAwLCAxMDAwMDAwMCwgMTAwKSBGUk9NIGhhY2tlcm5ld3MgR1JPVVAgQlkgZCBPUkRFUiBCWSBk
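The part of the Playground link after `#` appears to be the query text encoded as Base64, so it can be decoded to see which query the link runs. A minimal sketch, assuming a `base64` utility that accepts `-d` (as in GNU coreutils); the variable names are illustrative.

```shell
# Fragment copied from the Playground URL above (everything after '#').
fragment='U0VMRUNUIHRvWWVhcih0aW1lKSBBUyBkLCBjb3VudCgpIEFTIGMsIGJhcihjLCAwLCAxMDAwMDAwMCwgMTAwKSBGUk9NIGhhY2tlcm5ld3MgR1JPVVAgQlkgZCBPUkRFUiBCWSBk'

# Decode it back to the SQL text.
query=$(printf '%s' "$fragment" | base64 -d)
echo "$query"
```

The decoded query groups the `hackernews` table by year of `time` and draws a bar per year with the `bar` function. The same text can also be run locally with `clickhouse-client --query` against the table created above.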
