- Have a Twitter developer account with access to v2 of the API; see: https://developer.twitter.com/en/docs/tutorials/stream-tweets-in-real-time.
- Get the bearer token from the Twitter developer portal and put it in a `.env` file at the repo root, e.g.:

  ```
  BEARER_TOKEN=XXXXXXXXXXX
  ```
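The `.env` file is just `KEY=VALUE` lines. As an illustration of the format (this is not the project's actual loader, just a minimal sketch), such a file could be parsed with:

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

env = parse_env("# token from the developer portal\nBEARER_TOKEN=XXXXXXXXXXX\n")
print(env["BEARER_TOKEN"])  # -> XXXXXXXXXXX
```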
- Create stream rules:
  - First build the docker image:

    ```
    docker build -t twitter_stream .
    ```

  - Then you can run some utilities for managing your stream rules:
    - List rules:

      ```
      docker run --rm twitter_stream --bearer-token XXXX list-rules
      ```

    - Create a rule:

      ```
      docker run --rm twitter_stream --bearer-token XXXX create-rule "your_rule_goes_here" -t some_tag
      ```

    - Delete a rule:

      ```
      docker run --rm -it twitter_stream --bearer-token XXXX delete-rule
      ```
- Rule example: `"python (machine OR deep) learning -is:retweet lang:en"`. This rule streams tweets that contain the words "python" and "learning" plus either "machine" or "deep", are not retweets, and are in English. To add this rule under a tag like "data-science" run:

  ```
  docker run --rm twitter_stream --bearer-token XXXX create-rule "python (machine OR deep) learning -is:retweet lang:en" -t data-science
  ```

- For more info on how to build rules check: https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule.
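To make the example rule's operators concrete (space-separated terms are ANDed, a parenthesized group matches either word, `-is:retweet` excludes retweets, `lang:en` filters by language), here is a toy, simplified matcher. The real matching happens server-side at Twitter; this is only an illustration of the semantics:

```python
def matches(tweet):
    """Toy approximation of: python (machine OR deep) learning -is:retweet lang:en"""
    words = tweet["text"].lower().split()
    return (
        "python" in words
        and ("machine" in words or "deep" in words)  # the (machine OR deep) group
        and "learning" in words
        and not tweet.get("is_retweet", False)       # -is:retweet
        and tweet.get("lang") == "en"                # lang:en
    )

print(matches({"text": "Deep learning in Python", "lang": "en"}))  # -> True
print(matches({"text": "python learning", "lang": "en"}))          # -> False (no machine/deep)
```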
Once all desired rules are set up, it is time to stream. The easiest way to get everything installed and running is with docker-compose.
Before running docker-compose, create a folder called `elasticsearch_data` at the repo root; otherwise the services will fail to start.
Running the docker-compose file starts 4 services:
- elasticsearch: An instance of Elasticsearch.
- zmq_publisher: Obtains data from twitter stream and publishes messages via ZeroMQ (source code in twitter_stream/examples/zmq_publisher.rs).
- zmq_elasticsearch: Receives messages from the zmq_publisher and sends them to the elasticsearch instance (source code in twitter_stream/examples/zmq_elasticsearch.rs).
- kibana: An instance of Kibana for easy data exploration.
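The publisher and indexer are decoupled through a message channel, so either side can be restarted or swapped independently. The sketch below illustrates that flow with a stdlib queue standing in for the ZeroMQ socket (purely illustrative; the actual services are the Rust binaries listed above):

```python
import json
import queue
import threading

# Stand-in for the ZeroMQ socket between zmq_publisher and zmq_elasticsearch.
messages = queue.Queue()

def publisher(tweets):
    """Mimics zmq_publisher: serializes each tweet and publishes it."""
    for tweet in tweets:
        messages.put(json.dumps(tweet))
    messages.put(None)  # sentinel: stream closed

def consumer(sink):
    """Mimics zmq_elasticsearch: receives messages and 'indexes' them."""
    while (msg := messages.get()) is not None:
        sink.append(json.loads(msg))

indexed = []
t = threading.Thread(target=consumer, args=(indexed,))
t.start()
publisher([{"id": "1", "text": "python machine learning"}])
t.join()
print(indexed[0]["text"])  # -> python machine learning
```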
(If you don't want to use Docker, you will need Rust and Cargo installed; compile the binaries in the twitter_stream folder by running `make all`.)
To see the state of the stream run:

```
docker logs twitter_utils_zmq_elasticsearch_1 -f
```
To access Kibana go to http://localhost:5601 (the first time it may take a couple of minutes to start). Then you need to add the Elasticsearch index in Kibana:
- Go to the "Discover" link in the side bar.
- Click on "Create index pattern".
- Type the index pattern name: "tweets*".
- Select the time field and click on "Create index pattern".
After this you should be able to use Kibana tools to explore the data (check "Visualize Library" on the side bar).
To dump Elasticsearch index data to a file you can use elasticsearch-dump (install with `npm install elasticdump -g`):

```
elasticdump --input=http://localhost:9200/tweets --output=data.jsonl --type=data --limit=5000
```

Note that this tool produces a line-delimited JSON file.
One problem with this file is that it includes Elasticsearch metadata.
To extract only the tweet data (the "_source" field), we can use jq (on Ubuntu you can install it with `sudo apt-get install jq`):
```
jq -c '._source' data.jsonl > final_data.jsonl
# or if you want a regular json file:
jq -s '[.[]._source]' data.jsonl > final_data.json
```
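If jq is not available, the same extraction can be done with a few lines of Python's standard library. The `_source` key is real Elasticsearch metadata; the record below is a fabricated example of the shape elasticdump emits:

```python
import json

def extract_sources(lines):
    """Pull the '_source' field out of each elasticdump JSONL record."""
    return [json.loads(line)["_source"] for line in lines if line.strip()]

# One fabricated record in elasticdump's shape:
dump_line = json.dumps({"_index": "tweets", "_id": "1", "_source": {"text": "hello"}})
print(extract_sources([dump_line]))  # -> [{'text': 'hello'}]
```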


