A few weeks ago, while browsing the Twitter page, I started wondering what it would take to build a Twitter-like event feed on my own. Naively, it looks as if the tweets are simply stored in a few big database tables (tweets, users) and then distributed to the users of the website and the mobile clients.
So I started working on the website and envisioned a few independent modules (building blocks) for the system:
– a Spring Boot standalone application dedicated to importing real tweets from Twitter, via a Twitter stream, into the Postgres tables
– a Spring Boot REST service application used to expose the tweets that were imported into the database
– a Spring MVC web application which displays the tweets with infinite scrolling (as the Twitter page does)
While building the website I thought it would be interesting to compute some statistics on top of the imported tweets, so I decided to add a view of the most popular words gathered in the last 5 minutes. For implementing this feature I used the Redis sorted set functionality to structure the word statistics, and the d3.js pack layout to express them easily in a visual fashion.
Below is the current outcome of the functionality exposed by the website.
The source code for the site is available on GitHub:
https://github.com/mariusneo/twitter-feed
In the rest of this post I'll describe some of the things I find relevant to think about when building such functionality:
Tweets importer application
This application depends heavily on the twitter4j library: it simply opens a Twitter stream on a set of given topics (in this case, cities from the German-speaking countries) and imports each incoming tweet into the Postgres database:
// ...
StatusListener listener = new StatusListener() {
    public void onStatus(Status status) {
        tweetImporterService.importTweet(status);
    }
    // ....
};
TwitterStream twitterStream = new TwitterStreamFactory().getInstance();
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
twitterStream.setOAuthAccessToken(new AccessToken(accessToken, accessTokenSecret));
twitterStream.addListener(listener);
FilterQuery capitalsQuery = new FilterQuery();
capitalsQuery.track(new String[]{"Wien", "wien", "Berlin", "berlin", "Bern", "bern", "Vaduz", "vaduz"});
twitterStream.filter(capitalsQuery);
One important thing to note here is that the import of a tweet into the database should happen asynchronously, so that the incoming tweets can be processed in parallel:
@Async
@Transactional
public void importTweet(Status status) {
    // ...
}
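One detail worth mentioning: @Async only takes effect if asynchronous processing is switched on in the Spring configuration. A minimal sketch of such a configuration (the class name and pool sizes below are my own illustrative choices, not taken from the project):

```java
import java.util.concurrent.Executor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Without @EnableAsync the @Async annotation is silently ignored and the
// import would run on the caller's (stream listener's) thread.
@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);   // illustrative values; tune them for
        executor.setMaxPoolSize(8);    // the expected tweet throughput
        executor.setThreadNamePrefix("tweet-import-");
        executor.initialize();
        return executor;
    }
}
```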
Service application
Spring Boot and Jersey make it very easy nowadays to expose REST web services in Java. One of the problems that arises when dealing with feeds of data is pagination.
The classic listing of items in a table across multiple pages, typical for business web applications, can't be used effectively for Twitter, because new tweets are constantly added at the top of the event feed. If the user navigated from the first page to the second and then, a few seconds later, back to the first page, they would be confused because the content of the first page had changed in the meantime (new tweets arrived). For this reason, an infinite feed fits much better for displaying the flow of data.
How such an API should be built is described, with nice examples, by the Twitter engineers on this page:
https://dev.twitter.com/rest/public/timelines
In the case of my service application, I've implemented service methods that use the following parameters:
– since_id: for retrieving the tweets that occurred after the tweet with this id
– max_id: for retrieving the tweets that were entered before the tweet with this id
See the implementation of the service here :
In order to retrieve content efficiently, with very little duplication in the feed, it would also make sense to add a method that combines the since_id and max_id parameters.
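To illustrate the semantics of the two parameters, here is a self-contained sketch of cursoring over a list of tweet ids (the class and method names are hypothetical, not from the project; ids are assumed to grow over time, as Twitter's snowflake ids do):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of since_id / max_id cursoring over tweet ids.
public class FeedCursor {

    // Newer tweets: everything with id > sinceId, newest first, capped at limit.
    static List<Long> since(List<Long> idsNewestFirst, long sinceId, int limit) {
        return idsNewestFirst.stream()
                .filter(id -> id > sinceId)
                .limit(limit)
                .collect(Collectors.toList());
    }

    // Older tweets: everything with id < maxId, newest first, capped at limit.
    static List<Long> before(List<Long> idsNewestFirst, long maxId, int limit) {
        return idsNewestFirst.stream()
                .filter(id -> id < maxId)
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> feed = List.of(9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L);
        System.out.println(since(feed, 6L, 10));  // tweets newer than id 6
        System.out.println(before(feed, 4L, 2));  // two tweets older than id 4
    }
}
```

A combined method would simply apply both filters, giving the client a bounded window of the feed.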
Web application
The web application is a classic MVC application with a separation between the view (I've used Thymeleaf) and the controller.
Most of the logic for displaying new tweets or retrieving previous tweets from the flow is implemented in JavaScript, with jQuery routines and AJAX JSONP calls made directly to the service application.
Relevant source code :
https://github.com/mariusneo/twitter-feed/blob/master/web/src/main/resources/static/js/feed.js
One TODO still on my list is the consistent retrieval of tweets when more tweets arrive than the defined bucket limit allows to be returned at once.
The solution to this issue is covered in the Twitter timelines documentation linked above.
Word statistics functionality
Every time a new tweet gets imported into the database, a database trigger creates an entry in a separate table, marking the tweet's words for processing:
CREATE OR REPLACE FUNCTION insert_tweet_into_count_words_tweets() RETURNS trigger AS $BODY$
BEGIN
    INSERT INTO count_words_tweets(tweet_id, created_at, updated_at, status)
    VALUES (NEW.id, now(), now(), 'INITIAL');
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;

-- the function has to be attached to the tweet table via a trigger
-- (the table name "tweets" is assumed here)
CREATE TRIGGER tweets_count_words_trigger
    AFTER INSERT ON tweets
    FOR EACH ROW EXECUTE PROCEDURE insert_tweet_into_count_words_tweets();
Each tweet is then processed by a Quartz job that runs every second: it splits the tweet into words and pushes the words into the Redis sorted set of the corresponding minute as well as into the Redis sorted set containing the total word statistic.
The site currently displays the words from the last 5 minutes, so I decided to keep 6 sorted-set buckets and to fill them in a circular fashion.
Another Quartz job kicks in at the beginning of each minute: it subtracts the content of the obsolete sorted-set bucket from the totals and then clears that bucket. During the next minute, the bucket that was just cleared is filled with the words of the tweets belonging to the current minute. Executed in this circular fashion, these operations ensure that the words.totals sorted set always contains the sorted statistic of only the words introduced in the last five minutes.
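The rotation scheme can be sketched in plain Java, independently of Redis. In the real application the buckets and the totals are Redis sorted sets and the updates are ZINCRBY calls; the class and method names below are mine, for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the circular bucket rotation described above.
public class WordBuckets {
    static final int BUCKET_COUNT = 6; // 5 "live" minutes + 1 bucket being retired
    final List<Map<String, Double>> buckets = new ArrayList<>();
    final Map<String, Double> totals = new HashMap<>();

    WordBuckets() {
        for (int i = 0; i < BUCKET_COUNT; i++) buckets.add(new HashMap<>());
    }

    // Count a word in the bucket of the given minute and in the totals
    // (the Redis equivalent would be two ZINCRBY calls).
    void addWord(int minute, String word) {
        buckets.get(minute % BUCKET_COUNT).merge(word, 1.0, Double::sum);
        totals.merge(word, 1.0, Double::sum);
    }

    // At the start of each minute, retire the obsolete bucket: subtract its
    // counts from the totals and clear it so it can collect the new minute.
    void rotate(int newMinute) {
        Map<String, Double> obsolete = buckets.get(newMinute % BUCKET_COUNT);
        obsolete.forEach((word, count) -> {
            if (totals.merge(word, -count, Double::sum) <= 0) {
                totals.remove(word);
            }
        });
        obsolete.clear();
    }

    public static void main(String[] args) {
        WordBuckets wb = new WordBuckets();
        wb.addWord(0, "wien");
        wb.addWord(1, "berlin");
        wb.rotate(6); // minute 6 reuses bucket 0, so "wien" drops out of the totals
        System.out.println(wb.totals);
    }
}
```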
Note in the image above that the words in the sorted sets are ordered by their scores (in this case, the number of times they were used within the tweets created during a given minute).
The implementation details for this functionality can be found here :
https://github.com/mariusneo/twitter-feed/tree/master/jobs/count-words-job
Interaction with Redis is done through the JedisPool resources offered by the jedis library:
@Configuration
public class JedisConfig {
    @Value("${redis.datasource.url}")
    private String redisUrl;

    @Bean
    public JedisPool jedisPool() {
        return new JedisPool(redisUrl);
    }
}
// usage of the Jedis resources
Jedis jedis = jedisPool.getResource();
try {
    // ... jedis stuff
} finally {
    // it's important to return the Jedis instance to the pool once you've finished using it
    jedisPool.returnResource(jedis);
}
Getting the statistic of the 50 most popular words at any time, so that they can be displayed, is easily achieved in the following fashion:
Jedis jedis = jedisPool.getResource();
try {
    // the range indices are inclusive, so the top 50 entries are 0..49
    Set<Tuple> range = jedis.zrevrangeWithScores("words.totals", 0, 49);
    Map<String, Integer> result = new LinkedHashMap<>();
    range.forEach(t -> result.put(t.getElement(), (int) t.getScore()));
    return result;
} finally {
    // it's important to return the Jedis instance to the pool once you've finished using it
    jedisPool.returnResource(jedis);
}
Although far from fully functional, this application can serve as a tutorial for building a website of greater complexity than the ones we usually come across in introductory technology tutorials.