A stroll through synsuck,
the part of the LiveJournal source code which handles the retrieval
of RSS feeds, conversion to journal entries, and storage of those
entries in the LiveJournal database.
Understanding this document is not required of Syndication
volunteers. I expect that most volunteers will never need it,
though some may find the overview interesting. It's a reference I
wrote up to help me diagnose one problem a long time ago, and I
figured I might as well make it public.
(That said, I encourage anyone to comment with questions, requests
for clarifications, and so forth, even if you think your question
is silly.)
The current version of synsuck.pl can be obtained by
clicking on the highest version number in its entry
in the CVS repository browser:
http://cvs.livejournal.org/browse.cgi/livejournal/bin/maint/synsuck.pl
Walkthrough of bin/maint/synsuck.pl v1.25
Last modified: 2003-08-12 02:25 UTC
Overview
For each syndicated account:
- request RSS file
- if response is larger than 150KB, skip the feed and try again in 60 minutes
- check for misidentified character set and repair
- parse RSS
- take most recent 20 items of feed in reverse order
- delete any entries older than two weeks (max friends page time)
  from LJ database
- check to see if feed uses different <link>s in each entry;
  if so, pull up the last set of <link>s from the database for
  later comparison
- For each item in a feed:
  - Check to see if this exact item (all fields) has been
    seen before; if so, skip to next item
  - Store fields from XML into fields of an LJ entry
  - Check if the <link> already exists in the database
  - If so, perform 'editevent' to edit entry; otherwise,
    perform 'postevent' to post new entry
- Update account name, website, bio with global title, URL,
  description from RSS feed
- Decide when to poll the feed next
- Update count of number of users reading this syndicated account
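
The size check in the overview can be sketched roughly as follows. This is an illustration in Python, not the Perl of synsuck.pl: the function name read_capped and the idea of feeding it an iterable of chunks are my own assumptions. Only the limit (150*1024 bytes, per the detail section) and the idea of checking the size while the data is still arriving come from the script.

```python
from typing import Iterable, Optional

MAX_SIZE = 150 * 1024  # size cap described in the walkthrough


def read_capped(chunks: Iterable[bytes], max_size: int = MAX_SIZE) -> Optional[bytes]:
    """Accumulate chunks; return None as soon as the total exceeds max_size."""
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > max_size:
            # Too big: discard what we have and stop reading.
            return None
    return bytes(received)
```

A caller would treat a None result the way the script treats its too-big flag: reschedule the feed and move on.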
Detail
- Lines 4-9:
Loads required Perl libraries, declares global variables.
- 11:
The synsuck subroutine is entered into the hash which contains all
of LiveJournal's maintenance tasks. This subroutine and the others
in that hash are called from bin/ljmaint.pl, which is launched as a
periodic job via cron.
- 20:
We retrieve all syndicated accounts whose statusvis is 'V' (visible)
and whose 'checknext' field is in the past (i.e., where checknext has
come up since the last run).
- 25-end (312):
This is the main loop of synsuck.pl, which executes once for every
syndicated account returned from the previous query (l. 20).
- 27-33:
Create an anonymous subroutine which, for the syndicated user
being looked at, sets the 'lastcheck' field in the 'syndicated'
table to the current time, sets the 'checknext' field to the
current time plus N minutes (N is supplied as an argument), and
records the status, also supplied as an argument. Note that
this subroutine is not called yet, only created, and a reference
to it is stored (in $delay) for later use. Below, this
subroutine will be referred to as the scheduler.
- 37-47:
Requests the feed URL with libwww-perl, sending conditional HTTP
headers (If-Modified-Since and If-None-Match) so that the document
is returned only if it has been modified since the last request or
its entity tag has changed. (Per HTTP 1.1, the server should answer
304 Not Modified only when neither has changed, which increases our
chances of getting new data from a recalcitrant caching proxy along
the way.)
We check the size of the response as it arrives, via a subroutine that
is called on the incoming data. If the response is larger than
150*1024 bytes we discard it immediately and set the $too_big flag.
- 48:
If $too_big is set, we tell the scheduler to check the
feed again in 60 minutes and store "toobig" as the status, then go on to
the next feed.
- 50-55:
If we received HTTP 304 (Not Modified), we call the scheduler
subroutine with a status of 'notmodified' and a delay of
60 minutes if the feed has readers, or 1 day if the feed has no
readers.
- 57-66:
A comment justifying the block that follows; it deserves reproduction
here:
    # WARNING: blatant XML spec violation ahead...
    #
    # Blogger doesn't produce valid XML, since they don't handle encodings
    # correctly. So if we see they have no encoding (which is UTF-8
    # implictly) but it's not valid UTF-8, say it's Windows-1252, which
    # won't cause XML::Parser to barf... but there will probably be some
    # bogus characters. [...]
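
The rule stated in that comment can be sketched as below. This is a Python illustration under my own assumptions, not the script's Perl: the function names and regular expressions are hypothetical, and leaving the document untouched when there is no XML declaration at all is a guess on my part. Only the rule itself (no declared encoding plus bytes that are not valid UTF-8 means we declare Windows-1252) comes from the comment.

```python
import re

# An XML declaration at the start of the document, e.g. <?xml version="1.0"?>
XML_DECL = re.compile(rb"^\s*<\?xml\b([^>]*)\?>")
ENCODING = re.compile(rb"""encoding\s*=\s*["']([^"']+)["']""")


def is_utf8(data: bytes) -> bool:
    """Stand-in validity check: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def declare_fallback_encoding(content: bytes) -> bytes:
    """Declare windows-1252 when no encoding is declared and the bytes are not UTF-8."""
    decl = XML_DECL.match(content)
    if decl is None:
        return content  # assumption: nothing to edit without a declaration
    if ENCODING.search(decl.group(1)):
        return content  # an encoding is already declared
    if is_utf8(content):
        return content  # the implicit UTF-8 default is actually correct
    insert_at = decl.end() - 2  # position of the closing "?>"
    return content[:insert_at] + b" encoding='windows-1252'" + content[insert_at:]
```

As the comment warns, this only keeps the parser from rejecting the document; characters may still come out wrong if the feed was really in some other encoding.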
- 67-73:
The code referred to by the above comment. It looks for "<?xml"
and "encoding=" and stores the value of "encoding"; if none is
present and the data does not appear to be UTF-8 (via LJ::is_utf8()),
it edits the content in place to contain "encoding='windows-1252'".
- 75-83:
"Another hack" which checks for Windows smart quotes in a document
which claims to be UTF-8; if present, edits the existing "encoding="
string to read "encoding='windows-1252'".
- 85-89:
Parse the XML with XML::RSS::parse(), trapping exceptions, and store
the resulting RSS object.
- 90-99:
If an exception was raised, call the scheduler subroutine with the status
"parseerror" and a delay of three hours, clean up the error returned by
XML::RSS::parse() and store it as the user's "rssparseerror"
userprop, then go to the next feed.
- 102:
Check that the "items" method of the RSS object returns an array
(of items in the feed). If it doesn't, call the scheduler subroutine
with a status of "noitems" and a delay of 3 hours, then go to
the next feed.
- 104:
Store a copy of the items in the RSS feed in the reverse order
that they appear in the feed.
- 107:
Take only the 20 bottommost items from the feed (i.e., the first
20 items in our stored copy).
- 111-116:
Connect to the LJ database. If the connection fails, call the scheduler
subroutine with status "nodb" and a delay of 15 minutes, then
go to the next feed.
- 118-130:
Retrieve the ids of all articles in the database for this syndicated
account which are older than MAX_FRIENDS_VIEW_AGE (default 2 weeks) and
delete them from the database.
- 132-143:
Try to determine if the <link> field of each item can be expected
to be unique: for each item that has a link field, store the link
field, and if we end up with more than one distinct link, assume that
links can be expected to be unique.
- 149-158:
If the links can be expected to be unique, pull all of the
links of already-stored entries from the database, and store
those links with their entry IDs.
- 163-249:
Iterates over each item in the feed to possibly store it:
- 169-171:
Remove Perl's internal UTF-8 flag from the title, link and
description fields of the item, so that Perl does not know the data
is UTF-8 even if it is. This prevents Perl from performing UTF-8
conversions of its own on top of the ones we do ourselves. These
automatic conversions are what introduced the recent
double-encoding bug where one syndicated entry would break UTF-8
encoding for a whole friends page.
- 173-178:
Take an MD5 hash (via LJ::md5_struct) of all of the fields
and values in the item. Check to see if that digest has already
been stored in the synitem table for this syndicated account; if
it has, assume we've already seen this item and go on to the next
item. If it hasn't, store the MD5 hash and account ID in the
synitem table.
- 180:
Now that we believe this is a new item, increment a counter,
$newcount (which is specific to this account).
- 182-183:
Remove whitespace from the beginning and end of the
<description> field.
- 186-190:
If the item has a <link> field, store it in an <a href> in
a temporary variable, $htmllink.
- 192-209:
Set the fields of the LJ "postevent" request. Of interest:
- 'Subject' is <title>
- 'Event' is $htmllink prepended to <description>
- The 'syn_link' property is <link>
- 211-214:
If the <description> contains <p> or <br>, assume that the
contents are preformatted, and set the "don't autoformat"
property.
- 216-235:
If this <link> appears in the list (from l. 149) of existing
<link>s for the feed, decrement $newcount, and prepare
to perform an "editevent" instead of a "postevent". Maintain the original
posting time.
- 237-239:
Perform the postevent or editevent, using the current time as
post time.
- 240-244:
If successful, make sure at least one second has elapsed since
the entry was posted, sleeping for one second if it has not. This
prevents two articles from being posted at the same time, so
the order in which they appear on the friends page matches the
order they appeared in the feed.
- 245-249:
If the postevent or editevent failed, set $errorflag.
[Repeat from 163 for each item.]
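
To make the per-item decisions above concrete, here is a rough Python sketch (again, not the script's Perl). The function names, the shape of the request dictionary, the exact markup of the link, and the "no_autoformat" key are all hypothetical; the JSON-based digest merely stands in for LJ::md5_struct. The existing_links mapping is assumed to be empty when links are not expected to be unique, since the walkthrough only loads it in that case.

```python
import hashlib
import html
import json
import re
from typing import Dict, Optional, Set

PREFORMATTED = re.compile(r"<(p|br)\b", re.IGNORECASE)


def item_digest(item: Dict[str, str]) -> str:
    """MD5 over every field and value of the item, independent of key order."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def plan_item(
    item: Dict[str, str],
    seen_digests: Set[str],
    existing_links: Dict[str, int],
) -> Optional[dict]:
    """Return a request for a new or changed item, or None if already seen."""
    digest = item_digest(item)
    if digest in seen_digests:
        return None  # an identical item was processed on an earlier run
    seen_digests.add(digest)

    description = item.get("description", "").strip()
    link = item.get("link", "")
    html_link = ""
    if link:
        safe = html.escape(link, quote=True)
        html_link = f'<a href="{safe}">{safe}</a>'  # exact markup is an assumption

    request = {
        "mode": "postevent",
        "subject": item.get("title", ""),
        "event": html_link + description,
        "props": {"syn_link": link},
    }
    if PREFORMATTED.search(description):
        request["props"]["no_autoformat"] = True  # hypothetical property name
    if link and link in existing_links:
        # Same <link> as a stored entry: edit that entry instead of posting.
        request["mode"] = "editevent"
        request["itemid"] = existing_links[link]
    return request
```

The sketch leaves out the one-second spacing between posts and the error flag, which depend on actually submitting the request.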
- 245-255:
If any postevent or editevent for this feed failed (i.e., $errorflag
is set), call the scheduler subroutine with a status of "posterror"
and a delay of 30 minutes, and go to the next feed.
- 257-288:
Update the syndicated account's userinfo:
- 258:
Load the 'url' and 'urlname' properties from the existing userinfo.
- 260-264:
If the RSS feed has a <title>, replace the syndicated account's
name with the feed <title> and set the "urlname" user property
to the feed <title>.
- 266-269:
If the RSS feed (not an individual item) has a <link>, set the
"url" user property to the <link>.
- 271-287:
If the RSS feed (not an individual item) has a <description>,
set the user's bio to the <description>, otherwise record that
the user has no bio.
- 293-297:
Decide when to poll the feed next. And I quote:
    # FIXME: this is super lame. (use hints in RSS file!)
If there were new articles this time, interval is 30 minutes
and status is "ok"; otherwise interval is 60 minutes and
status is "nonew". Call the scheduler subroutine with the
appropriate values.
- 299-304:
Update reader count if there were new articles, or if it has
never been updated, but otherwise leave it at current value.
- 307:
If there are no readers, use the scheduler subroutine to
change next poll interval to 1 day.
- 309-311:
Store last-modified date, entity tag, status, and interval to
next poll in database.
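
Pulling together the statuses and delays mentioned throughout the walkthrough, the polling schedule can be summarised in a short Python sketch. The function and its parameters are hypothetical and do not mirror how synsuck.pl is organised (there the scheduler is called at each exit point); only the status names and the delays in minutes are taken from the text above.

```python
from typing import Optional, Tuple

DAY = 24 * 60  # minutes

# Failure statuses and their retry delays in minutes, as listed in the walkthrough.
FAILURE_DELAYS = {
    "toobig": 60,
    "parseerror": 180,
    "noitems": 180,
    "nodb": 15,
    "posterror": 30,
}


def next_poll(
    failure: Optional[str],
    not_modified: bool,
    new_count: int,
    readers: int,
) -> Tuple[str, int]:
    """Return (status, minutes until the next poll) for one feed."""
    if failure is not None:
        return failure, FAILURE_DELAYS[failure]
    if not_modified:
        return "notmodified", (60 if readers else DAY)
    status, delay = ("ok", 30) if new_count > 0 else ("nonew", 60)
    if not readers:
        delay = DAY  # nobody is reading: poll once a day
    return status, delay
```

For example, next_poll(None, False, 2, 0) gives ("ok", 1440): the feed had new articles, but with no readers it is only polled daily.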
