Aggregation of RSS Feeds

jamesjacobyu

Not sure if this is the best place to ask this question, but here goes:

I'm programming an aggregator that keeps track of a large number of
feeds (basically an RSS reader). The problem is, I want an automatic
way to know when sites have updated, so my program doesn't have to keep
checking all the feeds to see if they've updated.

I know that there are ping servers that blogs ping when they've
updated, like blo.gs and weblogs.com. But, I am unsure about how to use
this to my advantage. Can anyone point me in the right direction?

Also, does anyone know if RSS XML docs have a way to query them for the
last updated time? That way, I don't actually have to download the
whole doc to see if it's been updated (I know HTML has this capability).

Thanks,
James
 
Andy Dingley

> I'm programming an aggregator that keeps track of a large number of
> feeds (basically an RSS reader). The problem is, I want an automatic
> way to know when sites have updated,

There are several ways.

Register with an update service (look at the "cloud" element in the
Winer specs: RSS 0.92/0.94 and RSS 2.0).

Ask for the RSS document by HTTP and look at the headers received. This
often doesn't work, because badly coded servers set the "Last-Modified"
date to the time the document was served. You might also be able to use
an HTTP HEAD request rather than a GET, so you don't have to download
the whole document (rarely implemented, though).

Download an RSS 2.0 document, or an RSS 1.0 document that uses the
Syndication module, and look at the suggested revisit time (the ttl
element in RSS 2.0; updatePeriod and updateFrequency in the Syndication
module).

Download the document, hash it to a signature (SHA1 or MD5 is easy to
find code for, but you might want to normalise the XML first). When the
signature changes, assume it's a changed document. Develop your own
"revisit after" estimation, based on how often the document actually
changes. Randomly vary the time your server revisits, so as to track
update frequencies that vary over time (many blogs are quite
unpredictable).

Some combination of the last two techniques.

Just download the document anyway.
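
The hash-and-compare approach described above could be sketched roughly as follows. This is only an illustration, not a tested aggregator: the function names, the halve/grow factors, the 15-minute and 24-hour bounds, and the 20% jitter are all assumptions chosen for the example. It uses the standard library's C14N canonicalisation (Python 3.8+) as one possible way to "normalise the XML".

```python
import hashlib
import random
import xml.etree.ElementTree as ET


def feed_signature(xml_bytes: bytes) -> str:
    """Return a SHA-1 hex digest of a canonicalised copy of the feed."""
    # Canonicalisation sorts attributes and (with strip_text) drops
    # surrounding whitespace, so purely cosmetic differences do not
    # look like an update.
    canonical = ET.canonicalize(xml_bytes, strip_text=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def next_interval(current: float, changed: bool,
                  minimum: float = 900.0, maximum: float = 86400.0) -> float:
    """Estimate the next revisit delay in seconds (hypothetical policy)."""
    # Visit sooner after a change, back off when nothing changed.
    base = current / 2 if changed else current * 1.5
    base = min(max(base, minimum), maximum)
    # Random jitter so the schedule can drift towards the feed's real
    # update rhythm instead of locking onto one fixed period.
    return base * random.uniform(0.8, 1.2)
```

A caller would store the last signature per feed, compare it with `feed_signature()` of the fresh download, and pass the result of that comparison to `next_interval()` as `changed`.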
 
andreas_is_here

The "Last-Modified" header is set properly by any server wishing to
survive. Since a typical aggregator asks for updates once an hour, not
setting this header means wasting tons of bandwidth.

So you should keep this "Last-Modified" time, and send it back in your
next request as "If-Modified-Since". If the feed hasn't changed, any
sane server will respond with "304 Not Modified", and you're both happy.
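
A conditional GET along those lines might look like the sketch below, using only Python's standard library. The function names and the 30-second timeout are assumptions made for the example. Note that `urllib` reports a 304 response as an `HTTPError`, so it is caught and treated as "nothing new".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request, adding If-Modified-Since when we have a date."""
    request = urllib.request.Request(url)
    if last_modified:
        # Send back the exact string the server gave us last time.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_modified(url, last_modified=None):
    """Return (body, last_modified); body is None if the feed is unchanged."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the stored date and skip parsing.
            return None, last_modified
        raise
```

The aggregator would store the returned "Last-Modified" value next to each feed URL and pass it back in on the next poll.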

Also keep an eye on the feed's ttl element as well as its skipHours
element.
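
As a rough sketch of honouring those two hints, the code below reads ttl (in minutes) and skipHours (hours 0 to 23, GMT) from an RSS 2.0 channel. The function names are made up for the example, and it covers plain RSS 2.0 only, not RSS 1.0 or namespaced feeds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def revisit_hints(xml_text):
    """Return (ttl_minutes or None, set of hours to skip) for an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        return None, set()
    ttl_text = (channel.findtext("ttl") or "").strip()
    ttl = int(ttl_text) if ttl_text.isdigit() else None
    skip = set()
    for hour in channel.findall("skipHours/hour"):
        text = (hour.text or "").strip()
        if text.isdigit():
            skip.add(int(text))
    return ttl, skip


def may_poll(last_fetch, now, ttl_minutes, skip_hours):
    """Decide whether polling is allowed right now (both datetimes aware)."""
    # skipHours values are expressed in GMT.
    if now.astimezone(timezone.utc).hour in skip_hours:
        return False
    # ttl says how long a fetched copy may be cached, in minutes.
    if ttl_minutes is not None and now - last_fetch < timedelta(minutes=ttl_minutes):
        return False
    return True
```

A scheduler could call `may_poll()` before each request and simply requeue the feed for later when it returns False.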

Good luck.
Andreas.
 
