rss parsing error.


Young Gyu Park

[Note: parts of this message were removed to make it a legal post.]

Recently I have been trying to parse 'http://www.forbes.com/news/index.xml' using
Feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

But Google's RSS reader processes the feed correctly, without any problem.
I wonder how they manage that when I can't.

Please help me narrow the gap between me and Google ^.^

Have a happy day.

Juvenn Woo

Hi, Young:
I think you could check out the Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package.

Regards,

Kouhei Sutou

Hi,

In <[email protected]>
"Re: rss parsing error." on Tue, 14 Jul 2009 00:26:37 +0900,
Juvenn Woo said:
Hi, Young:
I think you could check out the Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package.

This feed can be parsed with Ruby's bundled RSS parser.
We don't need to use Universal Feed Parser.
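
A minimal sketch of that approach, assuming the `rss` library is available (it ships with Ruby, though on Ruby 3.0 and later it is a bundled gem and may need `gem install rss`). The sample feed is a made-up stand-in, not the real Forbes output; it wraps its link elements in CDATA purely as an illustration. Passing `false` as the second argument to `RSS::Parser.parse` turns off validation:

```ruby
require 'rss'

# Hypothetical sample feed (not the real Forbes output).
sample_feed = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example News</title>
      <link><![CDATA[http://example.com/news]]></link>
      <description>Sample channel</description>
      <item>
        <title>First story</title>
        <link><![CDATA[http://example.com/news/1]]></link>
        <description>Story body</description>
      </item>
    </channel>
  </rss>
XML

# Second argument false = do not validate against the RSS spec.
feed = RSS::Parser.parse(sample_feed, false)

puts "Feed title: #{feed.channel.title}"
puts "Feed link: #{feed.channel.link}"
feed.items.each do |item|
  puts "Item: #{item.title} <#{item.link}>"
end
```

To try it against a live feed, the string could be replaced with the body fetched from the URL; whether a given real-world feed parses cleanly still has to be checked case by case.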

zotium

Has anybody tried comparing the performance of Feedzirra and Universal
Feed Parser? Which is faster when processing thousands of feeds?

G_ F_

Young said:
Recently I have been trying to parse 'http://www.forbes.com/news/index.xml'
using Feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

Glancing at the output of their feed I see no malformed RSS. I do see
them "exercising some options" that most feeds don't, such as embedding
CDATA in the link tags.

Using Nokogiri to parse this feed is easy:

#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

url = 'http://www.forbes.com/news/index.xml'
# On Ruby 3.0+ Kernel#open no longer fetches URLs; open-uri provides URI.open.
xml = Nokogiri::XML(URI.open(url))

# In Nokogiri, `%` is an alias for `at` (first match) and `/` for `search` (all matches).
puts "Feed title: #{ (xml%'title').content }"
puts "Feed description: #{ (xml%'description').content }"
puts "Feed link: #{ (xml%'link').content }"

# get the first item
item = (xml/'item').first
puts "Item title: #{ (item%'title').content }"
puts "Item link: #{ (item%'link').content }"
puts "Item pubDate: #{ (item%'pubDate').content }"
puts "Item description: #{ (item%'description').content }"
# <author> is optional in RSS 2.0, so guard against a missing element.
puts "Item author: #{ (item%'author')&.content }"

Not all feeds are this straightforward or well constructed. That's where
a pre-built parsing library comes in handy, but I haven't found
one yet that handles everything out there correctly. Even Google's
reader gets it wrong on some malformed feeds.

Aaron Patterson (AKA tenderlove) has done a great job with Nokogiri.
I've tested a lot of feeds and seen occasions where the built-in RSS
reader and other libraries puked or spun off and never returned. I've
run into feeds that caused Hpricot to be unable to strip broken HTML
embedded inside the descriptions, but Nokogiri was able to handle it.
So, if you can't get a library to do what you want, jump in with
Nokogiri and give it a try.