Spidering the web to find RDF

Mark Watson

Last year, I ran an experiment: I let a very polite
web spider run for a few days, trying to find RDF markup
embedded in web pages. I found close to zero RDF - not
encouraging!

In a recent post, I complained about not being able to
embed RDF in XHTML (at least there is no standard way to do it
and still pass the W3C XHTML validator). Another poster
(Jeen Broekstra) provided a good example of simply
linking to an RDF file at the same site.

I was concerned about spiders being able to find
links to RDF because there is no standard for this.
Then, a few minutes ago, I had one of those "Duh!" experiences:

A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf". If such a link is found, assume that
it describes the page linking to it.
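
A minimal sketch of the extension heuristic described above, using only the Python standard library. The class and function names and the example.org URLs are illustrative assumptions, not part of any existing spider. It treats "same site" as "same host", which is only one possible reading.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class RdfLinkFinder(HTMLParser):
    """Collect same-host links whose URL path ends in ".rdf" (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        # Assumption: "same site" means the same host name.
        self.site = urlparse(page_url).netloc
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page that contains them.
        target = urljoin(self.page_url, href)
        parts = urlparse(target)
        if parts.netloc == self.site and parts.path.lower().endswith(".rdf"):
            self.rdf_links.append(target)


def find_rdf_links(page_url, html_text):
    """Return absolute URLs of same-host links ending in ".rdf"."""
    finder = RdfLinkFinder(page_url)
    finder.feed(html_text)
    return finder.rdf_links
```

Each URL returned would then be fetched and assumed to describe `page_url`. The extension is only a naming convention, so this is a guess about the target's content rather than a guarantee.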

Anyway, I will try my experiment again (when I have
time to set it up) and report the results. I hope that
lots of people link to separate RDF files on their sites
and my results will be better than last year when I
only looked for embedded RDF.

-Mark
 
Nick Kew

one of infinite monkeys said:
A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf".

Ahem ... the last few characters of a URL have absolutely no significance
except by convention. A spider that did that would be broken.

It could, however, look for links with the type="application/rdf+xml"
attribute. It would find a couple in my pages, for instance.
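
As a rough illustration of the attribute-based approach, the sketch below collects links whose `type` attribute declares `application/rdf+xml`, whatever the URL looks like. The names `TypedLinkFinder` and `find_typed_rdf_links` are hypothetical, and checking both `<a>` and `<link>` elements is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "application/rdf+xml"


class TypedLinkFinder(HTMLParser):
    """Collect links that declare type="application/rdf+xml" (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        # Assumption: both <link> and <a> elements are worth checking.
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        declared = (attr.get("type") or "").strip().lower()
        href = attr.get("href")
        if declared == RDF_TYPE and href:
            # The URL's extension is deliberately ignored here.
            self.rdf_links.append(urljoin(self.page_url, href))


def find_typed_rdf_links(page_url, html_text):
    """Return absolute URLs of links whose markup declares the RDF media type."""
    finder = TypedLinkFinder(page_url)
    finder.feed(html_text)
    return finder.rdf_links
```

This relies entirely on what the page author wrote in the markup. It does not depend on how the server labels the linked file.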
If such a link is found, assume that
it describes the page linking to it.

Wouldn't it be better to believe the RDF concerning its own subject?
only looked for embedded RDF.

I played with embedding RDF (for automatically-generated reports),
but abandoned the idea as a nonstarter.
 
Jeen Broekstra

Nick said:
one of infinite monkeys at the keyboard of


Ahem ... the last few characters of a URL have absolutely no
significance except by convention. A spider that did that
would be broken.

It could, however, look for links with the
type="application/rdf+xml" attribute. It would find a couple
in my pages, for instance.

That would, however, only work if the web server from which the
file is hosted is aware of this mime type. I don't know if Apache
comes preconfigured with it these days but I'll bet that older
versions won't spot it (for example, my rdf file would not be
found since the department web server serves it as text/plain).

You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.
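
One possible shape for this opportunistic check, sketched with the Python standard library: treat a response as a candidate if either its media type is XML-like or its URL ends in ".rdf", then confirm by parsing the body and looking for an `rdf:RDF` root element. The function names and the set of accepted media types are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumption: these media types are worth attempting to parse as RDF/XML.
XML_TYPES = {"text/xml", "application/xml", "application/rdf+xml"}
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def is_candidate(url, content_type):
    """Cheap filter: XML-like media type, or a ".rdf" extension (e.g. served as text/plain)."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    path = urlparse(url).path.lower()
    return media_type in XML_TYPES or path.endswith(".rdf")


def looks_like_rdf_xml(body):
    """Return True if body parses as XML and its root element is rdf:RDF."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag == RDF_ROOT
```

Only the root element is checked, so RDF nested inside another XML vocabulary would be missed. The standard library documentation also cautions against parsing untrusted XML with this module, so a real spider would probably want a hardened parser.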

Jeen
 
Nick Kew

one of infinite monkeys said:
That would, however, only work if the web server from which the
file is hosted is aware of this mime type.


Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">
I don't know if Apache
comes preconfigured with it these days but I'll bet that older

Neither do I; in any case it wouldn't do anything for the above example
which I deliberately (and perfectly legitimately) ended with .html.
The server should of course serve it with the correct MIME type,
but that's another issue.
You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.

Even if .rdf gets something, it'll miss out on lots of .cgi, .php,
.xml and other things. It's simply broken.

Relying on the attribute will also miss out on many instances.
It's no more than a more correct thing than ".rdf" to look for
in (x)html links.
 
Mark Watson

Jeen Broekstra said:
You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.

It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

So, I will use both Nick's and Jeen's ideas.

Thanks,
Mark
 
Jeen Broekstra

Nick said:
infinite monkeys at the keyboard of Jeen Broekstra



Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">

Blimey. My bad, I completely misread your post.

Jeen
 
Nick Kew

one of infinite monkeys said:
It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

My previous post was just a correction to something you said, which I
felt called for correction because it so often leads to confusion.

My *practical* suggestion would be to send HEAD requests from the spider
to ascertain the type of any URL before actually fetching it. Then fetch
HTML and XHTML pages to spider for more links, and RDF pages for your
collection.
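
The HEAD-first approach could look roughly like the sketch below, written with Python's standard `urllib`. The names `classify`, `head_content_type`, and `plan_fetch`, the three action labels, and the timeout value are all assumptions made for illustration. It does not represent the spidering software mentioned in this post.

```python
import urllib.request

HTML_TYPES = {"text/html", "application/xhtml+xml"}
RDF_TYPES = {"application/rdf+xml"}


def classify(content_type):
    """Map a Content-Type header value to an action for the spider."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type in HTML_TYPES:
        return "spider"   # fetch and follow its links
    if media_type in RDF_TYPES:
        return "collect"  # fetch and keep as RDF
    return "skip"


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def plan_fetch(url):
    """Decide what to do with a URL before downloading its body."""
    return classify(head_content_type(url))
```

This trusts the server's declared type, so an RDF file served as text/plain would be skipped unless the rules in `classify` are loosened.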

I happen to have spidering software that'll do all that - among other
things:) Though I have the feeling you may not have the budget for it,
given the experimental nature of your task.
 
