Spidering the web to find RDF

Mark Watson · Oct 2, 2003

Last year, I ran an experiment: I let a very polite
web spider run for a few days, trying to find RDF markup
embedded in web pages. I found close to zero RDF - not
encouraging!

In a recent post, I complained about not being able to
embed RDF in XHTML (at least, there is no standard way to do it
and still pass the W3C XHTML validator). Another poster
(Jeen Broekstra) provided a good example of simply
linking to an RDF file at the same site.

I was concerned about spiders being able to find
links to RDF because there is no standard for this.
Then a few minutes ago I had one of those "Duh!" experiences:

A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf". If such a link is found, assume that
it describes the page linking to it.
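
A minimal sketch of that check in Python, assuming "same site" means "same host name" and that the extension is taken from the URL path only. The function name same_site_rdf_link is made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def same_site_rdf_link(page_url, href):
    """Return the absolute URL if href is a same-host link whose path
    ends in ".rdf"; otherwise return None.

    Assumption: "same site" is approximated by comparing host names.
    """
    target = urljoin(page_url, href)          # resolve relative links
    page, link = urlparse(page_url), urlparse(target)
    if link.scheme not in ("http", "https"):
        return None
    if link.netloc.lower() != page.netloc.lower():
        return None
    # Look only at the path, so a query string does not hide the extension.
    extension = posixpath.splitext(link.path)[1].lower()
    return target if extension == ".rdf" else None
```

For example, same_site_rdf_link("http://example.org/a/index.html", "meta.rdf") resolves the link against the page and returns the absolute URL, while a link to another host or to "page.html" returns None.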

Anyway, I will try my experiment again (when I have
time to set it up) and report the results. I hope that
lots of people link to separate RDF files on their sites
and my results will be better than last year when I
only looked for embedded RDF.

-Mark

Nick Kew · Oct 3, 2003

one of infinite monkeys said:
A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf".

Ahem ... the last few characters of a URL have absolutely no significance
except by convention. A spider that did that would be broken.

It could, however, look for links with the type="application/rdf+xml"
attribute. It would find a couple in my pages, for instance.
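
As a rough illustration of the attribute-based approach, here is a sketch using Python's standard html.parser. The names RdfLinkFinder and find_rdf_links are hypothetical, and checking both link and a elements is an assumption rather than anything stated above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "application/rdf+xml"


class RdfLinkFinder(HTMLParser):
    """Collect href values of <link>/<a> elements typed as RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr = dict(attrs)
        # Drop any parameters such as "; charset=utf-8" before comparing.
        mime = (attr.get("type") or "").split(";")[0].strip().lower()
        href = attr.get("href")
        if mime == RDF_TYPE and href:
            self.rdf_urls.append(urljoin(self.base_url, href))


def find_rdf_links(html, base_url):
    """Return absolute URLs of all links declared as application/rdf+xml."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.rdf_urls
```

Note that the URL's extension plays no part here: a link ending in ".html" is collected if its type attribute says RDF, and a link ending in ".rdf" without the attribute is ignored.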

If such a link is found, assume that
it describes the page linking to it.

Wouldn't it be better to believe the RDF concerning its own subject?

only looked for embedded RDF.

I played with embedding RDF (for automatically-generated reports),
but abandoned the idea as a nonstarter.

Jeen Broekstra · Oct 3, 2003

Nick said:
one of infinite monkeys at the keyboard of

Ahem ... the last few characters of a URL have absolutely no
significance except by convention. A spider that did that
would be broken.

It could, however, look for links with the
type="application/rdf+xml" attribute. It would find a couple
in my pages, for instance.

That would, however, only work if the web server from which the
file is hosted is aware of this mime type. I don't know if Apache
comes preconfigured with it these days but I'll bet that older
versions won't spot it (for example, my rdf file would not be
found since the department web server serves it as text/plain).

You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.
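
One way the "try to parse it" part might look, sketched with Python's standard xml.etree.ElementTree. The helper name looks_like_rdf_xml is made up, and the test is deliberately narrow: it only recognises documents whose root element is rdf:RDF, so RDF/XML that omits that wrapper would be missed.

```python
import xml.etree.ElementTree as ET

# The RDF syntax namespace; the root of most RDF/XML files is rdf:RDF.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdf_xml(body):
    """Return True if body (bytes or str) parses as XML with an rdf:RDF root.

    Assumption: an rdf:RDF root is a good enough signal for a spider.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False                      # not well-formed XML at all
    return root.tag == "{%s}RDF" % RDF_NS
```

A spider could run this on anything served as text/xml or text/plain, regardless of what the URL looks like.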

Jeen

Nick Kew · Oct 3, 2003

one of infinite monkeys said:
That would, however, only work if the web server from which the
file is hosted is aware of this mime type.

Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">

I don't know if Apache
comes preconfigured with it these days but I'll bet that older

Neither do I; in any case it wouldn't do anything for the above example
which I deliberately (and perfectly legitimately) ended with .html
The server should of course serve it with the correct MIME type,
but that's another issue.

You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.

Even if .rdf gets something, it'll miss out on lots of .cgi, .php,
.xml and other things. It's simply broken.

Relying on the attribute will also miss out on many instances.
It is merely a more correct thing than ".rdf" to look for
in (X)HTML links.

Mark Watson · Oct 3, 2003

Jeen Broekstra said:
You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.

It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

So, I will use both Nick's and Jeen's ideas.

Thanks,
Mark

Jeen Broekstra · Oct 3, 2003

Nick said:
infinite monkeys at the keyboard of Jeen Broekstra

Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">

Blimey. My bad, I completely misread your post.

Jeen

Nick Kew · Oct 3, 2003

one of infinite monkeys said:
It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

My previous post was just a correction to something you said, which I
felt called for correction because it so often leads to confusion.

My *practical* suggestion would be to send HEAD requests from the spider
to ascertain the type of any URL before actually fetching it. Then fetch
HTML and XHTML pages to spider for more links, and RDF pages for your
collection.
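
A sketch of that HEAD-first approach using Python's urllib. The function names and the "spider" / "collect" / "skip" labels are invented for illustration, and the sketch trusts whatever Content-Type the server reports, which will not help when a server sends RDF as text/plain.

```python
import urllib.request


def classify_media_type(content_type):
    """Decide what to do with a URL given its Content-Type header value."""
    # Strip parameters such as "; charset=utf-8" and normalise case.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "spider"    # fetch and follow its links
    if mime == "application/rdf+xml":
        return "collect"   # fetch and store as RDF
    return "skip"


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def plan_fetch(url):
    """Combine the two steps: ask the server first, then decide."""
    return classify_media_type(head_content_type(url))
```

Only plan_fetch and head_content_type touch the network; classify_media_type is kept separate so the decision logic can be checked without a server.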

I happen to have spidering software that'll do all that, among other
things.

Though I have the feeling you may not have the budget for it,
given the experimental nature of your task.
