Create an index from a webpage


Simon Cropper

Hi,

I am getting dizzy on google.

I am after a way of pointing a python routine to my website and have it
create a tree, represented as a hierarchical HTML list in a webpage, of
all the pages in that website (recursive list of internal links to HTML
documents; ignore images, etc.).

It is essentially a contents page or sitemap for the site.

Interestingly, despite trying quite a few keyword combinations, I was
unable to find such a script.

Anyone have any ideas?

--
Cheers Simon

Simon Cropper - Open Content Creator / Website Administrator

Free and Open Source Software Workflow Guides
 

Thomas 'PointedEars' Lahn

Simon said:
I am after a way of pointing a python routine to my website and have it
create a tree, represented as a hierarchical HTML list in a webpage, of
all the pages in that website (recursive list of internal links to HTML
documents; ignore images, etc.).

It is essentially a contents page or sitemap for the site.

<http://lmgtfy.com/?q=python+sitemap>

If all else fails, use markup parsers like

- <http://www.crummy.com/software/BeautifulSoup/>
- <http://lxml.de/>

and write it yourself. It is not hard to do.
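As a rough illustration of the "write it yourself" route, the following is a minimal sketch of the first step: pulling the internal links to HTML documents out of one page. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup or lxml so that it runs without extra installs. The names `LinkCollector` and `internal_links`, the rule for what counts as an HTML page, and the `example.com` URL in the usage note are all assumptions made for illustration, not anything from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, page_url):
    """Return sorted absolute URLs on page_url's host that look like HTML pages."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # external link, mailto:, etc.
        # Assumption: a page is a directory-style URL, an extensionless
        # name, or a .html/.htm file; everything else (images, ...) is skipped.
        last = parts.path.rsplit("/", 1)[-1]
        if last == "" or "." not in last or last.lower().endswith((".html", ".htm")):
            found.add(absolute)
    return sorted(found)
```

Called as `internal_links(page_source, "http://example.com/index.html")`, this gives the list of same-site pages linked from that one page. A full crawler would fetch each returned URL in turn (for instance with `urllib.request.urlopen`) and keep a set of visited URLs so that it does not loop.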
 

Steven D'Aprano

Thomas said:

[climbs up on the soapbox and begins rant]

Please don't use lmgtfy. The joke, such as it is, stopped being funny about
three years ago. It's just annoying, and besides, it doesn't even work
without Javascript. Kids today have no respect, get off my lawn, grump
grump grump...

It's no harder to put the search terms into a google URL, which still gets
the point across without being a dick about it:

www.google.com/search?q=python+sitemap

[ends rant, climbs back down off soapbox]

Or better still, use a search engine that doesn't track and bubble your
searches:

https://duckduckgo.com/html/?q=python+sitemap

You can even LMDDGTFY if you insist.

http://lmddgtfy.com/


Completely-undermining-my-own-message-ly y'rs,
 

Simon Cropper

[SNIP]
It's no harder to put the search terms into a google URL, which still gets
the point across without being a dick about it:
[SNIP]

[RANT]

OK I was not going to say anything but...

1. Being told to google-it when I explicitly stated in my initial post
that I had been doing this and had not been able to find anything is
just plain rude. It is unconstructive and irritating.

2. I presume that python-list is a mail list for python users -
beginners, intermediate and advanced. If it is not then tell me and I
will go somewhere else.

3. Some searches, particularly for common terms, throw millions of hits.
'Python' returns 147,000,000 results on google, 'Sitemap' returns
1,410,000,000 results. Even 'Python AND Sitemap' still returns 5,020
results. Working through these links takes you round and round with no
clear solutions. Asking for help on the primary python mail list,
after conducting a preliminary investigation for tools, libraries and
code snippets, seemed legitimate.

4. AND YES, I could write a program but why recreate code when there is
a strong likelihood that code already exists? One of the advantages of
python is that a lot of code is redistributed under licences that
promote reuse. So why reinvent the wheel when there is a library full of
code? Sometimes you just need help finding the door.

5. If someone is willing to help me, rather than lecture me (or poke me
to see if they get a response), I would appreciate it.

[END RANT]

For people who are willing to help, my original request was...

I am after a way of pointing a python routine to my website and have it
create a tree, represented as a hierarchical HTML list in a webpage, of
all the pages in that website (recursive list of internal links to HTML
documents; ignore images, etc.).

In subsequent notes to Thomas 'PointedEars'...

I pointed to an example of the desired output here
http://lxml.de/sitemap.html

--
Cheers Simon

Simon Cropper - Open Content Creator / Website Administrator

Free and Open Source Software Workflow Guides
 

Steven D'Aprano

Simon said:
1. Being told to google-it when I explicitly stated in my initial post
that I had been doing this and had not been able to find anything is
just plain rude. It is unconstructive and irritating.

Why, so you did. Even though I wasn't the one who told you to google it, I'll
apologise too because I was thinking the same thing. Sorry about that.

3. Some searches, particularly for common terms, throw millions of hits.
'Python' returns 147,000,000 results on google, 'Sitemap' returns
1,410,000,000 results. Even 'Python AND Sitemap' still returns 5,020
results.

How about "python generate a site map"? The very first link on DuckDuckGo is
this:

http://www.conversationmarketing.com/2010/08/python-sitemap-crawler-1.htm

Despite the domain, there is actual Python code on the page. Unfortunately
it looks like crappy code with broken formatting and a mix of <\br> tags,
but it's a start.

Searching for "site map" on PyPI returns a page full of hits:

http://pypi.python.org/pypi?:action=search&term=site+map&submit=search

Most of them seem to rely on a framework like Django etc, but you might find
something useful.

4. AND YES, I could write a program but why recreate code when there is
a strong likelihood that code already exists.

"Strong" likelihood? Given how hard it is to find an appropriate sitemap
generator written in Python, I'd say the likelihood of finding one that
meets your needs and is publicly available under an appropriate licence
is vanishingly small.

If you do decide to write your own, please consider uploading it to PyPI
under a FOSS licence.
 

Simon Cropper

If you do decide to write your own, please consider uploading it to PyPI
under a FOSS licence.

At present I am definitely getting the impression that my assumption
that something like this 'must be out there' is wrong.

I am following people's links and suggestions (as well as my own; I have
spent 1-2 hours looking) but have not found anything that is able to be
used with only minor adjustments.

I have found an XML-Sitemaps Generator at http://www.xml-sitemaps.com;
this page allows you to create the XML files that can be uploaded to
google. But as stated, I don't actually want what people now call
'sitemaps'; I want an automatically updated 'index / contents page' for my
website. For example, if I add a tutorial or update any of my links I
want the 'global contents page' to be updated when the python script is run.
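To make the 'index / contents page' idea concrete, here is a minimal sketch of the rendering step only: turning a flat collection of page paths into a hierarchical HTML list. It assumes the paths have already been gathered by some crawl, and the function names `build_tree` and `render` and the sample paths in the usage note are hypothetical.

```python
from html import escape


def build_tree(paths):
    """Turn URL paths such as '/guides/intro.html' into nested dicts."""
    tree = {}
    for path in paths:
        node = tree
        # Ignore empty segments caused by leading, trailing or doubled slashes.
        for part in (p for p in path.split("/") if p):
            node = node.setdefault(part, {})
    return tree


def render(tree, prefix=""):
    """Render the nested dicts as an HTML unordered list (empty string if no items)."""
    if not tree:
        return ""
    items = []
    for name, children in sorted(tree.items()):
        path = f"{prefix}/{name}"
        if children:
            # A directory: show its name and recurse into its contents.
            items.append(f"<li>{escape(name)}{render(children, path)}</li>")
        else:
            # A leaf page: link to it.
            items.append(f'<li><a href="{escape(path)}">{escape(name)}</a></li>')
    return "<ul>" + "".join(items) + "</ul>"
```

For example, `render(build_tree(["/index.html", "/guides/a.html", "/guides/b.html"]))` produces one outer list with a nested list under `guides`. Rerunning the script after adding a page regenerates the whole list, which is the 'updated when the python script is run' behaviour described above. Using file names as link text is a simplification; page titles would read better.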

I am now considering how I might address this requirement. If I create a
python script I will post it on PyPI. As with all my work it will be
released under the GPLv3 licence.

Thanks for your help.

--
Cheers Simon

Simon Cropper - Open Content Creator / Website Administrator

Free and Open Source Software Workflow Guides
 

Chris Angelico

At present I am definitely getting the impression that my assumption that
something like this 'must be out there' is wrong.

I have found an XML-Sitemaps Generator at http://www.xml-sitemaps.com;
this page allows you to create the XML files that can be uploaded to google.
But as stated, I don't actually want what people now call 'sitemaps'; I want an
automatically updated 'index / contents page' for my website. For example, if
I add a tutorial or update any of my links I want the 'global contents page'
to be updated when the python script is run.

What you're looking at may be closer to autogenerated documentation
than to a classic site map. There are a variety of tools that generate
HTML pages on the basis of *certain information found in* all the
files in a directory (as opposed to the entire content of those
files). What you're trying to do may be sufficiently specific that it
doesn't already exist, but it might be worth having a quick look at
autodoc/doxygen - at least for some ideas.

Chris Angelico
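The approach described here, generating an HTML page from certain information found in every file of a directory, might look roughly like the sketch below. It assumes the site is available as a local directory of static files and that the one piece of information wanted from each file is its `<title>`. The names `page_title` and `contents_list` are hypothetical, and the regular expression is a shortcut that a real markup parser would do better.

```python
import os
import re
from html import escape

# Assumption: a simple regex is good enough to find the first <title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(path):
    """Return the text of the first <title> element, or the file name if none."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        match = TITLE_RE.search(handle.read())
    if match and match.group(1).strip():
        return " ".join(match.group(1).split())  # collapse whitespace
    return os.path.basename(path)


def contents_list(root, rel=""):
    """Build a nested <ul> of the HTML files below root, labelled by title."""
    here = os.path.join(root, rel)
    items = []
    for name in sorted(os.listdir(here)):
        rel_path = f"{rel}/{name}" if rel else name
        full = os.path.join(here, name)
        if os.path.isdir(full):
            inner = contents_list(root, rel_path)
            if inner:  # skip directories that contain no HTML pages
                items.append(f"<li>{escape(name)}{inner}</li>")
        elif name.lower().endswith((".html", ".htm")):
            title = escape(page_title(full))
            items.append(f'<li><a href="{escape(rel_path)}">{title}</a></li>')
    return "<ul>" + "".join(items) + "</ul>" if items else ""
```

Calling `contents_list("/path/to/site")` on a local copy of the site returns the nested list as a string, ready to be dropped into a contents page template. Images and other non-HTML files are ignored because only `.html` and `.htm` names are listed.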
 

Simon Cropper

What you're looking at may be closer to autogenerated documentation
than to a classic site map. There are a variety of tools that generate
HTML pages on the basis of *certain information found in* all the
files in a directory (as opposed to the entire content of those
files). What you're trying to do may be sufficiently specific that it
doesn't already exist, but it might be worth having a quick look at
autodoc/doxygen - at least for some ideas.

Chris Angelico

Chris,

Your assessment is correct. Working through PyPI I am having better
luck using different terms than the old term 'sitemap'.

I have found a link to funnelweb which uses the transmogrify library
(yeah, as if I would have typed this term into google!) that is
described as "Crawl and parse static sites and import to Plone".

http://pypi.python.org/pypi/funnelweb/1.0

As funnelweb is modular, using a variety of the transmogrify tools,
maybe I could modify this to create a 'non-plone' version.

--
Cheers Simon

Simon Cropper - Open Content Creator / Website Administrator

Free and Open Source Software Workflow Guides
 

Chris Angelico

Chris,

Your assessment is correct. Working through PyPI I am having better luck
using different terms than the old term 'sitemap'.

I have found a link to funnelweb which uses the transmogrify library (yeah,
as if I would have typed this term into google!) that is described as "Crawl
and parse static sites and import to Plone".

And once again, python-list has turned a rant into a useful,
informative, and productive thread :)

ChrisA
 
