OT: MoinMoin and Mediawiki?

Ian Bicking

Paul said:
I just looked at c2; it has about 30k pages (I'd call this medium
sized) and finds incoming links pretty fast. Is it using MoinMoin?
It doesn't look like other MoinMoin wikis that I know of. I'd like to
think it's not finding those incoming links by scanning 30k separate
files in the file system.

c2 is the Original Wiki, i.e., the first one ever, and the system that
coined the term. It's written in Perl. It's definitely not an
advanced Wiki, and it has generally relied on social rather than technical
solutions to problems. Which might be a Wiki principle in itself.
While I believe it used full text searches for things like backlinks in
the past, I believe it uses some kind of index now.

Paul said:
Sometimes I think a wiki could get by with just a few large files.
Have one file containing all the wiki pages. When someone adds or
updates a page, append the page contents to the end of the big file.
That might also be a good time to pre-render it, and put the rendered
version in the big file as well. Also, take note of the byte position
in the big file (e.g. with ftell()) where the page starts. Remember
that location in an in-memory structure (Python dict) indexed on the
page name. Also, append the info to a second file. Find the location
of that entry and store it in the in-memory structure as well. Also,
if there was already a dict entry for that page, record a link to the
old offset in the 2nd file. That means the previous revisions of a
file can be found by following the links backwards through the 2nd
file. Finally, on restart, scan the 2nd file to rebuild the in-memory
structure.
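
The two-file scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `AppendOnlyWiki`, the JSON-lines record format, and the field names (`start`, `length`, `prev`) are all hypothetical and not from the original post, and the pre-rendering step is left out.

```python
import json
import os


class AppendOnlyWiki:
    """Hypothetical sketch: one big data file plus an append-only index file."""

    def __init__(self, data_path, index_path):
        self.data_path = data_path
        self.index_path = index_path
        self.latest = {}  # page name -> offset of its newest index record
        self._rebuild()

    def _rebuild(self):
        # On restart, scan the index file to rebuild the in-memory dict.
        if not os.path.exists(self.index_path):
            return
        with open(self.index_path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                self.latest[json.loads(line)["name"]] = pos

    def save(self, name, text):
        body = text.encode("utf-8")
        # Append the page contents and note where they start.
        with open(self.data_path, "ab") as data:
            data.seek(0, os.SEEK_END)
            start = data.tell()
            data.write(body)
        # Append an index record that links back to the previous revision.
        record = {"name": name, "start": start, "length": len(body),
                  "prev": self.latest.get(name)}
        with open(self.index_path, "ab") as index:
            index.seek(0, os.SEEK_END)
            pos = index.tell()
            index.write(json.dumps(record).encode("utf-8") + b"\n")
        self.latest[name] = pos

    def revisions(self, name):
        """Yield page texts, newest first, by following the prev links."""
        pos = self.latest.get(name)
        while pos is not None:
            with open(self.index_path, "rb") as index:
                index.seek(pos)
                rec = json.loads(index.readline())
            with open(self.data_path, "rb") as data:
                data.seek(rec["start"])
                yield data.read(rec["length"]).decode("utf-8")
            pos = rec["prev"]
```

Reading the current version of a page is then one dict lookup, one seek in the index file, and one seek in the data file. A real implementation would also need locking for concurrent writers and some way to recover from a partially written record.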

That sounds like you'd be implementing your own filesystem ;)

If you are just trying to avoid too many files in a directory, another
option is to put files in subdirectories like:

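import base64, os, struct, zlib
# crc32 is stable across runs (hash() is not); urlsafe base64 keeps '/' out of the name
base = struct.pack('I', zlib.crc32(page_name.encode('utf-8')))
base = base64.urlsafe_b64encode(base).decode('ascii').strip('=')
filename = os.path.join(base, page_name)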
 
Paul Rubin

Ian Bicking said:
That sounds like you'd be implementing your own filesystem ;)

Yes, this shouldn't be any surprise. Implementing a special purpose
file system is what every database essentially does.

Ian Bicking said:
If you are just trying to avoid too many files in a directory, another
option is to put files in subdirectories like:

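import base64, os, struct, zlib
# crc32 is stable across runs (hash() is not); urlsafe base64 keeps '/' out of the name
base = struct.pack('I', zlib.crc32(page_name.encode('utf-8')))
base = base64.urlsafe_b64encode(base).decode('ascii').strip('=')
filename = os.path.join(base, page_name)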

Using subdirectories certainly keeps directory size down, and it's a
good idea for MoinMoin given the way MoinMoin uses the file system.
But for really big wikis, I think using the file system like that
isn't workable even with subdirectories. Plus, there's the issue of
how to find backlinks and how to do full text search.
 
Ian Bicking

Paul said:
Using subdirectories certainly keeps directory size down, and it's a
good idea for MoinMoin given the way MoinMoin uses the file system.
But for really big wikis, I think using the file system like that
isn't workable even with subdirectories. Plus, there's the issue of
how to find backlinks and how to do full text search.

If the data has to be somewhere, and you have to have relatively random
access to it (i.e., access any page; not necessarily a chunk of a page),
then the filesystem does that pretty well, with lots of good features
like caching and whatnot. I can't see a reason not to use the
filesystem, really.

For backlink indexing, that's a relatively easy index to maintain
manually, simply by scanning pages whenever they are modified. The
result of that indexing can be efficiently put in yet another file
(well, maybe one file per page).
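
As a rough illustration of maintaining the backlink index on each edit, here is a minimal in-memory sketch. It assumes links are CamelCase WikiWords, as on c2; the `BacklinkIndex` class and the regular expression are hypothetical, and persisting the result to a file per page is left out.

```python
import re
from collections import defaultdict

# Assumption: a link is a CamelCase WikiWord; real wikis use other syntaxes too.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


class BacklinkIndex:
    def __init__(self):
        self.outgoing = {}                # page -> set of pages it links to
        self.incoming = defaultdict(set)  # page -> set of pages linking to it

    def update(self, page, text):
        """Rescan one page after an edit and adjust only the changed links."""
        new_links = set(WIKI_WORD.findall(text)) - {page}
        old_links = self.outgoing.get(page, set())
        for target in old_links - new_links:
            self.incoming[target].discard(page)
        for target in new_links - old_links:
            self.incoming[target].add(page)
        self.outgoing[page] = new_links

    def backlinks(self, page):
        return sorted(self.incoming.get(page, ()))
```

Only the edited page is scanned, so the cost of a save grows with the size of that page rather than with the size of the wiki.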

For full text search, you'll want already-existing code to do it for
you. MySQL contains such code. But there's also lots of software
that works well on the filesystem to do the same thing.

A database would be important if you wanted to do arbitrary queries
combining several sources of data. And that's certainly possible in a
wiki, but that's not so much a scaling issue as a
flexibility-in-reporting issue.
 
Paul Rubin

Ian Bicking said:
If the data has to be somewhere, and you have to have relatively
random access to it (i.e., access any page; not necessarily a chunk of
a page), then the filesystem does that pretty well, with lots of good
features like caching and whatnot. I can't see a reason not to use
the filesystem, really.

For one thing you waste lots of space for small files because of
partially empty blocks at the end of each page. Sure, disk space is
cheap, but you similarly waste space in your ram cache, which impacts
performance and isn't so cheap. For another, you need multiple seeks
to get to each file (scan the directory to get the inode number, read
the inode, get the list of indirect blocks from the file, read each of
those, etc). With big files, the inode and indirect blocks will be
cached, so you only have to seek once. Finally, you lose some control
over what's in ram and what needs to be retrieved. You can do a
better job of tuning your cache strategy to the precise needs of your
wiki than the file system can with its one-size-fits-all approach.
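
To put a number on the partially-empty-block point, here is a small back-of-the-envelope calculation. It assumes a filesystem that allocates whole 4096-byte blocks per file, and the page sizes are made up for illustration rather than measured from any real wiki.

```python
def slack_bytes(size, block=4096):
    """Bytes left unused in the last block of a file of `size` bytes."""
    return -size % block


# Hypothetical page sizes in bytes, not measurements.
sizes = [300, 1200, 2500, 5000, 9000]
used = sum(sizes)
wasted = sum(slack_bytes(s) for s in sizes)

print(f"{used} bytes of content, {wasted} bytes of slack")
```

The smaller the typical page is relative to the block size, the larger the share of disk and cache taken up by slack.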

Ian Bicking said:
For backlink indexing, that's a relatively easy index to maintain
manually, simply by scanning pages whenever they are modified. The
result of that indexing can be efficiently put in yet another file
(well, maybe one file per page).

Opening and closing the extra files imposes considerable overhead,
though it would take actual benchmarking to get precise figures.

Ian Bicking said:
For full text search, you'll want already-existing code to do it for
you. MySQL contains such code. But there's also lots of software
that works well on the filesystem to do the same thing.

Have you ever used the MySQL fulltext search feature? It's awful.

Ian Bicking said:
A database would be important if you wanted to do arbitrary queries
combining several sources of data. And that's certainly possible in a
wiki, but that's not so much a scaling issue as a
flexibility-in-reporting issue.

An RDBMS is a good backend for a medium sized wiki, since it takes
care of so many issues for you. For a very big wiki that needs high
performance, there are better approaches, though they take more work.
 
