advice needed on maintaining large collections of unique ids

Andrew

Hello,

I am working on a system that stores items in a CMS. Each item is
indexed by a unique id. The uid can be quite large (255 bytes). We
will be having several million items in the CMS. Items occasionally
get deleted from the CMS. I need to write a program that watches the
CMS to see what new uids appear there. The size of a uid and the sheer
number of them prevent all the uids from being held in memory at the
same time. Does anyone have any suggestions as to how I could detect
new uids, please?

One obvious all-in-memory solution is to maintain a Map of uids. One
reads in the old Map, scans the CMS and for each uid writes an entry
to the Map. This will update any entry already there and create an
entry that needs to be created. Simple. But it just takes up too much
memory.

I am wondering how efficient it might be to append all the uids
from the current CMS to a file that contains the last collection of
uids. Then sort -u. There will be millions of records. Any thoughts?
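To make that concrete, here is a rough sketch of the comparison pass I
have in mind, assuming both the previous and the current uid snapshots
have been through sort -u, one uid per line (the file names are
invented). It works like comm -13 and never holds more than two lines
in memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Emit uids present in the new snapshot but not in the old one.
// Both files must be sorted (e.g. by "sort -u"), one uid per line.
public class NewUids {
    public static void main(String[] args) throws IOException {
        try (BufferedReader oldIds = Files.newBufferedReader(Paths.get("old-uids.txt"));
             BufferedReader newIds = Files.newBufferedReader(Paths.get("new-uids.txt"))) {
            String o = oldIds.readLine();
            String n = newIds.readLine();
            while (n != null) {
                int cmp = (o == null) ? 1 : o.compareTo(n);
                if (cmp < 0) {          // old uid no longer present; skip it
                    o = oldIds.readLine();
                } else if (cmp == 0) {  // uid in both snapshots; not new
                    o = oldIds.readLine();
                    n = newIds.readLine();
                } else {                // uid only in the new snapshot
                    System.out.println(n);
                    n = newIds.readLine();
                }
            }
        }
    }
}

Printing on the cmp < 0 branch instead would report deletions, if
those ever become interesting.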

Regards,

Andrew Marlow
 
Tom Anderson

I am working on a system that stores items in a CMS. Each item is
indexed by a unique id. The uid can be quite large (255 bytes). We will
be having several million items in the CMS. Items occasionally get
deleted from the CMS. I need to write a program that watches the CMS to
see what new uids appear there. The size of a uid and the sheer number
of them prevent all the uids from being held in memory at the same
time. Does anyone have any suggestions as to how I could detect new
uids, please?

The best thing would be to trap additions and deletions as they happen,
rather than detecting them with some big diff-like process afterwards.
Database triggers, hooks into the CMS code, proxies on network
connections, things like that might work.

Do you just need to detect new items, or removed ones as well? Can you
affect the way the CMS stores things? If the former and yes, then add an
item number to each item, allocated from a counter. Keep track of the
highest number allocated so far. To find all items added since time t,
just look for all items with a number higher than the highest allocated
number at t. If you can't change the CMS code, you might be able to do
this by adding an autoincrementing ID column to the database underneath
it.
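A sketch of what the polling side might look like over JDBC - the
table and column names here are invented, but the query is just
"everything numbered above the high-water mark":

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Report every item added since the last run, using a monotonically
// increasing item number allocated by the store.
public class NewItemScanner {
    public static long reportNewItems(Connection conn, long lastSeen)
            throws SQLException {
        long highest = lastSeen;
        String sql = "SELECT item_no, uid FROM items WHERE item_no > ? ORDER BY item_no";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastSeen);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    highest = rs.getLong("item_no");
                    System.out.println("new uid: " + rs.getString("uid"));
                }
            }
        }
        return highest; // persist this as the high-water mark for next time
    }
}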
One obvious all-in-memory solution is to maintain a Map of uids. One
reads in the old Map, scans the CMS and for each uid writes an entry to
the Map. This will update any entry already there and create an entry
that needs to be created. Simple. But it just takes up too much memory.

No it doesn't. Unless you're really tight on memory, you should be able
to fit all the IDs in memory.

If memory is tight, you could look for ways to reduce the size of the ID
set.

Do the IDs have common prefixes? For instance, are they path-like, like
/articles/news/2009/january/26/wales/7539084? If so, you can store them
much more compactly in a trie (probably - it depends on exactly how much
commonality there is).
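A minimal sketch of the idea - a naive node-per-character trie; a
serious implementation would want a more compact node layout, and a
Patricia-style variant would collapse the unbranching runs:

import java.util.HashMap;
import java.util.Map;

// IDs that share prefixes such as "/articles/news/2009/..." store
// each common prefix once rather than once per ID.
public class UidTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if an ID ends at this node
    }

    private final Node root = new Node();

    // Inserts the ID; returns true if it was not already present.
    public boolean add(String id) {
        Node n = root;
        for (int i = 0; i < id.length(); i++) {
            n = n.children.computeIfAbsent(id.charAt(i), c -> new Node());
        }
        boolean added = !n.terminal;
        n.terminal = true;
        return added;
    }

    public boolean contains(String id) {
        Node n = root;
        for (int i = 0; i < id.length(); i++) {
            n = n.children.get(id.charAt(i));
            if (n == null) return false;
        }
        return n.terminal;
    }
}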

Are the IDs text strings? If so, they probably contain much less
information than arbitrary binary data; you could probably compress
them quite a bit - English text typically has ~1.5 bits of information
per character, so a
255-character key could typically be reduced to 48 bytes. A simple Huffman
code, or perhaps a second-order scheme of some sort, would do it. To avoid
having to make two passes to build the structure (one to collect
statistics to construct the code, and one to encode the strings),
construct a code upfront and apply the same code each time - it will
probably be good enough. If you wanted, you could update the code
alongside doing an encoding run - while applying the code to batch n,
you'd be building a code ready to use for n+1.
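Building the upfront code from a sample is a few lines with a priority
queue. A sketch - a real version would also need an escape mechanism
for characters that never occurred in the sample:

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Build a static Huffman code from a sample of uids, to be reused on
// later batches.
public class HuffmanCodeBuilder {
    private static final class Node implements Comparable<Node> {
        final long freq;
        final char ch;          // meaningful only for leaves
        final Node left, right;
        Node(long freq, char ch) { this(freq, ch, null, null); }
        Node(long freq, char ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
        boolean leaf() { return left == null; }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    public static Map<Character, String> buildCode(Iterable<String> sample) {
        Map<Character, Long> freqs = new HashMap<>();
        for (String uid : sample)
            for (int i = 0; i < uid.length(); i++)
                freqs.merge(uid.charAt(i), 1L, Long::sum);

        PriorityQueue<Node> heap = new PriorityQueue<>();
        freqs.forEach((ch, f) -> heap.add(new Node(f, ch)));
        while (heap.size() > 1) {           // merge the two rarest subtrees
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node(a.freq + b.freq, '\0', a, b));
        }

        Map<Character, String> code = new HashMap<>();
        assign(heap.poll(), "", code);
        return code;                         // e.g. 'e' -> "010"
    }

    private static void assign(Node n, String prefix, Map<Character, String> code) {
        if (n == null) return;
        if (n.leaf()) { code.put(n.ch, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", code);
        assign(n.right, prefix + "1", code);
    }
}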

You can probably combine the two approaches if you're clever. Compressed
tries - someone must have invented those.

tom
 
Tom Anderson

You can probably combine the two approaches if you're clever. Compressed
tries - someone must have invented those.

Yeah, you just compress and then build a trie. Duh.

I was thinking of something where you'd build a trie, but then identify
runs of nodes where there's no branching (stems, if you like), and express
those more compactly. Thinking about it, that's exactly what a Patricia
tree does.

There's also Hu-Tucker coding, which is a variant of Huffman coding where
the lexicographical order of the codes is the same as of the keys. If you
Hu-Tucker code and then build a trie, the leaves will be in the same order
as in an uncompressed trie, so you can do range queries (I think), which
you can't in a straightforward compressed trie.

tom
 
Roedy Green

Does anyone have any suggestions as to how I could detect
new uids, please?

Approach 1:
Let us say you can afford X bytes of RAM to speed detection.

You use it for a giant BitSet, giving you 8 one-bit bins per byte.

Then you take some sort of hash/digest of your giant CID. See
http://mindprod.com/jgloss/hashCode.html
http://mindprod.com/jgloss/diggest.thtml

Then take the modulus relative to X*8.

Use this index to turn that bit on.

Store the CID with its modulus in an SQL database, indexed by modulus.

Whenever you decide you need to process new elements, skim the bitset
for ones. Then ask the SQL database to hand over all CIDs with that
index. Check them for newness.
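In code, the bit-filter half of this comes out roughly as follows.
SHA-1 is just a stand-in for whatever digest you pick:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// X bytes of RAM give X*8 one-bit bins.
public class SeenFilter {
    private final BitSet bits;
    private final int size;

    public SeenFilter(int bytesOfRam) {
        this.size = bytesOfRam * 8;
        this.bits = new BitSet(size);
    }

    // Hash the CID down to a bit index: digest, then modulus by X*8.
    public int indexOf(String cid) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(cid.getBytes(StandardCharsets.US_ASCII));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return (int) Math.floorMod(h, (long) size);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-1 is always available
        }
    }

    // False means the CID is definitely new; true means "maybe seen",
    // so fetch the CIDs stored under this index from the database and
    // check them for newness.
    public boolean maybeSeen(String cid) {
        int i = indexOf(cid);
        boolean seen = bits.get(i);
        bits.set(i);
        return seen;
    }
}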

Approach 2:

Store your CIDs in an SQL database. Timestamp them and index by
timestamp. Store a bit with them to record whether they have been
processed already as new.

Approach 3:

Store your CIDs in an SQL database. Put the new ones in a separate
table. After they have been processed and are no longer considered
new, move them to the main table.

I think Approach 3 will prove by far the fastest.
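The promotion step of Approach 3 is a single transaction, something
like this (table names invented; a real version would move only the
rows actually processed):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Move CIDs that are no longer considered new out of the new_cids
// table and into known_cids, atomically.
public class PromoteProcessed {
    public static void promote(Connection conn) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("INSERT INTO known_cids (cid) SELECT cid FROM new_cids");
            st.executeUpdate("DELETE FROM new_cids");
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}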
--
Roedy Green Canadian Mind Products
http://mindprod.com

If everyone lived the way people do in Vancouver, we would need three more entire planets to support us.
~ Guy Dauncey
 
Seamus MacRae

Andrew said:
Hello,

I am working on a system that stores items in a CMS. Each item is
indexed by a unique id. The uid can be quite large (255 bytes). We
will be having several million items in the CMS. Items occasionally
get deleted from the CMS.

That's not good. That's more or less equivalent to "links to parts of
our web site occasionally get broken", which in turn hurts traffic,
which in turn hurts revenue (assuming you do something to monetize
traffic, such as host ads).

Please try to avoid gratuitously breaking links. It annoys users, who
drive traffic, and those that link to you, who drive even more traffic
(especially when they're search engines).
I need to write a program that watches the
CMS to see what new uids appear there. The size of a uid and the sheer
number of them prevent all the uids from being held in memory at the
same time. Does anyone have any suggestions as to how I could detect
new uids, please?

You need something like a HashSet on disk.

You might try taking the first few bytes of the UIDs, as hex strings
like "07ff" or "3f2a", and using those as the names of files that
contain serialized HashSets. When a UID is seen, the first few bytes are
used to pick a HashSet to load, by a method that returns an empty
HashSet if the file doesn't yet exist. This set is then searched for the
UID. If it's present, some code is called. If it's absent, some other
code is called, and the HashSet has the UID added and is written to the
appropriately-named file.
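A sketch of that scheme - except that it hashes the UID to pick one of
256 bucket files rather than using the literal leading bytes, so the
buckets stay balanced even if many UIDs share a common prefix:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;

// A HashSet on disk: only one bucket's worth of UIDs is ever in
// memory at a time.
public class DiskHashSet {
    private final Path dir;

    public DiskHashSet(Path dir) throws IOException {
        this.dir = dir;
        Files.createDirectories(dir);
    }

    private Path bucketFile(String uid) {
        int bucket = uid.hashCode() & 0xFF; // one of 256 buckets
        return dir.resolve(String.format("%02x.ser", bucket));
    }

    @SuppressWarnings("unchecked")
    private HashSet<String> load(Path file) throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) return new HashSet<>(); // empty if absent
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashSet<String>) in.readObject();
        }
    }

    // Returns true if the UID was new, persisting it as a side effect.
    public boolean addIfNew(String uid) throws IOException, ClassNotFoundException {
        Path file = bucketFile(uid);
        HashSet<String> set = load(file);
        if (set.contains(uid)) return false;
        set.add(uid);
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(set);
        }
        return true;
    }
}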
One obvious all-in-memory solution is to maintain a Map of uids. One
reads in the old Map, scans the CMS and for each uid writes an entry
to the Map. This will update any entry already there and create an
entry that needs to be created. Simple. But it just takes up too much
memory.

If you want to associate something with each UID, replace HashSet above
with HashMap and have the callback code provide a value to associate
with the UID in the map; write the map back even when the UID was
found, if the callback returned a different value than was previously
associated with that UID.
 
Frank Cisco

Andrew said:
Hello,

I am working on a system that stores items in a CMS. Each item is
indexed by a unique id. The uid can be quite large (255 bytes). We
will be having several million items in the CMS. Items occasionally
get deleted from the CMS. I need to write a program that watches the
CMS to see what new uids appear there. The size of a uid and the sheer
number of them prevent all the uids from being held in memory at the
same time. Does anyone have any suggestions as to how I could detect
new uids, please?

One obvious all-in-memory solution is to maintain a Map of uids. One
reads in the old Map, scans the CMS and for each uid writes an entry
to the Map. This will update any entry already there and create an
entry that needs to be created. Simple. But it just takes up too much
memory.

I am wondering how efficient it might be to append all the uids
from the current CMS to a file that contains the last collection of
uids. Then sort -u. There will be millions of records. Any thoughts?

Regards,

Andrew Marlow

Why do you need 255 bytes for a unique id of a collection of several
million?
 
Andrew

How about using an RDBMS? Most have the ability to create triggers
that fire on changes (inserts, in your case).
That really sounds like a good candidate for an RDBMS. Doesn't your CMS
already use one? Look into triggers for it.

The CMS does not use an RDBMS. The CMS does have triggers, but the
triggers can only do things in the CMS world. I do intend to store the
uids in an RDBMS; the question is how to detect that new uids have
arrived in the CMS.
 
Andrew

According to Andrew:
The uid can be quite large (255 bytes). We will be having several
million items in the CMS. [...] The size of a uid and the sheer number
of them prevent all the uids from being held in memory at the same
time.

Does it?
Yes.

Even if all UIDs have their maximum length (255 bytes), you
can store 4 million of them in a gigabyte of RAM.

That happens to be too much in my case.
Even cheap PCs have at
least that much RAM; my office PC has 8 GB.

So does mine. But it cannot all be grabbed by one app.
Note that the "sort" Unix utility sorts _lines_ of _text_. If your
UIDs are arbitrary bytes, you would need some escaping mechanism to
avoid spurious end-of-lines.

Not needed in my case. The uids are actually DOIs (from the world of
digital publishing) which are ASCII text.
 
Andrew

Any CMS should be able to tell you this sort of thing, surely?

Yes. This CMS even has triggers. But I want the uids to be recorded in
a RDBMS. The CMS triggers have no way to do that.
 
Andrew

That's not good. That's more or less equivalent to "links to parts of
our web site occasionally get broken"

No it's not. The items are actually published content and sometimes
content has to be 'unpublished'. While an item is online it may be
visited. My app is one that counts published article visits, so I
need to record the uid (and its title, etc.) even if the article is
deleted later. This is not a bug; this is normal business
functionality. There are no broken links.
You need something like a HashSet on disk.

Yes.
 
Lew

Andrew said:
According to Andrew:
The uid can be quite large (255 bytes). We will be having several
million items in the CMS. [...] The size of a uid and the sheer number
of them prevent all the uids from being held in memory at the same
time.
Does it?
Yes.

Even if all UIDs have their maximum length (255 bytes), you
can store 4 million of them in a gigabyte of RAM.

That happens to be too much in my case.
Even cheap PCs have at
least that much RAM; my office PC has 8 GB.

So does mine. But it cannot all be grabbed by one app.

Modern OSes have a lovely little thing called virtual memory that allows
programs to access the full address space, even when other programs are
also doing so, even when there isn't that much physical RAM on the system.
 
Seamus MacRae

Andrew said:
No it's not.

Yes it is.
The items are actually published content and sometimes
content has to be 'unpublished'.

Which would break any links to it. That's bad, mmkay?
This is not a bug; this is normal business functionality.

Apparently, "normal business functionality" for you involves taking
things away from the people sometimes, which, unless you're talking
repossessing collateral for a loan or something, is bad, mmkay?
There are no broken links.

Sure there are. If you publish something, someone (e.g. Google) links to
it, and then you "unpublish" it, those links break.

People WILL deep-link to individual internal pages on your web site, and
you should encourage and welcome this as people much prefer to go
straight to a relevant page rather than land at your front page and then
have to find their way to the relevant page using your site's internal
navigation system. Like it or not, that's simply the way it is.
Furthermore, storage space is cheap. There's no reason not to keep an
archive, and no reason to remove or expire things from it, in this day
and age; though if something is outdated you might want to add an
addendum, a correction, or a link to newer material, or to bring it up
to date.

If you prefer people to have to navigate from your front page to your
internal pages so they can be blitzed with lots more advertising than if
they landed straight there, then too bad. People will jump directly to,
and bookmark, and post links to, your internal pages. If you keep
deleting or moving things to frustrate them, they will surf elsewhere
and not bother with your site. That is simply the nature of the Web
user, and though you might rail against it you cannot change it.
 
Andrew

I was sort-of-assuming any decent CMS was layered over an RDBMS
that could be interrogated directly.

   BugBear

Not this one. In fact, the more I learn about this CMS, the less I like
it. I have been looking at an open source digital library
implementation, DSpace, which does use an RDBMS. It would be much
easier to solve the problem with that as a starting point.
 
Tom Anderson

The CMS does not use a RDBMS.

What does it have? Files?

How are you getting the IDs out?
The CMS does have triggers but the
triggers can only do things in the CMS world.

What kinds of things are those? How do you specify the trigger actions?

tom
 
Tom Anderson

The uids are actually DOIs (from the world of digital publishing) which
are ASCII text.

Those are definitely quite compressible. The trouble is that you may have
to use different compression schemes for each registrant. You can strip
off the leading "10.", because that's always the same, and you can convert
the following digit string to some more compact identifier - I would
suggest using a Huffman code here, so that more common registrants get
shorter codes.

You should then switch to a registrant-specific scheme. For example,
Nature papers have a part which looks like "nature04924", so strip the
"nature" and encode the five digits in 17 bits. Or, since so far all of
the first digits are zero, encode the remaining four digits in fourteen
bits. News articles and preprints have a different format, so you'll need
a couple of bits to distinguish those, and schemes for encoding their
digits.

That's likely to involve a huge amount of work, so a simpler approach
would be to use a normal Huffman code, but have a different codebook for
each registrant. It wouldn't take advantage of structure, as in the Nature
example, but it would still take a lot less than eight (or sixteen!) bits
per character.
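The outer layer of that is mechanical. A sketch of the split, with the
per-registrant codebooks - the interesting part - left for later:

import java.util.HashMap;
import java.util.Map;

// Split a DOI such as "10.1038/nature04924" into a small registrant
// index plus a suffix. A Huffman code over the indices (short codes
// for common registrants) and a per-registrant code for the suffix
// would go on top of this.
public class DoiSplitter {
    public static final class Parts {
        public final int registrantIndex;
        public final String suffix;
        Parts(int registrantIndex, String suffix) {
            this.registrantIndex = registrantIndex;
            this.suffix = suffix;
        }
    }

    private final Map<String, Integer> registrants = new HashMap<>();

    public Parts split(String doi) {
        if (!doi.startsWith("10.") || doi.indexOf('/') < 0)
            throw new IllegalArgumentException("not a DOI: " + doi);
        int slash = doi.indexOf('/');
        String registrant = doi.substring(3, slash); // drop the constant "10."
        String suffix = doi.substring(slash + 1);
        int index = registrants.computeIfAbsent(registrant, r -> registrants.size());
        return new Parts(index, suffix);
    }
}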

tom
 
