This is not a Java-specific question, but it is definitely meant for
programmers, so I'm thinking I'm at least half on-topic here.
I'm trying to get a firmer grasp of B+-trees as used for indexing.
I have an input text file with around a thousand lines like the following:
0|Item65| leaguers rebuked realizing southwest haply <....> |81.44|1
1|Item55| Portugal movings grooms backstitching octane <.....> |86.8|3
Each line has a unique line number, an item label, a list of words
associated with the item, a price, and an inventory level.
The assignment spec is here: http://cs.ecs.baylor.edu/~speegle/5335/indexes.html
I think that a purely in-RAM implementation means I'd have my own node
structures into which I insert the data from the input file.
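Something like this is what I have in mind (all the names are mine, just
for illustration):

    import java.util.*;

    // Sketch of the in-memory node structures I'm picturing.
    abstract class Node {
        List<String> keys = new ArrayList<>();       // keywords, kept sorted
    }

    class InternalNode extends Node {
        List<Node> children = new ArrayList<>();     // keys.size() + 1 children
    }

    class LeafNode extends Node {
        // For each keyword, the line numbers on which it occurs.
        List<List<Integer>> postings = new ArrayList<>();
        LeafNode next;                               // sibling link for scans
    }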
But what if I need to make each node its own file, with the pointers
being filenames? Am I still supposed to load the data into node
structures, or is the point to rely on the OS to load each page as a
disk block?
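To make that concrete, here's a sketch of what I understand by filename
pointers; the one-record-per-line node file format is my own invention,
not something the assignment specifies:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    // Each node lives in its own file; "pointers" are names of other node files.
    class DiskNode {
        boolean leaf;
        List<String> keys = new ArrayList<>();
        List<String> childFiles = new ArrayList<>(); // filenames instead of references

        // Parse a node file: "LEAF" marker, then "K <key>" and "C <childfile>" lines.
        static DiskNode load(Path dir, String name) throws IOException {
            DiskNode n = new DiskNode();
            for (String line : Files.readAllLines(dir.resolve(name))) {
                if (line.equals("LEAF"))        n.leaf = true;
                else if (line.startsWith("K ")) n.keys.add(line.substring(2));
                else if (line.startsWith("C ")) n.childFiles.add(line.substring(2));
            }
            return n;
        }

        // Descend from the root, reading one node file per level.
        static DiskNode find(Path dir, String rootFile, String key) throws IOException {
            DiskNode n = load(dir, rootFile);
            while (!n.leaf) {
                int i = 0;
                while (i < n.keys.size() && key.compareTo(n.keys.get(i)) >= 0) i++;
                n = load(dir, n.childFiles.get(i));
            }
            return n;
        }
    }

A lookup then costs one file read per level, which I assume is the point
of sizing each node to a disk block.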
Also, the specifications just say that each node can have from 50 to 200
pointers. That makes sense to me for non-leaf nodes: with a disk block
size of 4096 bytes, and 20 bytes being enough to store a keyword plus a
pointer from which I can derive the filename, 200 entries take
200 × 20 = 4000 bytes, which fits in one block.
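Spelled out as a quick check (the 20-byte figure is my own estimate):

    public class FanoutCheck {
        public static void main(String[] args) {
            int blockSize = 4096;   // one disk block
            int entryBytes = 20;    // my estimate: keyword prefix + child file id
            // 4096 / 20 = 204, so a fanout of 200 fits in one block.
            System.out.println("max entries per block: " + blockSize / entryBytes);
        }
    }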
But I'm not sure whether that variable number of pointers applies to
leaf nodes (or bucket files) as well. After analyzing the data, e.g.
finding the highest number of distinct lines a single keyword occurs on
and the longest keyword, I'm thinking it doesn't, because 50 keywords
together with the lists of line numbers on which each occurs wouldn't
fit in one disk block, let alone 200 keywords.
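Here's the worst-case arithmetic I'm basing that on; the two max values
are what I measured on my particular data, so treat them as assumptions:

    public class BucketSizeCheck {
        public static void main(String[] args) {
            int blockSize = 4096;
            int maxKeywordLen = 20;   // longest keyword I found (assumption)
            int maxOccurrences = 40;  // most distinct lines for one keyword (assumption)
            int bytesPerLineNo = 4;   // line number stored as an int

            // Worst case for one leaf entry: keyword plus its full posting list.
            int entryBytes = maxKeywordLen + maxOccurrences * bytesPerLineNo; // 180 bytes
            // 50 * 180 = 9000 bytes, already more than one 4096-byte block.
            System.out.println("50 entries need " + 50 * entryBytes
                    + " bytes (block is " + blockSize + ")");
        }
    }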
What I would like to know is, am I doing it right?
The quick analysis of the data was done programmatically using
hashtables. Is that "cheating", or is making such assumptions about the
data a necessary part of designing the B+-tree?
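For reference, the analysis pass is just one hashtable keyed by keyword;
the input filename here is a placeholder:

    import java.nio.file.*;
    import java.util.*;

    // Quick pass over the input: for each keyword, the set of distinct line
    // numbers it appears on, plus the longest keyword seen.
    public class Analyze {
        public static void main(String[] args) throws Exception {
            Map<String, Set<Integer>> occurrences = new HashMap<>();
            int longest = 0;
            for (String line : Files.readAllLines(Paths.get("items.txt"))) {
                String[] fields = line.split("\\|");       // lineNo|label|words|price|stock
                int lineNo = Integer.parseInt(fields[0]);
                for (String word : fields[2].trim().split("\\s+")) {
                    occurrences.computeIfAbsent(word, k -> new HashSet<>()).add(lineNo);
                    longest = Math.max(longest, word.length());
                }
            }
            int maxOcc = occurrences.values().stream()
                    .mapToInt(Set::size).max().orElse(0);
            System.out.println("longest keyword: " + longest
                    + ", max distinct-line occurrences: " + maxOcc);
        }
    }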
Thanks.
Rico.