suitable key for a hash

ccc31807

I have a data file to process that consists of about 25K rows and
about 30 columns. This file contains no column with unique values,
that is, every column contains duplicate values. I am placing the data
in a hash to process it (so I can access the data values by name
rather than position), and the only 'key' I can come up with is the $.
variable for the input line numbers.

Surely someone must have dealt with this problem before. Is there a
better solution?

The processing requires dumping the data into discrete categories,
e.g., level, state, person's name, status, for the purpose of
generating reports, e.g., by level, by state, by name, by status, and
not having a unique key isn't an issue.

CC.
 
RedGrittyBrick

I have a data file to process that consists of about 25K rows and
about 30 columns. This file contains no column with unique values,
that is, every column contains duplicate values. I am placing the data
in a hash to process it (so I can access the data values by name
rather than position), and the only 'key' I can come up with is the $.
variable for the input line numbers.

Surely someone must have dealt with this problem before. Is there a
better solution?

A better solution than
... $name{$index} ...
must surely be
... $name[$index] ...

I don't see any point using hashes if the key value is an integer in the
range 1..25000 with no gaps.

The processing requires dumping the data into discrete categories,
e.g., level, state, person's name, status, for the purpose of
generating reports, e.g., by level, by state, by name, by status, and
not having a unique key isn't an issue.

An SSCCE would help.
 
Jim Gibson

ccc31807 said:
I have a data file to process that consists of about 25K rows and
about 30 columns. This file contains no column with unique values,
that is, every column contains duplicate values. I am placing the data
in a hash to process it (so I can access the data values by name
rather than position), and the only 'key' I can come up with is the $.
variable for the input line numbers.

Surely someone must have dealt with this problem before. Is there a
better solution?

If you have records with duplicate keys and you want to store the data
in a hash for rapid lookup, use array references as hash values
(untested):

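while (<>) {
    # First field is the (non-unique) key; keep the rest of the row.
    my ( $name, @rest ) = split;
    # Rows sharing a key accumulate in an array under that key.
    push @{ $data{$name} }, \@rest;
}
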
The processing requires dumping the data into discrete categories,
e.g., level, state, person's name, status, for the purpose of
generating reports, e.g., by level, by state, by name, by status, and
not having a unique key isn't an issue.

Store the data in an array and create indices for key fields (untested):

while (<>) {
    my @fields = split;
    push @data, \@fields;
    # $#data is the row just added; file it under each key field's value.
    push @{ $field1_index{ $fields[0] } }, $#data;
    push @{ $field2_index{ $fields[1] } }, $#data;
    ...
}
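
For illustration, here is a minimal self-contained sketch of the index idea above, including a lookup. The sample rows and the column layout (name, state, level) are made up, and the names %state_index and %level_index are only placeholders; substitute your own fields.

```perl
use strict;
use warnings;

# Hypothetical sample rows: name, state, level (whitespace-separated).
my @lines = (
    "alice GA 1",
    "bob   GA 2",
    "carol NY 1",
);

my ( @data, %state_index, %level_index );

for my $line (@lines) {
    my @fields = split ' ', $line;
    push @data, \@fields;

    # Record the row number under each key value; duplicates simply accumulate.
    push @{ $state_index{ $fields[1] } }, $#data;
    push @{ $level_index{ $fields[2] } }, $#data;
}

# Report by state: look up the row numbers, then fetch the rows.
for my $state ( sort keys %state_index ) {
    my @names = map { $data[$_][0] } @{ $state_index{$state} };
    print "$state: @names\n";
}
```

Each row is stored once in @data; the indices hold only row numbers, so adding another report key costs one more hash and one more push.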
 
Xho Jingleheimerschmidt

ccc31807 said:
I have a data file to process that consists of about 25K rows and
about 30 columns. This file contains no column with unique values,
that is, every column contains duplicate values.


Jointly, or just severally?

I am placing the data
in a hash to process it (so I can access the data values by name
rather than position),

If you wish to access it by name, then you must know what the name is.

and the only 'key' I can come up with is the $.
variable for the input line numbers.

Why not just an array, in that case?

Surely someone must have dealt with this problem before. Is there a
better solution?

The processing requires dumping the data into discrete categories,
e.g., level, state, person's name, status, for the purpose of
generating reports, e.g., by level, by state, by name, by status, and
not having a unique key isn't an issue.

Ok, so just stick it directly into those structures.

Xho
 
Justin C

I have a data file to process that consists of about 25K rows and
about 30 columns. This file contains no column with unique values,
that is, every column contains duplicate values. I am placing the data
in a hash to process it (so I can access the data values by name
rather than position), and the only 'key' I can come up with is the $.
variable for the input line numbers.

Surely someone must have dealt with this problem before. Is there a
better solution?

The processing requires dumping the data into discrete categories,
e.g., level, state, person's name, status, for the purpose of
generating reports, e.g., by level, by state, by name, by status, and
not having a unique key isn't an issue.

Instead of sticking it into a hash so that you can go over all of it
again, why not process (or part process) it into the relevant discrete
categories as part of the import?
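
A rough sketch of what that could look like, assuming whitespace-separated input. The column names, the sample data, and the %by structure are all invented for the example:

```perl
use strict;
use warnings;

# Hypothetical column names and sample data, for illustration only.
my @columns = qw(name state level status);
my @lines   = (
    "alice GA 1 active",
    "bob GA 2 closed",
    "carol NY 1 active",
);

my %by;    # $by{column}{value} = [ record, record, ... ]

for my $line (@lines) {
    my %rec;
    @rec{@columns} = split ' ', $line;    # hash slice: name the fields

    # File the same record under every category of interest.
    push @{ $by{$_}{ $rec{$_} } }, \%rec for qw(state level status);
}

# One report: names grouped by status.
for my $status ( sort keys %{ $by{status} } ) {
    my @names = map { $_->{name} } @{ $by{status}{$status} };
    print "$status: @names\n";
}
```

No per-row key is needed at all: each record is reachable only through the categories it belongs to, and the fields are still accessed by name.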

Justin.
 
ccc31807

Thanks for your reply, and for all the others.

I decided to continue to use $. as the hash key. As it turns out, the
key isn't relevant to my application, as I'm not using the key to look
up the hash values. I'm just iterating through the hash, collecting
certain values, so the key is totally superfluous -- the only reason I
need a key is because of the nature of the hash.

I don't want to use an array because I'm creating a number of
different reports, and it's simply a lot easier to use values like:

$data{$key}{firstname}, $data{$key}{lastname}

than it is to use values like

$data[13456][2], $data[23543][3]

An SSCCE would help.

I'm sorry, but I don't know this. What is an SSCCE?

CC
 
Dr.Ruud

I decided to continue to use $. as the hash key.

If it smells like an array index ...

As it turns out, the
key isn't relevant to my application, as I'm not using the key to look
up the hash values. I'm just iterating through the hash, collecting
certain values, so the key is totally superfluous -- the only reason I
need a key is because of the nature of the hash.

I don't want to use an array because I'm creating a number of
different reports, and it's simply a lot easier to use values like:

$data{$key}{firstname}, $data{$key}{lastname}

than it is to use values like

$data[13456][2], $data[23543][3]

That is not the proper comparison.

$data[ $row ]{ firstname }

$data[ $row ][ FIRSTNAME ]

(assumes a numeric constant FIRSTNAME)
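
A small sketch showing both forms side by side. The column positions and the sample rows are assumptions made up for the example:

```perl
use strict;
use warnings;

# Hypothetical column positions for the array-of-arrays form.
use constant { FIRSTNAME => 0, LASTNAME => 1 };

my @lines = ( "Ann Lee", "Bob Ray" );    # made-up sample rows

my ( @as_hashes, @as_arrays );
for my $line (@lines) {
    my @f = split ' ', $line;
    push @as_hashes, { firstname => $f[0], lastname => $f[1] };    # array of hashes
    push @as_arrays, [@f];                                         # array of arrays
}

# Either way the row is an array index and the column has a name.
for my $row ( 0 .. $#lines ) {
    print $as_hashes[$row]{firstname}, " ", $as_arrays[$row][LASTNAME], "\n";
}
```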

What is an SSCCE?

JFGI
 
Jürgen Exner

ccc31807 said:
I don't want to use an array because I'm creating a number of
different reports, and it's simply a lot easier to use values like:

$data{$key}{firstname}, $data{$key}{lastname}

than it is to use values like

$data[13456][2], $data[23543][3]

And why not use values like

$data[$key]{firstname}, $data[$key]{lastname}

jue
 
ccc31807

And why not use values like

        $data[$key]{firstname}, $data[$key]{lastname}

Because I wasn't completely truthful about my processing. I have to
break the data apart on various values, some of which are unique keys,
e.g., identification numbers for individual people. The data includes
clients and counselors, and (obviously) clients can have multiple
counselors and counselors can have multiple clients. Other values are
one of a kind, such as a person's address, regardless of the number of
times the particular person appears in the data. I have to cross
reference these values by unique keys, and I use five hashes to sort
out the data.

I see now that I could use an array for the handful of data elements
for each row that are unique.
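
For anyone reading later, a stripped-down sketch of that kind of cross-referencing. This is not the real code: the three-column layout (client ID, counselor ID, client city), the data, and the hash names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical layout: client_id counselor_id client_city (made-up data).
my @lines = (
    "c1 k1 Atlanta",
    "c1 k2 Atlanta",
    "c2 k1 Macon",
);

my ( %client, %counselors_of, %clients_of );

for my $line (@lines) {
    my ( $cid, $kid, $city ) = split ' ', $line;
    $client{$cid}{city}        = $city;    # one-of-a-kind value, keyed by unique ID
    $counselors_of{$cid}{$kid} = 1;        # many-to-many, client to counselors
    $clients_of{$kid}{$cid}    = 1;        # and counselor to clients
}

# Report: each counselor with the clients seen, plus the client's city.
for my $kid ( sort keys %clients_of ) {
    my @ids = sort keys %{ $clients_of{$kid} };
    print "$kid: ", join( ", ", map { "$_ ($client{$_}{city})" } @ids ), "\n";
}
```

Here the hash keys are real identifiers, so repeated rows for the same person collapse naturally, which is where a hash earns its keep over an array.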

Thanks, CC.
 
