look up very large table


ela

I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia
.....

each row may come from a big file of >100M (as aforementioned):

CITY Beijing
NOTE Capital
RACE Chinese
....

And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Any advice? Or should I resort to some other language?
 

Jim Gibson

ela said:
I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

There is no benefit that I can see to concatenating the files. Use the
File::Find module to find all files with a certain naming convention,
read each one, and process the information in each file. As far as the
amount of information that Perl can handle, that is mostly determined
by the available memory and how smart you are at condensing the data,
keeping only what you need and throwing away stuff you don't need.
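
For instance, a rough sketch of that approach (the directory path and the
.txt naming pattern are only guesses, and I'm assuming the archives have
already been unpacked to plain text):

use strict;
use warnings;
use File::Find;

# Collect every file under the data directory that matches the pattern.
my @files;
find( sub { push @files, $File::Find::name if /\.txt$/ }, '/path/to/data' );

for my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # process one line at a time, keeping only what you need
    }
    close $fh;
}
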
The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia

Perl does not have tables. It has arrays and hashes. You can nest
arrays and hashes to store complex datasets in memory by using
references.
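
For example, using the ID and field names from your sample:

use strict;
use warnings;

my %info;    # ID => reference to a hash of field/value pairs
$info{'X2.3'} = { CITY => 'Beijing', NOTE => 'Capital', RACE => 'Chinese' };

print $info{'X2.3'}{CITY}, "\n";    # prints "Beijing"
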
....

each row may come from a big file of >100M (as aforementioned):

CITY Beijing
NOTE Capital
RACE Chinese
...

And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Any advice? Or should I resort to some other language?

Try reading all the files and saving the data you want. If you run out
of memory, then think about a different approach. 32GB of memory is
quite a lot.

If you can't fit all of your data into memory at one time, you might
consider using a database that will store your data in files. Perl has
support for many databases. But I would first determine whether or not
you can fit everything in memory.
 

ccc31807

Any advice? Or should I resort to some other language?

Perl is probably your best bet for this task.
I have some large data in pieces, e.g.
asia.gz.tar 300M
or
roads1.gz.tar 100M

It might be helpful for you to give a sample of your data format. You
don't mention untarring and unzipping your file, so I assume that you
are dealing with ASCII text. If not, then some of the following might
not work well.
I wonder whether I should concatenate them all into a single ultra large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

Irrelevant question. Ordinarily you process files one line at a time,
so it doesn't make any difference how large a particular file is, as
long as each line can be manipulated. In cases where I have to deal
with a number of files, I find it easier to glob the files, or open
and read a directory, to automate the process of opening, reading, and
closing a number of files. You might gain something in particular
cases by combining files, but I don't see any general advantage in
doing so.
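
For example, something like this (the path and the .txt extension are
guesses on my part):

use strict;
use warnings;

for my $file (glob '/path/to/data/roads*.txt') {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # munge one line at a time
    }
    close $fh;
}
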
each row may come from a big file of >100M (as aforementioned):
CITY    Beijing
NOTE    Capital
RACE    Chinese
...

Typical data munging. Depending on whether you have duplicates, I
would probably build a hash and write the hash to your output file.
You then have the ability to sort on different fields, e.g., cities,
notes, races, etc.
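
Roughly like this, assuming the munged data ends up in a hash of hashes
keyed on ID (the field names come from your sample; the values for X1 are
invented):

use strict;
use warnings;

my %info = (
    'X2.3' => { CITY => 'Beijing',   NOTE => 'Capital', RACE => 'Chinese' },
    'X1'   => { CITY => 'San Diego', NOTE => 'example', RACE => 'example' },
);

open my $out, '>', 'munged.txt' or die "Cannot open munged.txt: $!";
for my $id (sort { $info{$a}{CITY} cmp $info{$b}{CITY} } keys %info) {
    print {$out} join("\t", $id, @{ $info{$id} }{qw(CITY NOTE RACE)}), "\n";
}
close $out;
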
And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Hashing is ideal, provided you can link the two files by a common
key. The general technique is to open the ID file first and build a
hash of record IDs, then open your data file and populate the hash
entries with data keyed on those IDs. Then open your output file and
print to it.
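
A rough sketch of that technique -- all file names and the tab-separated
layouts below are assumptions on my part:

use strict;
use warnings;

# Assumed inputs:
#   ids.txt   - one ID per line (the small ~20M file)
#   table.txt - tab-separated "ID<TAB>INFO" lines produced from the big files
my %info;

# Pass 1: build a hash keyed on the IDs you care about.
open my $ids, '<', 'ids.txt' or die "Cannot open ids.txt: $!";
while (my $id = <$ids>) {
    chomp $id;
    $info{$id} = '';               # placeholder, filled in below
}
close $ids;

# Pass 2: populate the hash from the data, keeping only known IDs.
open my $data, '<', 'table.txt' or die "Cannot open table.txt: $!";
while (my $line = <$data>) {
    chomp $line;
    my ($id, $note) = split /\t/, $line, 2;
    $info{$id} = $note if exists $info{$id};
}
close $data;

# Pass 3: print the annotated output.
open my $out, '>', 'annotated.txt' or die "Cannot open annotated.txt: $!";
print {$out} "$_\t$info{$_}\n" for sort keys %info;
close $out;

With only ~100,000 IDs the hash itself stays small; it is the big data that
gets read line by line.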

If you will use the data frequently, you might want to stuff the data
into a database so you can query it conveniently.

If you want help, please be sure to furnish both sample data from each
file and your attempts at writing the script.

CC.
 

John Bokma

ela said:
I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia
....

each row may come from a big file of >100M (as aforementioned):

CITY Beijing
NOTE Capital
RACE Chinese
...

And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Any advice? Or should I resort to some other language?

How about importing all your data into a database and using SQL to
extract what you want? Depending on the format of your input files, some
parsing might be required, which can be done with a small Perl
program.
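
Something like this, say, with DBI and the SQLite driver (the database file,
table, and column names are all made up here, and DBD::SQLite has to be
installed):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=geo.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS info (id TEXT PRIMARY KEY, info TEXT)');

# Bulk-load one row per parsed record inside a single transaction.
$dbh->begin_work;
my $ins = $dbh->prepare('INSERT OR REPLACE INTO info (id, info) VALUES (?, ?)');
$ins->execute('X2.3', 'Beijing; China; Asia');   # ...one call per record...
$dbh->commit;

# Annotate the small ID list with simple lookups (or a JOIN against a second
# table holding the IDs).
my ($info) = $dbh->selectrow_array('SELECT info FROM info WHERE id = ?',
                                   undef, 'X2.3');
print "$info\n";

$dbh->disconnect;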
 

Jürgen Exner

ela said:
I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra large
file

I may be mistaken, but isn't that a prerequisite to actually extract any
data from a compressed (.gz) file?
and then parse them into a large table (I don't know whether
Perl can handle that...).

The hardware is the limit.
The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia
....

each row may come from a big file of >100M (as aforementioned):

CITY Beijing
NOTE Capital
RACE Chinese
...

And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Depends. It's easy enough to do, so you can just try whether it works.
Any advice? Or should I resort to some other language?

If you are hardware limited at all, the system will eventually begin
swapping, and that will happen in any language if you try to keep too
much data in RAM.
If that happens you will have to revert to time-proven techniques from
the dark ages: trade HD space and time for RAM by keeping only one set
of data in RAM and annotating that set while processing the second set
of data from the HD line by line.
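
In Perl that might look roughly like this -- the file names, the blank-line
record separator, and the per-record ID line are all assumptions about data
I haven't seen:

use strict;
use warnings;

# The small set stays in RAM. In practice, load the ~100,000 IDs from the
# 20M file; two literal IDs stand in for that here.
my %anno = map { $_ => '' } qw(X1 X2.3);

# Stream the huge file record by record; only one record is in RAM at a time.
{
    local $/ = '';        # paragraph mode: blank-line-separated records
    open my $big, '<', 'asia.txt' or die "Cannot open asia.txt: $!";
    while (my $record = <$big>) {
        my %field = map { split /\s+/, $_, 2 } grep { /\S/ } split /\n/, $record;
        my $id = $field{ID};    # hypothetical ID line in each record
        next unless defined $id and exists $anno{$id};
        $anno{$id} = join '; ', grep { defined } @field{qw(CITY NOTE RACE)};
    }
    close $big;
}

print "$_\t$anno{$_}\n" for sort keys %anno;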

However, the real solution would be to load the whole enchilada into a
database and then do whatever join you want to do. There is a reason why
database systems have been created and optimized for exactly such tasks.

jue
 

sln

I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

The final table should look like this:
[snip examples that don't convey much info]
Any advice? Or should I resort to some other language?

Yes, go back to the database that produced these files
and run a different query to get the info you need.

If you're not still with the company that owns this information,
I suggest you contact them for permission to use it.

-sln
 

Xho Jingleheimerschmidt

ela said:
I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

The data is first gzipped and then tarred? That is an odd way of doing
things.
I wonder whether I should concatenate them all into a single ultra large
file

I see no reason to do that, especially as I don't think the tar format
supports that cleanly. Does it?
and then parse them into a large table (I don't know whether
Perl can handle that...).

I bet it can.
The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia
.....

each row may come from a big file of >100M (as aforementioned):

CITY Beijing
NOTE Capital
RACE Chinese
....

What is the "...." hiding? 100M is an awful lot of "....".

Each file is turned into only one row? And each file is 100M? So how
many rows do you anticipate having?

And then I have another much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing seems not to be a solution
on my 32 GB, 8-core machine...

Why not?
Any advice? Or should I resort to some other language?

Your description is too vague to give any reasonable advice.


Xho
 

Xho Jingleheimerschmidt

Jürgen Exner said:
However the real solution would be to load the whole enchilada into a
database and then do whatever join you want to do. There is a reason why
database system have been created and optimized for exactly such tasks.

Database systems are generally created for atomicity, consistency,
isolation, and durability, which is quite a bit more than this task
seems to consist of. It is my general experience that in this type of
task, a Perl script could be written and have completed its job while
the database system is still tying its shoes.


Xho
 

Jürgen Exner

Xho Jingleheimerschmidt said:
Database systems are generally created for atomicity, consistency,
isolation, and durability, which is quite a bit more than this task
seems to consist of. It is my general experience that in this type of
task, a Perl script could be written and have completed its job while
the database system is still tying its shoes.

Certainly true. But they are also designed to handle vast amounts of
data efficiently. And if the OP indeed runs into space issues then a DB
system may (just may!) provide an easier to use and even faster
alternative to looping through files over and over again.

Whether it actually is an advantage or not is hard to tell. I agree, the
OP's task seems to be easy enough to be solved in a single pass. But the
description was rather cryptic, too, so there might be more
cross-referencing going on than either of us is expecting at this time.

jue
 

ccc31807

I'd only add that often databases aren't a very good choice for data
that doesn't change. Databases are way overused, IMO.

Often, an index structure built out of flat files is superior (it doesn't
have all that extra baggage of servers, code to carefully insert/update,
and such).

Sometimes, you just can't beat a set of well-formed text files on
top of a carefully planned directory. :)

Amen, amen, and amen. In my line of work, I see Access almost
exclusively used as a productivity tool (e.g., to produce reports) and
never (almost) as a database. Personally, I have created hundreds if
not a thousand or two of Access databases, and I can't recall using a
Primary Key in Access.

The same thing can also be said for Excel.

I totally agree that many times using some kind of delimited file is
much easier, simpler, and faster than using a database. For one-time
processing, building a data structure in memory is also much easier,
simpler, and faster.

CC.
 

Peter J. Holzer

I'd only add that often databases aren't a very good choice for data
that doesn't change. Databases are way overused, IMO.

Often, an index structure built out of flat files is superior (it doesn't
have all that extra baggage of servers, code to carefully insert/update,
and such).

Sometimes, you just can't beat a set of well-formed text files on
top of a carefully planned directory. :)

You are confusing databases with RDBMSs.

A database is a collection of structured data.

A relational database is a database where the data is organized into
"relations", i.e., tables with certain properties (columns are ordered
and have a type and usually a name, the order of rows doesn't matter, etc.)

A relational database management system is a program for managing
relational databases - it lets you insert, update and query the data,
enforces consistency, maintains indexes, implements transactions, etc.

hp
 
