Large Amount of Data

Jack

I need to process a large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
 
Matimus

Jack said:
> I need to process a large amount of data. The data structure fits well
> in a dictionary but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory or will it fail?
>
> Thanks.

The OS will take care of memory swapping. It might get slow, but I
don't think it should fail.

Matt
 
Marc 'BlackJack' Rintsch

Jack said:
> I need to process a large amount of data. The data structure fits well
> in a dictionary but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory or will it fail?

What about putting the data into a database? If the keys are strings the
`shelve` module might be a solution.
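
For instance, a minimal sketch, assuming the keys are strings and the values
pickle cleanly (the file name and record layout are made up):

    import shelve

    # Open (or create) a disk-backed mapping; it behaves much like a dict,
    # but the data lives in a file instead of in RAM.
    db = shelve.open('docs.db')

    db['doc-00001'] = {'size': 1024, 'tags': ['draft']}   # store a record

    if 'doc-00001' in db:               # existence check
        props = db['doc-00001']
        props['tags'].append('merged')  # update/merge properties
        db['doc-00001'] = props         # write the modified value back

    db.close()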

Ciao,
Marc 'BlackJack' Rintsch
 
kaens

Jack said:
> I need to process a large amount of data. The data structure fits well
> in a dictionary but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory or will it fail?
>
> Thanks.

Could you process it in chunks, instead of reading in all the data at once?
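
For example, something like this rough sketch (the file name and batch size
are made up; the real per-batch work would replace the counter):

    import itertools

    def batches(iterable, size=100000):
        """Yield successive lists of at most `size` items from `iterable`."""
        it = iter(iterable)
        while True:
            batch = list(itertools.islice(it, size))
            if not batch:
                break
            yield batch

    # The file is iterated lazily, so only one batch is in memory at a time.
    total = 0
    with open('records.txt') as f:
        for batch in batches(f):
            total += len(batch)     # stand-in for the real processing
    print(total, 'records seen')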
 
Vyacheslav Maslov

Larry said:
> Purchase more memory. It is REALLY cheap these days.

Not a solution at all. What if the amount of data exceeds the architecture's
memory limit, i.e. 4 GB on 32-bit?

A better solution is to use a database for data storage and processing.
 
Dennis Lee Bieber

> Thanks for the replies!
>
> Database will be too slow for what I want to do.
Slower than having every process on the computer potentially slowed
down due to page swapping (and, for really huge data, still running the
risk of exceeding the single-process address space)?
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
John Nagle

Jack said:
> I need to process a large amount of data. The data structure fits well
> in a dictionary but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory or will it fail?
>
> Thanks.

What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and you need to talk
seriously about 64-bit address space machines. At the other, you have no
idea how to either use a database or do sequential processing. Tell us more.

John Nagle
 
Jack

I have tens of millions (could be more) of documents in files. Each of them
has other properties in separate files. I need to check if they exist, update
and merge properties, etc. And this is not a one-time job. Because of the
quantity of the files, I think querying and updating a database will take a
long time...

Let's say, I want to do something a search engine needs to do in terms of the
amount of data to be processed on a server. I doubt any serious search engine
would use a database for indexing and searching. A hash table is what I need,
not powerful queries.
 
Jack

I suppose I can, but it won't be very efficient. I could have a smaller hash
table, process the entries that are in it, and save the ones that are not for
another round of processing. But a chunked hash table won't work that well,
because you don't know whether an entry exists in another chunk. To do this,
I'd need a rule to partition the data into chunks, so it's more work in
general.
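
If I did go that route, the most obvious rule would be to bucket records by a
hash of the key, so entries with the same key always land in the same chunk.
Roughly this (a sketch only; file names and the key layout are made up):

    import hashlib

    NUM_BUCKETS = 16    # chosen so one bucket's dictionary fits in RAM

    def bucket_of(key):
        """Map a key to a stable bucket number via its MD5 digest."""
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    # First pass: scatter records into per-bucket files, so that all
    # records sharing a key end up in the same file and can later be
    # loaded and merged one bucket at a time.
    outputs = [open('bucket_%02d.txt' % i, 'w') for i in range(NUM_BUCKETS)]
    with open('records.txt') as f:
        for line in f:
            key = line.split('\t', 1)[0]   # assume the key is the first field
            outputs[bucket_of(key)].write(line)
    for out in outputs:
        out.close()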
 
Jack

If swap memory cannot handle this efficiently, I may need to partition the
data across multiple servers and use RPC to communicate.
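
Roughly what I have in mind, sketched with the standard library's xmlrpc
modules (host names, the port, and the sharding rule are all invented):

    # shard_server.py - run one of these on every node
    from xmlrpc.server import SimpleXMLRPCServer

    table = {}                      # this node's slice of the big dictionary

    def get(key):
        return table.get(key)

    def put(key, value):
        table[key] = value
        return True

    server = SimpleXMLRPCServer(('0.0.0.0', 8000), allow_none=True)
    server.register_function(get)
    server.register_function(put)
    server.serve_forever()

The client side would route each key to the node that owns it, using a stable
hash so every client agrees on the owner:

    import zlib
    from xmlrpc.client import ServerProxy

    nodes = [ServerProxy('http://node%d:8000' % i, allow_none=True)
             for i in range(4)]     # invented host names

    def node_for(key):
        return nodes[zlib.crc32(key.encode('utf-8')) % len(nodes)]

    node_for('doc-00001').put('doc-00001', {'size': 1024})
    print(node_for('doc-00001').get('doc-00001'))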
 
Marc 'BlackJack' Rintsch

Jack said:
> I have tens of millions (could be more) of documents in files. Each of them
> has other properties in separate files. I need to check if they exist,
> update and merge properties, etc.
> And this is not a one time job. Because of the quantity of the files, I
> think querying and updating a database will take a long time...

But databases are exactly built and optimized to handle large amounts of
data.
> Let's say, I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

You are not forced to use complex queries, and an index is much like a hash
table, often even implemented as one. And a database doesn't have to be an
SQL database. The `shelve` module or an object DB like ZODB or Durus are
databases too.

Maybe you should try it and measure before claiming it's going to be too
slow and spending time implementing something like a database yourself.
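
Even a crude measurement would settle it, e.g. (a sketch; the shelve file
name and sample size are invented):

    import shelve
    import time

    db = shelve.open('docs.db')
    keys = list(db.keys())[:10000]          # sample of keys to look up

    start = time.time()
    for k in keys:
        _ = db[k]                           # one disk-backed lookup per key
    elapsed = time.time() - start

    print('%d lookups in %.2f s (%.1f us each)'
          % (len(keys), elapsed, 1e6 * elapsed / max(len(keys), 1)))
    db.close()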

Ciao,
Marc 'BlackJack' Rintsch
 
Steve Holden

Jack said:
> I have tens of millions (could be more) of documents in files. Each of them
> has other properties in separate files. I need to check if they exist,
> update and merge properties, etc.
> And this is not a one time job. Because of the quantity of the files, I
> think querying and updating a database will take a long time...
>
And I think you are wrong. But of course the only way to find out who's
right and who's wrong is to do some experiments and get some benchmark
timings.

All I *would* say is that it's unwise to proceed with a memory-only
architecture when you only have assumptions about the limitations of
particular architectures, and your problem might actually grow to exceed
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to
take a real nose-dive. Then where do you go? Much better to architect
the application so that you anticipate exceeding memory limits from the
start, I'd hazard.
> Let's say, I want to do something a search engine needs to do in terms of
> the amount of data to be processed on a server. I doubt any serious search
> engine would use a database for indexing and searching. A hash table is
> what I need, not powerful queries.
>
You might be surprised. Google, for example, use a widely-distributed
and highly-redundant storage format, but they certainly don't keep the
whole Internet in memory :)

Perhaps you need to explain the problem in more detail if you still need
help.

regards
Steve


--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------
 
John Machin

Jack said:
> I have tens of millions (could be more) of documents in files. Each of them
> has other properties in separate files. I need to check if they exist,
> update and merge properties, etc.

And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?



> And this is not a one time job. Because of the quantity of the files, I
> think querying and updating a database will take a long time...

Don't think, benchmark.
> Let's say, I want to do something a search engine needs to do in terms of
> the amount of data to be processed on a server. I doubt any serious search
> engine would use a database for indexing and searching. A hash table is
> what I need, not powerful queries.

Having a single hash table permits two not very powerful query
methods: (1) return the data associated with a single hash key (2)
trawl through the whole hash table, applying various conditions to the
data. If that is all you want, then comparisons with a serious search
engine are quite irrelevant.

What is relevant is that the whole hash table has to be in virtual memory
before you can start either type of query. This is not the case with a
database. Type 1 queries (with a suitable index on the primary key)
should use only a fraction of the memory that a full hash table would.
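
To make that concrete, here is a sketch using the standard library's sqlite3
module (the table, column, and file names are invented):

    import sqlite3

    conn = sqlite3.connect('docs.sqlite')
    conn.execute('CREATE TABLE IF NOT EXISTS docs '
                 '(doc_id TEXT PRIMARY KEY, props TEXT)')

    # Insert or update a single record without touching anything else.
    conn.execute('INSERT OR REPLACE INTO docs (doc_id, props) VALUES (?, ?)',
                 ('doc-00001', '{"size": 1024}'))
    conn.commit()

    # Type 1 query: fetch one record by key.  Only the index pages needed
    # for this lookup come into memory, not the whole table.
    row = conn.execute('SELECT props FROM docs WHERE doc_id = ?',
                       ('doc-00001',)).fetchone()
    print(row)
    conn.close()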

What is the primary key of your data?
 
Jack

I'll save them in a file for further processing.

John Machin said:
And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?
 
Jack

John, thanks for your reply. I will then use the files as input to generate
an index. So the files are temporary, and provide some attributes in the
index. So I do this multiple times to gather different attributes, merge, etc.
 
