Is there a maximum size to a Python program?


Carbon Man

I have a program that is generated from a generic process. Its job is to
check whether records (replicated from another system) exist in a local
table and, if they don't, to add them. I have one of these programs for
every table in the database. Everything works well until I do the postcode
table. The generated code is 5MB for a system with no current data; normally
the file would not be this big, as only the changes are copied over. Python
just quits; I have tried stepping through the code in the debugger but it
doesn't even start.
I am thinking that dynamically generating the programs to run might not be
such a good idea. It would be a shame to drop it, because the system needs to
be generic and it runs from an XML file, so the resulting code could be
pretty complex, and I am new to Python. The program did generate a .pyc, so it
was able to compile.
Thoughts, anyone?
 

Steven D'Aprano

I have a program that is generated from a generic process. Its job is
to check whether records (replicated from another system) exist
in a local table and, if they don't, to add them. I have one of these
programs for every table in the database.


Sounds like a needlessly complicated way of doing things. Surely
replicating data from a database is a solved problem?

Everything works well until I
do the postcode table. The generated code is 5MB for a system with no
current data.

I don't understand what you mean by "with no current data". Do you mean
that you generate 5MB of source code when the original postcode table is
empty? Or 5MB of source code when the *replicated* table is empty? Or
something else?

Normally the file would not be this big as only the
changes are copied over. Python just quits, I have tried stepping
through the code in the debugger but it doesn't even start.

That depends on how your generated code looks.

Some years ago, somebody posted here complaining that his generated code
wasn't working. It turned out that he was generating a single function
that looked vaguely like this:

def function(alist):
    alist[0] = 0
    alist[1] = 0
    alist[2] = 0
    alist[3] = 0
    ...
    alist[29999997] = 0
    alist[29999998] = 0
    alist[29999999] = 0
    return alist

Naturally, Python would need to load all 30 million lines of code into a
*single* code object, which was too large to load.

After changing the generated code to look something like this:

def function(alist):
    for i in xrange(30000000):
        alist[i] = 0
    return alist

it worked perfectly.

(I am a little fuzzy on the precise details, and googling is not helping,
so please excuse any discrepancies between my memory of the thread and
the actual reality.)


I am thinking that dynamically generating the programs to run might not
be such a good idea. It would be a shame to drop it because the system
needs to be generic and it runs from an XML file so the resulting code
could be pretty complex, and I am new to Python. The program did
generate a pyc so it was able to compile.
Thoughts anyone?

Perhaps you could explain what the 5MB of code is supposed to do, and
how you generate the code, and show a *small* number of sample lines.
 

Paul Hemans

Thanks for the reply,
Sounds like a needlessly complicated way of doing things. Surely
replicating data from a database is a solved problem?
The data is coming from a legacy system and going into SQLalchemy to give
database independence. We will probably update the legacy apps in the future
but it is a massive job for little benefit.
I don't understand what you mean by "with no current data". Do you mean
that you generate 5MB of source code when the original postcode table is
empty? Or 5MB of source code when the *replicated* table is empty? Or
something else?
5MB of source code when the replicated table is empty
Perhaps you could explain what the 5MB of code is supposed to do, and
how you generate the code, and show a *small* number of sample lines.
Here is a generated file that works. Other files are essentially the same
but with more entries. This one only adds if the record does not exist; the
next stage was to update an existing record with changes.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import schema
import time
from datetime import *
from sqlalchemy import *
from sqlalchemy.orm import *
engine = create_engine('sqlite:///tutorial.db', echo=False)
Session = sessionmaker(bind=engine)
session = Session()

entry = schema.BILLS()
exists = session.query(schema.BILLS).filter(schema.BILLS.REFNO==u"1")
if exists.count() == 0:
    entry.REFNO = u"1"
    entry.THRDPTY = u"""C5"""
    entry.AMOUNT = 0
    entry.TIMESTAMP = 0
    entry.ALLOCATED = 0
    session.add(entry)

session.commit()
session.close()
 

John Machin

I have a program that is generated from a generic process. Its job is to
check whether records (replicated from another system) exist in a
local table and, if they don't, to add them. I have one of these programs for
every table in the database. Everything works well until I do the postcode
table. The generated code is 5MB for a system with no current data. Normally
the file would not be this big as only the changes are copied over. Python
just quits, I have tried stepping through the code in the debugger but it
doesn't even start.
I am thinking that dynamically generating the programs to run might not be
such a good idea.

I tend to agree with that line of thinking. What you need is:
(a) ONE program. That's not one per table, that's one and only one,
period/full-stop.
(b) Per-table config info like column names and types -- you have this
already.

The idea is that instead of reading a data file and generating SQL per
row, you generate SQL once, with parameter markers for the data
values. Then you read the data file and execute the prepared SQL for
each row.

More detail:

At the start, build an INSERT statement that's tailored for the table
that you are updating:

insert_sql = "INSERT INTO table_name VALUES (?,?,?,?)"
or (if you are paranoid)
"INSERT INTO table_name(col1,col2,col3,col4) VALUES (?,?,?,?)"

Notes: (a) assumes four columns as an example (b) Lower-case stuff has
to be replaced with the correct table name and column name(s) (c) "?"
is the parameter marker for the DB module you are using

For each row read from the data file, transform the input into a tuple
of suitable data types, and then do the insert:
import yourdbmodule

# ... much later:
try:
    cursor.execute(insert_sql, data_tuple)
except yourdbmodule.IntegrityError:
    if it_is_the_primary_key_not_unique_exception:   # pseudocode
        pass    # ignoring this is part of your requirement
    else:
        raise   # probably never happens, but should be checked

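Pulling those pieces together, a minimal sketch of the whole loop might look like this (sqlite3, the table "postcode" and its columns are only illustrative assumptions, and read_rows_somehow() stands in for however you read the data file):

import sqlite3   # stand-in for whatever DB-API module you are actually using

conn = sqlite3.connect("tutorial.db")
cursor = conn.cursor()

# Per-table config you already have; "postcode" and its columns are made up here.
table_name = "postcode"
col_names = ["code", "suburb", "state"]

insert_sql = "INSERT INTO %s (%s) VALUES (%s)" % (
    table_name, ", ".join(col_names), ", ".join(["?"] * len(col_names)))

for data_tuple in read_rows_somehow():   # hypothetical: yields one tuple per data-file row
    try:
        cursor.execute(insert_sql, data_tuple)
    except sqlite3.IntegrityError:
        pass   # primary key already present -- skipping it is the requirement

conn.commit()
conn.close()
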
It would be a shame to drop it because the system needs to
be generic

OTOH it would be a shame to continue with a process where the amount
of code lying around is O(N_update_rows) instead of O(1) and the
number of programs is O(N_tables) instead of O(1).
and it runs from an XML file so the resulting code could be
pretty complex, and I am new to Python.

The above advice is in general language-agnostic; I'm just using
Python code because you asked in this newsgroup. The fact that other
languages may give you more rope with which to hang yourself is also
immaterial :)

HTH,
John
 

John Machin

[snip]
exists = session.query(schema.BILLS).filter(schema.BILLS.REFNO==u"1")
if exists.count() == 0:
    entry.REFNO = u"1"
    entry.THRDPTY = u"""C5"""
    entry.AMOUNT = 0
    entry.TIMESTAMP = 0
    entry.ALLOCATED = 0
    session.add(entry)

ἐδάκρυσεν ὁ Ἰησοῦς
 

Paul Hemans

Thanks John, I understand where you are coming from and will try and digest
it all. One problem, though, that I didn't mention in my original posting is
that the replication may only require updating one or more fields, which is a
problem with generating a single SQL statement to cover all requests. I am
having a look at eval and exec to see if they will give me a little more
flexibility in runtime generation of the code.

 

Emmanuel Surleau

I have a program that is generated from a generic process. Its job is to
check whether records (replicated from another system) exist in a
local table and, if they don't, to add them. I have one of these programs for
every table in the database. Everything works well until I do the postcode
table. The generated code is 5MB for a system with no current data.
Normally the file would not be this big as only the changes are copied
over. Python just quits, I have tried stepping through the code in the
debugger but it doesn't even start.
I am thinking that dynamically generating the programs to run might not be
such a good idea. It would be a shame to drop it because the system needs
to be generic and it runs from an XML file so the resulting code could be
pretty complex, and I am new to Python. The program did generate a pyc so
it was able to compile.
Thoughts anyone?

I agree with what most people here have said: don't generate Python code;
it's a bad idea. Put your data in CSV files instead (one per table, named
after the table, for instance). You might want to either have a separate file
with metadata (column names, etc.) or have the first line be a header:

"<col1_name:col1_type>","<col2_name:col2_type>"...
"<data_for_col1>","<data_for_col2>",...
"<data_for_col1>","<data_for_col2>",...

Then use the csv module (in the standard library) to read the file.
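A minimal sketch of reading such a file (the file name, the name:type header convention and the final print are illustrations only, not requirements of the csv module):

import csv

reader = csv.reader(open("postcode.csv", "rb"))
header = reader.next()                        # e.g. ["code:N", "suburb:C", "state:C"]
col_names = [h.split(":")[0] for h in header]
col_types = [h.split(":")[1] for h in header]

for row in reader:
    # row is a list of strings in header order; convert each value
    # according to col_types, then insert/update it as discussed above.
    print dict(zip(col_names, row))
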

If your original data is in XML, you might want to parse it using lxml
instead.

Cheers,

Emm
 

John Machin

Thanks John, I understand where you are coming from and will try and digest
it all. One problem though that I didn't mention in my original posting was
that the replication may only require updating one or more fields, which is a
problem with generating a single SQL statement to cover all requests.

OK so you need TWO programs -- one tailored for initial loading of a
table, the other for subsequent updates. Still O(1) :)

OR you could still do it in one, by using the prepared whole-row SQL
statement if appropriate, or building an SQL statement on the fly if
it's an update.
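For the update case, a sketch of building the statement on the fly (the table, key column and the changed-columns dict are assumptions; "?" again stands for your module's parameter marker):

changed = {"THRDPTY": u"C6", "AMOUNT": 10}    # hypothetical changed columns for one row
cols = list(changed)
update_sql = "UPDATE BILLS SET %s WHERE REFNO = ?" % (
    ", ".join("%s = ?" % c for c in cols))
cursor.execute(update_sql, [changed[c] for c in cols] + [u"1"])
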
I am
having a look at eval and exec

WRONG WAY
GO BACK
to see if they will give me a little more
flexibility in runtime generation of the code.

And another thing that you didn't mention was that you are using
SQLAlchemy -- perhaps you might like to ask your question on that
package's forum ... including a few more facts than heretofore :)

HTH
John
 

John Machin

entry = schema.BILLS()
exists = session.query(schema.BILLS).filter(schema.BILLS.REFNO==u"1")
if exists.count() == 0:
    entry.REFNO = u"1"
    entry.THRDPTY = u"""C5"""
    entry.AMOUNT = 0
    entry.TIMESTAMP = 0
    entry.ALLOCATED = 0

Instead of generating code like that, you can do it dynamically. The
built-in setattr() function is your friend, e.g.

for col_name in col_names:
    if needs_updating(col_name):
        value = get_new_value(col_name)
        setattr(entry, col_name, value)
session.add(entry)

session.commit()
session.close()
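Combined with the existence check from the quoted code, a sketch of a generic insert-or-update might look like this (the col_values dict of column name to new value is assumed to be built from your input):

col_values = {"REFNO": u"1", "THRDPTY": u"C5", "AMOUNT": 0}   # assumed example row
exists = session.query(schema.BILLS).filter(schema.BILLS.REFNO == col_values["REFNO"])
if exists.count() == 0:
    entry = schema.BILLS()     # new record
else:
    entry = exists.first()     # existing record to update
for col_name, value in col_values.items():
    setattr(entry, col_name, value)
if exists.count() == 0:
    session.add(entry)
session.commit()
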

Cheers,

John
 

Lawrence D'Oliveiro

One problem though that I didn't mention in my original
posting was that the replication may only require updating one or more
fields, which is a problem with generating a single SQL statement to
cover all requests.

That's not a big issue. Assume the field names and corresponding values are
coming from a Python dict, eg

FieldValues = \
    {
        "field1" : ... value1 ...,
        "field2" : ... value2 ...,
    }

then you can construct an SQL statement on the fly with something like

sqlcmd = \
    (
        "insert into my_table set "
    +
        ", ".join
            (
                "%s = %s" % (k, SQLString(FieldValues[k]))
                for k in FieldValues.keys()
            )
    )

where SQLString is as defined at
<http://codecodex.com/wiki/index.php?title=Useful_MySQL_Routines#Quoting>.
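For comparison, the same dict-driven insert can be built with DB-API parameter markers so that only the column names are interpolated into the SQL text (sqlite3 and the sample values are just assumptions for illustration):

import sqlite3

conn = sqlite3.connect("tutorial.db")
FieldValues = {"field1": 42, "field2": "some text"}    # hypothetical row

cols = list(FieldValues.keys())
sqlcmd = "INSERT INTO my_table (%s) VALUES (%s)" % (
    ", ".join(cols), ", ".join(["?"] * len(cols)))
conn.execute(sqlcmd, [FieldValues[k] for k in cols])
conn.commit()
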
 

Lawrence D'Oliveiro

Lawrence D'Oliveiro said:
One problem though that I didn't mention in my original
posting was that the replication may only require updating one or
more fields, which is a problem with generating a single SQL
statement to cover all requests.

That's not a big issue. Assume the field names and corresponding
values are coming from a Python dict, eg

FieldValues = \
    {
        "field1" : ... value1 ...,
        "field2" : ... value2 ...,
    }

then you can construct an SQL statement on the fly with something like

sqlcmd = \
    (
        "insert into my_table set "
    +
        ", ".join
            (
                "%s = %s" % (k, SQLString(FieldValues[k]))
                for k in FieldValues.keys()
            )
    )

where SQLString is as defined at
<http://codecodex.com/wiki/index.php?title=Useful_MySQL_Routines#Quoting>.

Not so nice if the input data is something like:

FieldValues = { "field1=0);DROP my_table;": "" }

So you want something to validate fieldnames.

That's not going to happen. These are field names, not values you're talking
about.
Also you are assuming that all the values are strings ...

No I'm not.
 

Gabriel Genellina

base64 encoding

Yes, my news client kindly decoded it for me.
It looks like Greek, and I can transliterate the last word as Iesous, so
it might be some kind of Biblical reference. But I don't have the
slightest idea of what that means...
 

Gabriel Genellina

On Mon, 27 Apr 2009 14:50:23 -0300, Gabriel Genellina wrote:
Yes, my news client kindly decoded it for me.
It looks like Greek, and I can transliterate the last word as Iesous, so
it might be some kind of Biblical reference. But I don't have the
slightest idea of what that means...

Answering myself: Jesus wept.
http://en.wikipedia.org/wiki/Jesus_wept#Use_as_an_expletive

(This reminds me of a book by Ellery Queen that I read a (very long) time
ago, the phrase "And Ellery cried" was a complete chapter by itself. I
didn't catch the Gospel reference that time - I thought it was just a
matter of style, a very short (and dramatic) chapter).
 

Paul Hemans

The reason I started down this path is that the program is grinding to a
halt (my PC as well) and the only way to get it going again is:

Memory used by the process is around 500MB. I have posted about this problem
before on this newsgroup, but I didn't get a resolution. So I was trying to
separate the logic into different files and finish each one with
gc.collect().
Sorry I didn't post the original problem as well, but I really just wanted
the thing to work and not cloud the issue again.
I have implemented your approach and the code is much cleaner (except for
another problem that I have posted to SQLAlchemy). It doesn't solve my
original problem with the memory, but I really need SOMETHING to work. So I
will keep plodding away and then maybe move the XML processing to lxml and
see if that resolves the memory problem.
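If the memory problem turns out to be minidom holding the whole request.xml tree, a sketch of incremental parsing with lxml looks like this (process_tuple() is a hypothetical hook for the per-row insert/update; the TUPLE tag comes from the XML layout used later in this thread):

from lxml import etree

context = etree.iterparse("request.xml", events=("end",), tag="TUPLE")
for event, elem in context:
    process_tuple(elem)          # hypothetical: do the insert/update for this row
    # Clear the element (and already-processed siblings) so memory stays bounded:
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context
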


 

Paul Hemans

Hi Andrew,
The reason I am using mapped objects is that I need to abstract from the
database implementation, allowing the replication to target a number of
different platforms. This will definitely slow things down.

process a whole pile in memory and then (perhaps every 10,000 - when your
memory is about to run out and start paging) flush the session.

Under Windows, how can I tell when memory is about to run out? I guess there
is no cross-platform solution to this.
Writing external files has all come about from a post titled "Memory
problems (garbage collection)" by Carbon Man, which I never got a resolution
to.
I was trying to execute gc.collect() when a process was complete because I
was having huge problems with memory (though the problem still remains). If
I stop at "import schema" there are 2524104 objects processed by
gc.collect().
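Rather than trying to detect low memory, a common workaround is simply to flush and expunge the session every N rows; a rough sketch (the batch size, the row source and make_entry() are assumptions):

BATCH = 10000
for i, row in enumerate(incoming_rows):      # whatever yields the replicated records
    session.add(make_entry(row))             # hypothetical: build the mapped object
    if i % BATCH == 0:
        session.flush()
        session.expunge_all()                # drop references so objects can be collected
                                             # (older SQLAlchemy versions: session.clear())
session.commit()
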

There is a new issue in the code (marked with #### Problem with SQLalchemy),
but here is the program as it now stands:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Reads an XML file and creates schema.py from the TABLES branch
# (very simplistic table structures).
# Processes the nodes within the DATA branch, into SQLalchemy.
from xml.dom import minidom
import os
import codecs
from cStringIO import StringIO
import time
from datetime import *
from sqlalchemy import *
from sqlalchemy.orm import *

class xmlProcessing:
    """General class for XML processing"""

    def process(self, filename="", xmlString=""):
        if xmlString:
            pass
        elif filename:
            xmldoc = minidom.parse(filename)
            self.parse(xmldoc.documentElement)

    def parseBranch(self, parentNode):
        """Process an XML branch"""
        for node in parentNode.childNodes:
            try:
                parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
            except AttributeError:
                continue
            if parseMethod(node):
                continue
            self.parseBranch(node)

    def parse_Document(self, node):
        pass

    def parse_Text(self, node):
        pass

    def parse_Comment(self, node):
        pass

    def parse_Element(self, node):
        try:
            handlerMethod = getattr(self, "do_%s" % node.tagName)
        except AttributeError:
            return False
        handlerMethod(node)
        return True

class reptorParsing(xmlProcessing):
    """Specific class for generating a SQLalchemy program to create tables
    and populate them with data"""

    def __init__(self):
        self.schemaPreface = StringIO()
        self.schemaPreface.write("""from sqlalchemy import *
import time
from datetime import *
from sqlalchemy.ext.declarative import declarative_base
#engine = create_engine('sqlite:///tutorial.db', echo=False)
#metadata = MetaData()
Base = declarative_base()""")
        self.schemaTables = StringIO()
        self.schemaTablesCreate = StringIO()
        self.schemaFields = StringIO()
        self.tableDict = {}
        self.tableName = StringIO()
        self.tables = StringIO()

    def parse(self, parentNode):
        """Main entry point to begin processing a XML document"""
        self.parseBranch(parentNode)
        # Properties such as schemaTables and .tables are populated by the
        # various methods below.
        fupdate = codecs.open(os.path.join(os.getcwd(), "update.py"), 'w', 'UTF-8')
        if self.schemaTables:
            f = codecs.open(os.path.join(os.getcwd(), "schema.py"), 'w', 'UTF-8')
            f.write(self.schemaPreface.getvalue() + u"\n" + self.schemaTables.getvalue() +
                    u"if __name__ == '__main__':\n" +
                    self.schemaTablesCreate.getvalue() +
                    u"\n    engine = create_engine('sqlite:///tutorial.db', echo=False)\n" +
                    u"    metadata = Base.metadata\n" +
                    u"    metadata.create_all(engine)\n")
            f.close()
        if self.tables:
            fupdate.write(self.tables.getvalue())
        fupdate.close()

    def do_TABLES(self, tableNode):
        """Process schema for tables"""
        for node in tableNode.childNodes:
            self.tableName = node.tagName
            # Define a declarative mapping class
            self.schemaTables.write("""\nclass %s(Base):
    __tablename__ = '%s'
""" % (self.tableName, self.tableName))
            self.schemaTablesCreate.write("\n    tableDef = " + self.tableName + "()")
            self.schemaFields = StringIO()
            # allow for userA = users("Billy","Bob") via a __init__()
            self.schemaInitPreface = StringIO()
            self.schemaInitPreface.write("    def __init__(self")
            self.schemaInitBody = StringIO()
            self.parseBranch(node)
            self.schemaInitPreface.write("):\n")
            self.schemaTables.write(self.schemaFields.getvalue() + "\n" +
                                    self.schemaInitPreface.getvalue() +
                                    self.schemaInitBody.getvalue() + "\n")

        # Need a way to execute schema so that the tables will be created.

    def do_FIELDS(self, fieldsNode):
        """Process schema for fields within tables"""
        for node in fieldsNode.childNodes:
            if self.schemaFields:
                self.schemaFields.write("\n")
            cType = ""
            initType = '""'
            # The attribute type holds the type of field
            crType = node.attributes["type"].value
            if crType == u"C":
                cType = "String(length=%s)" % node.attributes["len"].value
                initType = '""'
            elif crType == u"N" and node.attributes["dec"].value == u'0':
                cType = "Integer"
                initType = '0'
            elif crType == u"N":
                cType = "Numeric(precision=%s, scale=%s)" % (
                    node.attributes["len"].value, node.attributes["dec"].value)
                initType = '0'
            elif crType == u"L":
                cType = "Boolean"
                initType = 'True'
            elif crType == u"T":
                cType = "DateTime"
                initType = "datetime.strptime('','')"
            elif crType == u"D":
                cType = "Date"
                initType = "datetime.strptime('','')"
            elif crType == u"M" or crType == u"G":
                cType = "Text"
                initType = '""'

            if node.attributes.getNamedItem("primary"):
                cType += ", primary_key=True"
            self.schemaFields.write("    %s = Column(%s)" % (node.tagName, cType))
            self.schemaInitPreface.write(", \\\n                 %s=%s" % (node.tagName, initType))
            self.schemaInitBody.write("        self.%s = %s\n" % (node.tagName, node.tagName))
            self.tableDict[self.tableName + "." + node.tagName] = crType

    def do_DATA(self, dataNode):
        """This is for processing actual data to be pushed into the tables

        Layout is DATA -> TABLE_NAME key='primary_field' -> TUPLE ->
        FIELD_NAME -> VALUE"""
        # This is the schema program created in the TABLES branch
        import schema
        engine = create_engine('sqlite:///tutorial.db', echo=False)
        Session = sessionmaker(bind=engine)
        self.session = Session()

        for node in dataNode.childNodes:
            self.tableName = node.tagName

            self.keyValue = ""
            self.keyField = node.attributes["key"].value
            self.parseBranch(node)
            #self.tables.write("\nimport update_%s" % (self.tableName))
            self.fDataUpdate.close()
            self.session.commit()

        self.session.close()

    def do_TUPLE(self, tupleNode):
        """A TUPLE is what the XML file refers to a table row.
        Sits below a DATA -> table_name branch"""

        import schema
        colValues = {}
        TI = eval("schema." + self.tableName + "()")
        exists = None
        for node in tupleNode.childNodes:
            for dataNode in node.childNodes:
                crType = self.tableDict[self.tableName + "." + node.tagName]

                if crType == u"C" or crType == u"M":
                    cValue = dataNode.data
                elif crType == u"T":
                    cValue = datetime.strptime(dataNode.data, "%Y-%m-%d %H:%M")
                elif crType == u"D":
                    cValue = datetime.strptime(dataNode.data, "%Y-%m-%d")
                else:
                    cValue = dataNode.data
                if node.tagName == self.keyField:
                    #### Problem with SQLalchemy.
                    exists = self.session.query(TI).filter(getattr(TI, self.keyField) == cValue)
                # Set the value onto a list to be processed below
                colValues[node.tagName] = cValue

        if exists is None:
            return

        if exists.count() == 0:
            entry = TI
        else:
            entry = exists[0]

        for col_name in colValues:
            setattr(entry, col_name, colValues[col_name])
        if exists.count() == 0:
            # Insert
            self.session.add(TI)


if __name__ == '__main__':
    replicate = reptorParsing()
    replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
    del replicate
 

Paul Hemans

Hi,
Please see my post titled "The whole story"

Martin P. Hellwig said:
<cut>
To answer the topic question, it would be limited to the memory your
platform can allocate and I guess that the most critical part would be if
you run out of memory for handling the namespace references.

However my magic 8 ball whispers to me that the provided solution tries to
be smarter than the given problem requires, which is not a good thing. If
so and if possible, please provide a more detailed explanation of the
actual problem you want to solve instead of the problem you currently have
with the provided solution.
 
