Simon Mullis
Hi All,
I'm writing a Django App to webify a Python script I wrote that parses
some big XML docs and summarizes certain data contained within. This
app will be used by a closed group of people, all with their own login
credentials to the system (backed by the corporate SSO system). I've
already got Django, Celery and RabbitMQ working to handle uploads and
pass the "offline" parsing / processing task to the backend.
One way to do the XML processing is to use a number of XPath queries.
A key point is that the required output (and therefore the list of
queries) is not yet complete (and will always be a little "fluid").
Ultimately, I don't want to be stuck with maintaining this list of
XPath queries and want to allow my colleagues to add / change / remove
them as they see fit. I didn't know my "xpath from my elbow" when I
first started looking into this so I think they'll be able to learn it
pretty easily.
The only easy way I can think of to accomplish this is to create a new
table of { "name" : "xpath command" } key/value pairs and allow it
to be edited via the browser (using Django's Admin scaffolding).
I plan on making sure that when amending the list of XPath queries,
each entry is tested against a known good XML file so the user can see
sample results before they're allowed to save it to the database. I'm
aware that this will probably slow things down, as there'll be lots of
serial XPath queries, but even if the job takes 5 minutes longer then
it will still save man-days of manual work. I can live with this as
long as it's flexible.
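A minimal sketch of that pre-save check, assuming lxml and bare XPath
strings. The sample XML, the helper name preview_xpath and the limit
parameter are all made up for illustration; a real check would load the
known good file instead.

```python
from lxml import etree

# Hypothetical stand-in for the "known good" XML file.
SAMPLE_XML = (
    b"<root><houses><name>Dunroamin</name><name>Seaview</name>"
    b"</houses></root>"
)


def preview_xpath(expression, xml_bytes=SAMPLE_XML, limit=5):
    """Return (ok, payload): sample values on success, an error message otherwise."""
    try:
        # Compiling first catches malformed expressions before any data is touched.
        query = etree.XPath(expression)
    except etree.XPathSyntaxError as exc:
        return False, str(exc)
    root = etree.fromstring(xml_bytes)
    try:
        result = query(root)
    except etree.XPathEvalError as exc:
        return False, str(exc)
    if not isinstance(result, list):
        # XPath can also return numbers, strings or booleans; reject those here.
        return False, "expression did not return a node list"
    # Elements expose .text; plain string results are passed through unchanged.
    return True, [getattr(node, "text", node) for node in result[:limit]]


print(preview_xpath("//houses/name"))
print(preview_xpath("//houses/["))
```

The idea is that the admin form would only allow a save when the first
item of the tuple is true, and would show the second item to the user
either way.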
# In the meantime - and as a proof of concept - I'm using a dict instead.
xpathlib = {
    "houses": r'[ y.tag for y in x.xpath("//houses/*") ]',
    "names": r'[ y.text for y in x.xpath("//houses/name") ]',
    "footwear_type":
        r'[ y.tag for y in x.xpath("//cupboard/bottom_shelf/*") ]',
    "shoes":
        r'[ y.text for y in x.xpath("//cupboard/bottom_shelf/shoes/*") ]',
    "interface_types":
        r'[ y.text[:2] for y in x.xpath("//interface/name") ]',
}
# (I left a real one at the bottom so you can see I might want to do
# some string manipulation on the results within the list
# comprehension).
# Then in my backend task processing scripts I would have something like:
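import re

def apply_xpath(xpath_command, xml_frag):
    x = xml_frag
    # Only eval commands that look like the expected list comprehension.
    # The check runs against the command string, not the parsed XML.
    if re.findall(r'\[\sy\.\w+(?:\[.+\])?\s+for y in '
                  r'x\.xpath\("/{1,2}[\w+|\/|\*]+"\)\s\]', xpath_command):
        result = eval(xpath_command, {"__builtins__": []}, {"x": x})
        if isinstance(result, list):
            return result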
Yes, the regex is pretty unfriendly, but it works for all of the cases
above and fails for other non-valid examples I've tried.
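A minimal sketch of that kind of accept/reject check, using only the
standard library. The pattern is the one from apply_xpath; the names
ALLOWED, is_allowed and candidates are made up for illustration, and
the last two candidates are just examples of strings that should be
refused.

```python
import re

# Same pattern as in apply_xpath, split over two raw strings for width.
ALLOWED = re.compile(
    r'\[\sy\.\w+(?:\[.+\])?\s+for y in '
    r'x\.xpath\("/{1,2}[\w+|\/|\*]+"\)\s\]'
)

candidates = [
    '[ y.tag for y in x.xpath("//houses/*") ]',
    '[ y.text[:2] for y in x.xpath("//interface/name") ]',
    "__import__('os').getcwd()",
    'x.xpath("//houses/*")',
]


def is_allowed(command):
    # search() is unanchored, like the findall() call in apply_xpath:
    # it only asks whether the pattern occurs somewhere in the string.
    return ALLOWED.search(command) is not None


for command in candidates:
    print("%-5s %s" % (is_allowed(command), command))
```

Because the search is unanchored, this only shows that the expected
shape appears in the command, not that the command contains nothing
else.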
# Let's try it in ipython:
from lxml import etree
f = open('/tmp/sample.xml')
xml_frag = etree.fromstring(f.read())
print(apply_xpath(xpathlib["footwear_type"], xml_frag))
['ballet shoes', 'loafers', 'ballet shoes', 'hiking boots', 'training shoes', 'hiking boots', 'socks']
Is this approach ok?
I've been reading about this and know about the following example that
bypasses the __builtins__ restriction on eval (deliberately misspelled
so it won't work).
__import__('shutill').rmtreee('/') <===DO NOT TRY AND RUN THIS
As the potential audience for this Web App is both known and
controlled, am I being too cautious?
If "eval" is not the way forward, are there any suggestions for
another way to do this?
Thanks