Parsing strings -> numbers

Tuang

I've been looking all over in the docs, but I can't figure out how
you're *supposed* to parse formatted strings into numbers (and other
data types, for that matter) in Python.

In C#, you can say

int.Parse(myString)

and it will turn a string like "-12,345" into a proper int. It works
for all sorts of data types with all sorts of formats, and you can
pass it locale parameters to tell it, for example, to parse a German
"12.345,67" into 12345.67. Java does this, too.
(Integer.parseInt(myStr), IIRC).

What's the equivalent in Python?

And if the only problem is comma thousand-separators (e.g.,
"12,345.67"), is there a higher-performance way to convert that into
the number 12345.67 than using Python's formal parsers?

Thanks.
 
Skip Montanaro

tuanglen> I've been looking all over in the docs, but I can't figure out
tuanglen> how you're *supposed* to parse formatted strings into numbers
tuanglen> (and other data types, for that matter) in Python.

Check out the locale module. From "pydoc locale":

Help on module locale:

NAME
    locale - Locale support.

FILE
    /Users/skip/local/lib/python2.4/locale.py

MODULE DOCS
    http://www.python.org/doc/current/lib/module-locale.html

DESCRIPTION
    The module provides low-level access to the C lib's locale APIs
    and adds high level number formatting APIs as well as a locale
    aliasing engine to complement these.

    ...

FUNCTIONS
    atof(str, func=<type 'float'>)
        Parses a string as a float according to the locale settings.

    atoi(str)
        Converts a string to an integer according to the locale settings.

    ...

Skip
 
Miki Tebeka

Hello Tuang,

tuang> In C#, you can say
tuang>
tuang> int.Parse(myString)
tuang>
tuang> and it will turn a string like "-12,345" into a proper int. It works
tuang> for all sorts of data types with all sorts of formats, and you can
tuang> pass it locale parameters to tell it, for example, to parse a German
tuang> "12.345,67" into 12345.67. Java does this, too.
tuang> (Integer.parseInt(myStr), IIRC).
tuang>
tuang> What's the equivalent in Python?

Python has built-in "int", "long" and "float" functions. However,
they are more limited than what you want.

tuang> And if the only problem is comma thousand-separators (e.g.,
tuang> "12,345.67"), is there a higher-performance way to convert that into
tuang> the number 12345.67 than using Python's formal parsers?

f = float("12,345.67".replace(",", ""))
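
A minimal sketch of that comma-stripping approach, assuming the input always uses "," for thousands and "." for decimals. The helper name parse_plain_number is hypothetical. It tries int() first and falls back to float(), because int() rejects a string that still has a fractional part.

```python
def parse_plain_number(text):
    """Hypothetical helper: drop comma thousands separators, then convert."""
    cleaned = text.replace(",", "")
    # int() only accepts whole-number strings, so fall back to float()
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)


print(parse_plain_number("12,345.67"))  # 12345.67
print(parse_plain_number("-12,345"))    # -12345
```

This ignores locale entirely, so it would misread a German-style "12.345,67".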

HTH.
Miki
 
Tuang

Skip Montanaro said:
Check out the locale module. From "pydoc locale":

...

FUNCTIONS
    atof(str, func=<type 'float'>)
        Parses a string as a float according to the locale settings.

    atoi(str)
        Converts a string to an integer according to the locale settings.

...

Thanks for taking a shot at it, but it doesn't appear to work:

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python2321\lib\locale.py", line 179, in atoi
    return atof(str, int)
  File "C:\Python2321\lib\locale.py", line 175, in atof
    return func(str)
ValueError: invalid literal for int(): -12,345
-12345

Given the locale it thinks I have, it should be able to parse
"-12,345" if it can handle formats containing thousands separators,
but apparently it can't.

If Python doesn't actually have its own parsing of formatted numbers,
what's the preferred Python approach for taking data, perhaps
formatted currencies such as "-$12,345.00" scraped off a Web page, and
turning it into numerical data?

Thanks.
 
Duncan Booth

Tuang wrote:

tuang> Given the locale it thinks I have, it should be able to parse
tuang> "-12,345" if it can handle formats containing thousands separators,
tuang> but apparently it can't.
tuang>
tuang> If Python doesn't actually have its own parsing of formatted numbers,
tuang> what's the preferred Python approach for taking data, perhaps
tuang> formatted currencies such as "-$12,345.00" scraped off a Web page, and
tuang> turning it into numerical data?

The problem is that by default the numeric locale is not set up to parse
those numbers. You have to set that up separately:
>>> import locale
>>> locale.getlocale(locale.LC_NUMERIC)
(None, None)
>>> locale.getlocale()
['English_United Kingdom', '1252']
>>> locale.setlocale(locale.LC_NUMERIC, "English")
'English_United States.1252'
>>> locale.atof('1,234')
1234.0
>>> locale.setlocale(locale.LC_NUMERIC, "French")
'French_France.1252'
>>> locale.atof('1,234')
1.234
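
The session above could be wrapped in a small helper that sets LC_NUMERIC only for the duration of one parse and then restores the previous setting. This is a sketch under assumptions: the name atof_in_locale is hypothetical, and locale names are platform-dependent ("English" works on Windows as shown above, while POSIX systems typically want something like "en_US.UTF-8", which may not be installed).

```python
import locale


def atof_in_locale(text, locale_name):
    """Hypothetical helper: run locale.atof() under a temporary LC_NUMERIC."""
    # Calling setlocale() with only a category returns the current setting
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.atof(text)
    finally:
        # Restore the previous setting even if parsing failed
        locale.setlocale(locale.LC_NUMERIC, saved)


try:
    # "en_US.UTF-8" is an assumed locale name; it may not exist on your system
    print(atof_in_locale("1,234", "en_US.UTF-8"))
except locale.Error:
    print("that locale is not available here")
```

Note that setlocale() changes process-wide state, so this is only reasonable in a single-threaded script.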

Unless I've missed something, it doesn't support ignoring currency symbols
when parsing numbers, so you still can't handle "-$12,345.00" even if you
do set the numeric and monetary locales.
 
Skip Montanaro

tuang> Thanks for taking a shot at it, but it doesn't appear to work:
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python2321\lib\locale.py", line 179, in atoi
    return atof(str, int)
  File "C:\Python2321\lib\locale.py", line 175, in atof
    return func(str)
ValueError: invalid literal for int(): -12,345
-12345

Take a look at the output of locale.localeconv() with various locales set.
I think you'll find that locale.localeconv()['thousands_sep'] is '', not ','.
Failing that, you might want to simply replace the commas and dollar signs
with empty strings before passing to int() or float(), as someone else
suggested.
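
A quick way to check this, as a sketch: the helper name numeric_separators is hypothetical, and the example pins LC_NUMERIC to the "C" locale so the result does not depend on the machine. In the "C" locale the thousands separator is the empty string, which is why "-12,345" is rejected.

```python
import locale


def numeric_separators():
    """Hypothetical helper: (thousands_sep, decimal_point) for the active locale."""
    conv = locale.localeconv()
    return conv["thousands_sep"], conv["decimal_point"]


# Pin the numeric locale to "C" so the output is the same everywhere
locale.setlocale(locale.LC_NUMERIC, "C")
print(numeric_separators())  # ('', '.')
```

After a successful locale.setlocale(locale.LC_NUMERIC, ...) with a real locale name, the same call shows which separators atof() and atoi() will honour.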

Be careful if you're scraping web pages which might not use the same charset
as you do. You may find something like:

$123.456,78

as a quote price on a European website. I don't know how to tell what the
remote site used as its locale when formatting numeric data. Perhaps
knowing the charset of the page is sufficient to make an educated guess.
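
If you already know which convention a page uses, one option is to skip the locale machinery and pass the separators in by hand. The following is only a sketch: parse_formatted and its parameters are hypothetical names, it assumes single-character separators, and it returns a Decimal so that currency amounts are not rounded the way floats can be.

```python
from decimal import Decimal


def parse_formatted(text, thousands_sep=",", decimal_point="."):
    """Hypothetical helper: parse "-$12,345.00"-style text into a Decimal."""
    # Keep only digits, a minus sign, and the two separator characters;
    # this drops currency symbols and whitespace.
    allowed = set("0123456789-") | {thousands_sep, decimal_point}
    cleaned = "".join(ch for ch in text if ch in allowed)
    # Remove grouping first, then normalise the decimal mark to "."
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_point, ".")
    return Decimal(cleaned)


print(parse_formatted("-$12,345.00"))  # -12345.00
print(parse_formatted("$123.456,78", thousands_sep=".", decimal_point=","))  # 123456.78
```

It does not guess the convention for you, so an ambiguous string like "12,345" is still read according to whatever separators you pass.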

Skip
 
Tuang

Skip Montanaro said:
Be careful if you're scraping web pages which might not use the same charset
as you do. You may find something like:

$123.456,78

as a quote price on a European website. I don't know how to tell what the
remote site used as its locale when formatting numeric data. Perhaps
knowing the charset of the page is sufficient to make an educated guess.

Thanks, Skip. I'm not planning some sort of shady screen scraping
operation or anything of that sort. This is more of a generic question
about how to use Python as a convenient utility language.

Sometimes I'll find a table of interesting data somewhere as I'm just
surfing around the Web, and I'll want to grab the data and play with
it a bit. At that scale of operation, I can just look at the page
source and figure out the encoding, what the currency is, etc. I know
how to turn a formatted string into a usable number in other languages
that I use (though I might have to check the docs in some cases to
remind myself of the details), and since the docs didn't really make
it obvious what the "one clear and obvious way to do it" was in
Python, I thought I'd ask.

It appears as though Python doesn't (yet) have the same formal support
for format parsing and internationalization that languages like C# and
Java have, but that's okay for now. I just wanted to make sure I
didn't start creating my own naive, homemade equivalents of functions
that are already part of the standard API.
 
