Cleaning up an ASCII file?


Nick Matzke

Hi all,

So I'm parsing an XML file returned from a database. However, the
database entries have occasional non-ASCII characters, and this is
crashing my parsers.

Is there some handy function out there that will schlep through a file
like this, and do something like fix the characters that it can
recognize, and delete those that it can't? Basically, like the BBEdit
"convert to ASCII" menu option under "Text".

I googled some on this, but nothing obvious came up that wasn't specific
to fixing one or a few characters.
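
Roughly, what I imagine that BBEdit option does is something like this
(a rough sketch only; the filenames are made up) -- just drop every byte
outside the 7-bit range:

============
# Brute-force sketch of a BBEdit-style "convert to ASCII": copy a file,
# deleting every non-ASCII byte. Note this silently destroys multi-byte
# UTF-8 characters rather than fixing them.
data = open('results.xml', 'rb').read()
open('results_ascii.xml', 'wb').write(
    ''.join(c for c in data if ord(c) < 128))
============

But ideally it would transliterate (ä -> a) rather than just delete.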

Thanks!
Nick


--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: (e-mail address removed)

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================

John Machin

Nick Matzke said:
Hi all,

So I'm parsing an XML file returned from a database. However, the
database entries have occasional non-ASCII characters, and this is
crashing my parsers.

So fix your parsers. google("unicode"). Deleting stuff that you don't
understand is an "interesting" approach to academic research :-(
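
If the parser is handed the raw bytes, it will decode them itself using
the encoding declared in the XML prolog, and no cleanup pass is needed
at all. A minimal sketch with ElementTree (the filename is invented):

============
import xml.etree.cElementTree as ET  # in the stdlib since Python 2.5

# ElementTree decodes the file using the XML encoding declaration;
# any text containing non-ASCII comes back as a unicode object.
tree = ET.parse('gbif_results.xml')
for elem in tree.getiterator():
    text = elem.text  # keep as unicode; encode only when writing out
============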

Care to divulge what "crash" means? e.g. the full traceback and error
message, plus what version of Python on what platform, what version of
ElementTree or other XML software you are using ...
Nick's .sig said:
Center for Theoretical Evolutionary Genomics

If your .sig evolves much more, it will consume all available
bandwidth in the known universe and then some ;-)

Nick Matzke

Apologies, I figured there was some easy, obvious solution, since there
is one in BBEdit. I will explain further...

John said:
So fix your parsers. google("unicode"). Deleting stuff that you don't
understand is an "interesting" approach to academic research :-(

Not if it's just weird versions of dash characters, umlauted characters,
and the like, which is what I bet it is. Those sorts of things, and the
apparent inability of lots of email readers and websites to deal with
them, have been annoying me for years, so I tend to move straight
towards genocidal tactics when I detect their presence.

(My database source is GBIF; they get museum specimen submissions from
around the planet, there are zillions of records, and I am just a user,
so fixing it on their end is not a realistic option.)

John said:
Care to divulge what "crash" means? e.g. the full traceback and error
message, plus what version of Python on what platform, what version of
ElementTree or other XML software you are using ...

All that is fine; the problem actually shows up when I try to print to
the screen in IPython:

============
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 293: ordinal not in range(128)
============
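
So the parse itself succeeds; as far as I can tell the failure is Python
2's print statement, which encodes unicode output with the default
'ascii' codec. It's easy to reproduce by hand:

============
>>> u'Jyv\xe4skyl\xe4'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 3: ordinal not in range(128)
============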

Probably this is the line in the file which is causing problems (as
displayed in BBEdit):

======================
<gbif:statements>-

This document contains data shared through the GBIF Network - see
http://data.gbif.org/ for more information.

All usage of these data must be in accordance with the GBIF Data Use
Agreement - see http://www.gbif.org/DataProviders/Agreements/DUA

Please cite these data as follows:

Jyväskylä University Museum - The Section of Natural Sciences,
Vascular plant collection of Jyvaskyla University Museum (accessed
through GBIF data portal, http://data.gbif.org/datasets/resource/462,
2009-06-11)
Missouri Botanical Garden, Missouri Botanical Garden (accessed through
GBIF data portal, http://data.gbif.org/datasets/resource/621, 2009-06-11)
Museo Nacional de Costa Rica, herbario (accessed through GBIF data
portal, http://data.gbif.org/datasets/resource/566, 2009-06-11)
National Science Museum, Japan, Kurashiki Museum of Natural History
(accessed through GBIF data portal,
http://data.gbif.org/datasets/resource/599, 2009-06-11)
The Swedish Museum of Natural History (NRM), Herbarium of Oskarshamn
(OHN) (accessed through GBIF data portal,
http://data.gbif.org/datasets/resource/1024, 2009-06-11)
Tiroler Landesmuseum Ferdinandeum, Tiroler Landesmuseum Ferdinandeum
(accessed through GBIF data portal,
http://data.gbif.org/datasets/resource/1509, 2009-06-11)
UCD, Database Schema for UC Davis [Herbarium Labels] (accessed through
GBIF data portal, http://data.gbif.org/datasets/resource/734, 2009-06-11)

-
</gbif:statements>
======================


Presumably "Jyväskylä University Museum" is the problem, since
there are umlauted a's in there. (Note, though, that I have thousands of
records to parse, so there are going to be all kinds of other umlauted &
accented characters in these sorts of search results.)

So the goal is to replace the characters with un-umlauted versions or
some such.

Cheers!
Nick


PS: versions I am using:
========
nick$ python -V
Python 2.5.2 |EPD Py25 4.1.30101|
========



John said:
If your .sig evolves much more, it will consume all available
bandwidth in the known universe and then some ;-)

...it's easier to have a big sig than to try and remember all that stuff
;-)...





Nick Matzke

Looks like this was a solution:

1. Use this guy's unescape function to convert HTML/XML entities to
unicode (a sketch of that function appears below):
http://effbot.org/zone/re-sub.htm#unescape-html


2. Take the unicode and convert to approximate plain ASCII matches with
unicodedata (after import unicodedata)


import unicodedata

# unescape() is the entity-to-unicode function from the page linked above
ascii_content2 = unescape(line)

# decompose accented characters, then drop whatever still isn't ASCII
ascii_content = unicodedata.normalize('NFKD',
    unicode(ascii_content2)).encode('ascii', 'ignore')


The string "line" would give the error, but ascii_content does not.
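
In case the link ever rots: the unescape function on that page goes
roughly like this (rewritten from memory, so treat the page as the
authoritative version):

============
import re, htmlentitydefs

def unescape(text):
    # Replace HTML/XML character references and named entities in
    # `text` with the corresponding unicode characters.
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # numeric character reference, e.g. &#228; or &#xe4;
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity, e.g. &auml;
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave anything unrecognised as-is
    return re.sub("&#?\w+;", fixup, text)
============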

Cheers!
Nick

PS: "asciiDammit" is also fun to look at






John Machin

Nick Matzke said:
Looks like this was a solution:

1. Use this guy's unescape function to convert from HTML/XML Entities to
unicode
http://effbot.org/zone/re-sub.htm#unescape-html

Looks like you didn't notice "this guy"'s unaccent.py :)
http://effbot.org/zone/unicode-convert.htm

[Aside: Has anyone sighted the effbot recently? He's been very quiet.]
Nick Matzke said:
2. Take the unicode and convert to approximate plain ASCII matches with
unicodedata (after import unicodedata)

ascii_content2 = unescape(line)

ascii_content = unicodedata.normalize('NFKD',
unicode(ascii_content2)).encode('ascii','ignore')

The normalize hack gets you only so far. Many Latin-based characters are not
decomposable. Look for the thread in this newsgroup with subject "convert
unicode characters to visibly similar ascii characters" around 2008-07-01 or
google("hefferon unicode2ascii")

Alternative: If you told us which platform you are running on, people familiar
with that platform could help you set up your terminal to display non-ASCII
characters correctly.
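
Or, without touching the terminal settings, encode explicitly on the way
out (this assumes a UTF-8-capable terminal):

============
import sys, codecs

# Wrap stdout so printing unicode objects encodes to UTF-8 instead of
# the default 'ascii' codec; nothing gets deleted or mangled.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'Jyv\xe4skyl\xe4 University Museum'
============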

HTH,
John
 
