trying to parse XML from an email...

rtl

I am retrieving an email with XML content. I strip off the email
headings and save only the XML portion to a file. Subsequently I
parse through the XML using XML::Simple.

Most times the XML parses fine, but there are times when there is some
extra encoding, such that all ='s (equal signs) are followed by '3D'.
This seems to happen when certain characters such as TM or (R) or (c)
are included in some of the data.

I do not know which modules would help me decode the XML correctly.
I have tried MIME::QuotedPrint, which fixes most of the encoding, but
characters such as TM arrive as =E2=84=A2 and are decoded as three
separate characters rather than as a single character.

Any help would be great. I am running Perl 5.6.1 on Win2k. I am
swimming in modules and terms including MIME, base64 and UTF-8 -- but
to no avail.

Thank you.
 
Bastian Ballmann

rtl said:
Most times the XML parses fine, but there are times when there is some
extra encoding, such that all ='s (equal signs) are followed by '3D'.

3D is the hexadecimal representation of the character '=' in ASCII.
Maybe there is some encoding stuff fooling your parser and you should
first decode hexadecimal values to ascii?
HTH & Greets

Basti
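
To make the "three separate characters" symptom concrete, here is a minimal sketch of the two decoding steps involved. It assumes the XML is UTF-8 (which is what =E2=84=A2 for the trademark sign suggests) and that the Encode module is available; Encode ships with Perl 5.8 and later, so it would not be present on a stock 5.6.1 install. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use Encode qw(decode);

# Made-up quoted-printable fragment, as it might appear in the raw message.
my $raw = 'name=3D"Widget=E2=84=A2"';

# Step 1: undo the transfer encoding. The result is a string of bytes,
# so the trademark sign is still three separate bytes here.
my $bytes = decode_qp($raw);

# Step 2: interpret those bytes as UTF-8 to get characters.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
# prints: bytes: 16, characters: 14
```

Stopping after step 1 is what leaves the trademark sign looking like three characters: the quoted-printable layer is gone, but the UTF-8 layer has not been interpreted yet.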
 
news

rtl said:
I am retrieving an email with XML content. I strip off the email
headings and save only the XML portion to a file. Subsequently I
parse through the XML using XML::Simple.
Most times the XML parses fine, but there are times when there is some
extra encoding, such that all ='s (equal signs) are followed by '3D'.
This seems to happen when certain characters such as TM or (R) or (c)
are included in some of the data.

You need to handle the content encoding of the message correctly. Use an
email MIME parser (see MIME::Tools and its MIME::Parser) to deconstruct
the message and then pump the decoded XML into your XML parser.

Chris
 
Joe Smith

I am retrieving an email with XML content. I strip off the email
headings and save only the XML portion to a file.

You should not blindly strip off all of the email headers, especially
the ones that state how the rest of the message is encoded.
Watch out for quoted-printable, uuencode, and BASE64.

You should use a MIME module to parse the message, locate the
attachment, decode it in a manner consistent with the attachment's
headers, and write the decoded data to a file.
-Joe
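
A rough sketch of the approach described above, assuming the MIME-tools distribution from CPAN is installed. The sample message, its addresses and the XML content are invented for illustration; a real message would come from the mailbox, and the content types worth matching depend on what the sender actually uses.

```perl
use strict;
use warnings;
use MIME::Parser;    # from the MIME-tools distribution on CPAN

# Made-up message for illustration only.
my $message = join "\n",
    'From: sender@example.com',
    'Subject: order data',
    'MIME-Version: 1.0',
    'Content-Type: text/xml; charset="UTF-8"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<item name=3D"Widget=E2=84=A2"/>',
    '';

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not temp files

my $entity = $parser->parse_data($message);

# Walk every part (the message itself included) and take the first XML one.
my $xml;
for my $part ($entity->parts_DFS) {
    next unless $part->effective_type =~ m{^(?:text|application)/xml$};
    my $body = $part->bodyhandle or next;    # multipart containers have no body
    $xml = $body->as_string;                 # transfer encoding already undone
    last;
}

print defined $xml ? $xml : "no XML part found\n";
```

The string in $xml still holds bytes in whatever charset the Content-Type header declares, so the header is worth keeping around rather than discarding.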
 
rtl

You should not blindly strip off all of the email headers, especially
the ones that state how the rest of the message is encoded.
Watch out for quoted-printable, uuencode, and BASE64.

You should use a MIME module to parse the message, locate the
attachment, decode it in a manner consistent with the attachment's
headers, and write the decoded data to a file.
-Joe

I've taken everyone's advice and used a MIME module instead of
stripping the headers. So far it's worked like a charm. Now, I still
need some more advice. I am sending the data from the XML back out
as: 1) an email and 2) MySQL inserts.

I am now trying to figure out how to "encode" them properly so the
special symbols (trademark, etc.) display properly on the receiving
ends (email, SQL). I have tried setting the bits => "8" part of the
Net::SMTP module when sending the mail, but no luck yet.

Any more pointers? So far your responses have been great. Thanks
again!
 
Alan J. Flavell

I've taken everyone's advice and used a MIME module instead of
stripping the headers.

Quite right, too.
So far it's worked like a charm. Now, I still
need some more advice. I am sending the data from the XML back out
as: 1) an email and 2) MySQL inserts.

I am trying to now figure out how to "encode" them properly so the
special symbols (trademark, etc) display properly on the receiving
ends (email, SQL).

If you knew what you wanted, I don't think you'd have the slightest
difficulty in writing Perl code to achieve it. So you don't seem to
have a Perl problem.

I have tried setting the bits => "8" part of the
Net::SMTP module when sending the mail, but no luck yet.

That sounds almost as bad an error report as "it doesn't work".

I'd have no difficulty in expressing the trademark sign in XML, in
a number of different ways, but unless you understand enough XML I'm
not sure what sense you'd make of the answer.

Once you've got your XML representation sorted out, then MIME-encoding
it for transmission as email is pretty much the converse of what
you've already done, no? And indeed you'd do it with an appropriate
MIME module.
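
As a sketch of that converse direction, the following encodes a character string as UTF-8 and then as quoted-printable, and labels it with matching headers. It assumes Perl 5.8 or later for Encode, and the text and the choice of UTF-8 are illustrative only. A MIME module would normally build the headers for you; they are written out by hand here just to show what has to agree with what.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::QuotedPrint qw(encode_qp);

# Character string containing a trademark sign (U+2122).
my $text = "Widget\x{2122} is now available.\n";

# Characters -> UTF-8 bytes -> 7-bit-safe quoted-printable.
my $bytes = encode('UTF-8', $text);
my $body  = encode_qp($bytes);

# The headers must describe exactly the encodings applied above.
my $message = join "\n",
    'MIME-Version: 1.0',
    'Content-Type: text/plain; charset="UTF-8"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    $body;

print $message;
```

Because the body is plain 7-bit text after this step, the receiving end relies on the headers, not on an 8-bit transport setting, to reconstruct the trademark sign.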
Any more pointers?

If you're not comfortable with handling character codings, then the
safest way to represent non-US-ASCII characters in XML is as
character references. It's far from being the most economical
representation, especially if there are a lot of non-US-ASCII
characters of course, but it's pretty safe. Look up the characters at
the Unicode site - or most of the ones commonly needed are listed at
the end of the HTML4.01 specification.[1]

I'm afraid the corresponding answer in a MySQL context is beyond my
usual working range, so I'll leave that for someone else.

good luck

[1] don't be misled by bogus references that include displayable
characters between 127 and 159 decimal inclusive.
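
A minimal sketch of the character-reference approach, assuming the text has already been decoded into a Perl character string. The helper name to_char_refs and the sample text are made up for illustration, and the helper does not escape markup characters such as & or <, which would need separate handling.

```perl
use strict;
use warnings;

# Replace every non-US-ASCII character with a hexadecimal character reference.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

my $out = to_char_refs("Widget\x{2122} \x{A9} 2004 Acme\x{AE}");
print "$out\n";
# prints: Widget&#x2122; &#xA9; 2004 Acme&#xAE;
```

The output is pure US-ASCII, so it survives any transfer encoding unchanged, at the cost of several bytes per special character.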
 
