Ruby method to strip out XML codes?

  • Thread starter Michael W. Ryder

Michael W. Ryder

I am trying to process an XML file that includes various codes. The
problem I am running into is that some of these codes are inserted into
the middle of an encrypted string. If I display the file in a browser,
these codes do not show up, and copying and pasting the string works
fine. The problem occurs when I try to extract the string in a program
and these "extraneous" XML codes are included. This of course makes the
decryption routine crash.
What I am looking for is a simple way to read through the file and
remove all the XML codes, leaving just plain text. I could probably
write a series of regular expressions to remove each code that I can
find in my text, but I am afraid I might miss some and it will come back
to haunt me later.
 

Phrogz


str.gsub(/<\/?[^>]+>/, '')

This only breaks if your XML file is legal and has a CDATA section
containing a literal < character (not &lt;), like:

for ( var i=0, len=a.length; i<len; ++i )

In that case you likely want to use a proper XML parser (like REXML)
instead.
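
A minimal sketch of the parser route, assuming the whole file fits in
memory and that REXML is available (it ships with most Ruby installs).
The helper name xml_to_text and the sample record are made up for
illustration. It collects every text node, CDATA included, so a literal
< inside CDATA survives and entities such as &amp; come back decoded:

```ruby
require 'rexml/document'

# Hypothetical helper (not from the thread): gather every text node in
# the document, including CDATA sections, and join them into one string.
def xml_to_text(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//text()').map(&:value).join
end

# Made-up sample: one escaped entity and one CDATA section with a raw <.
sample = '<rec><name>A &amp; B</name><code><![CDATA[i<len]]></code></rec>'
puts xml_to_text(sample)
```

Joining all text nodes throws away field boundaries, so for real data
you would probably pull out individual elements rather than the whole
document.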

Do you really want to remove the XML, or would it suffice to just:

str.gsub! '&', '&amp;'
str.gsub! '<', '&lt;'
str.gsub! '>', '&gt;'
(and maybe even)
str.gsub! '"', '&quot;'
str.gsub! "'", '&apos;'

to make your string valid and escaped for use in an HTML context?
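
The replacements above only work in that order: & has to go first, or
the & in the freshly inserted &lt; would get escaped again. A sketch of
a single-pass variant that avoids the ordering issue, using gsub with a
hash (the names XML_ESCAPES and escape_markup are just for
illustration):

```ruby
# Hypothetical lookup table covering the five characters listed above.
XML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&apos;'
}.freeze

# Escape all five characters in one pass. Each character is looked up
# once, so text that has already been replaced is never touched again.
def escape_markup(str)
  str.gsub(/[&<>"']/, XML_ESCAPES)
end

puts escape_markup(%q{if a < b && c == "d"})
# prints: if a &lt; b &amp;&amp; c == &quot;d&quot;
```

Unlike the gsub! calls, this returns a new string and leaves the
original alone.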
 

Michael W. Ryder

My problem is that the XML file includes a character string in the
middle of a couple of fields, especially in the encrypted fields. If I
just pull out the encrypted field and try to decrypt it, the program
crashes because the key is invalid. I have to remove the "bad" character
strings before sending it to my decryption program. I would prefer to do
this removal before sending the file to my programs so that I don't have
to deal with these codes.
I assume that the string I am seeing is XML's way of saying CR/LF, since
0D 0A in hex is CR/LF and the output in a browser shows the field being
broken at that point. The problem is that those are only the ones I have
noticed, and there may be others hiding in the data. The XML file is
being parsed for conversion to our accounts.
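
For the CR/LF case described above, one possible approach is to decode
every numeric character reference rather than hunting for specific
ones, then drop whitespace from the encrypted value. This is only a
sketch: the helper names and the sample value are invented, and it
assumes whitespace is never a meaningful part of the encrypted string.

```ruby
# Decode numeric character references such as &#xD; (hex) or &#10;
# (decimal) into the characters they stand for.
def decode_numeric_refs(str)
  str.gsub(/&#(x[0-9a-fA-F]+|\d+);/) do
    ref = Regexp.last_match(1)
    code = ref.start_with?('x') ? ref[1..-1].hex : ref.to_i
    [code].pack('U')
  end
end

# Assumption: the encrypted field never legitimately contains
# whitespace, so CR, LF, tabs and spaces can all be removed.
def clean_field(str)
  decode_numeric_refs(str).gsub(/\s+/, '')
end

# Made-up field value with an embedded CR/LF reference pair.
puts clean_field('QUJD&#xD;&#xA;REVG')
# prints: QUJDREVG
```

Because every numeric reference is decoded first, a code that has not
been noticed yet (a tab, a lone LF) is caught by the same whitespace
strip.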
 
