Text Parsing Help

Jester Mania

Greetings,

I am new to Ruby and programming and am trying to parse a text file, but
encountered some difficulties.

Basically, the text file contains lines in the following format (where
\n is not really a newline but the text "\n"):

TextString [SYMBOL]\nDefinition

I need to replace the text \n with a tab, as I am attempting to separate
all the "tokens" by tabs. The issue here is that \n is also the escape
sequence for a newline character. I searched the forums and tried the
following code:

lineItem = line.gsub("\\\\n", "\t")

but it doesn't seem to work and the \n is not being replaced. How can I
convert the \n text into a tab?

Any help is greatly appreciated!
 
Jeremy Bopp

In Ruby, the literal "\n" is a string consisting of only a newline
character. If you want the string to literally be backslash n (\n),
then you would use "\\n". The backslash is a special character within
string literals, so if you want it to appear literally in your string,
you have to escape it with another backslash. Your example in the gsub
call above is actually creating a search string of backslash backslash n
(\\n) because you have 4 backslashes preceding the n, but that text does
not appear in your input.
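
To make the backslash counting concrete, here is a minimal sketch. The sample line and the variable names are made up for illustration; only the escape rules themselves come from the explanation above.

```ruby
# Double-quoted literals: each backslash escapes the character after it.
newline     = "\n"     # 1 character: a real newline
backslash_n = "\\n"    # 2 characters: backslash, n
too_many    = "\\\\n"  # 3 characters: backslash, backslash, n

# Hypothetical sample line; single quotes keep the backslash and n as typed.
line = 'TextString [SYMBOL]\nDefinition'

fixed     = line.gsub("\\n", "\t")    # pattern is backslash + n, which occurs in line
unchanged = line.gsub("\\\\n", "\t")  # pattern is two backslashes + n, which does not

p [newline.length, backslash_n.length, too_many.length]  # [1, 2, 3]
p fixed                                                   # "TextString [SYMBOL]\tDefinition"
p unchanged == line                                       # true
```

With a String as the first argument, gsub matches the text literally, so no regular expression escaping is involved on top of the string escaping.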

-Jeremy
 
Peter Vandenabeele

This may be useful:

$ irb
# ...ed as newline
=> "car\\nplane\\ntrain \\n boat"
# ...<TAB>
=> "car\tplane\ttrain \t boat"
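
A sketch of commands that produce the same two results, assuming a single-quoted literal and a variable named s. Both are assumptions for illustration, not necessarily what was typed in the session above.

```ruby
# Single quotes: \n stays as backslash + n instead of becoming a newline.
s = 'car\nplane\ntrain \n boat'
p s   # "car\\nplane\\ntrain \\n boat" (inspect shows the backslash escaped)

# The regexp /\\n/ matches a literal backslash followed by the letter n.
t = s.gsub(/\\n/, "\t")
p t   # "car\tplane\ttrain \t boat"
```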
Peter
 
Jester Mania

Thanks for the help! I have a question though regarding Peter's reply,
where single quotes keep \n from being interpreted as a newline.


Currently, my code is:

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
end

How would I use the ' ' with the line variable?
 
Jesús Gabriel y Galán

You don't need it, because what you read from the file is already the
character '\' followed by the character 'n'. Peter needed it because he
was typing Ruby string literals.
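
A small sketch of that point: escape rules apply only to literals in source code, not to data that is read in. StringIO stands in for input.txt here, which is an assumption made only to keep the example self-contained.

```ruby
require "stringio"

# Stand-in for input.txt: backslash + n as two ordinary characters,
# followed by one real newline that ends the line.
file = StringIO.new('car\nplane' + "\n")

line = file.gets

p line.length             # 11: "car" (3) + backslash, n (2) + "plane" (5) + newline (1)
p line.count("\n")        # 1: only the real line ending counts as a newline
p line.gsub(/\\n/, "\t")  # "car\tplane\n"
```

The same gsub call works unchanged on a line returned by IO.readlines, since that line holds the characters exactly as they are stored in the file.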

Jesus.
 
Jester Mania

Yes, but I tried the code and it is still not working. I used a puts
statement to output the results to see whether the "\n" text was truly
being replaced by a tab.

#!/usr/bin/ruby -w

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
  puts lineItem.split("\t")
end

However, the results were that the output still had \n text.
 
Peter Vandenabeele

I hope my example below can explain what happens.

$ ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

I used this input.txt file for testing:

<start of file>
car\nplane\ntrain \n boat

second line, first token \n second token
<end of file>

irb(main):013:0> IO.readlines("input.txt").each do |line|
irb(main):014:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):015:1> puts lineItem.split("\t").inspect
irb(main):016:1> end
["car", "plane", "train ", " boat\n"] # the first line is parsed and split correctly into this array
["\n"] # the second line only has a newline
["second line, first token ", " second token\n"] # correct too
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]

# this last line is the result of IO.readlines("input.txt"), because the
# "each" method eventually returns self after having iterated over all elements

irb(main):017:0> IO.readlines("input.txt").each do |line|
irb(main):018:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):019:1> puts lineItem.split("\t")
irb(main):020:1> end
car
plane
train
boat

second line, first token
second token
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]


So, one trick is to use .inspect and .class to better understand what
kind of object you are looking at and what its content really is.
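
As a quick illustration of why inspect helps, here is a sketch using the token array from the first line of the test file. The variable name tokens is chosen for this example only.

```ruby
tokens = ["car", "plane", "train ", " boat\n"]

puts tokens             # one element per line; quotes, padding and "\n" are not visible
puts tokens.inspect     # ["car", "plane", "train ", " boat\n"]
p tokens                # p prints the inspect form directly
puts tokens.class       # Array
puts tokens.last.class  # String
```

The plain puts output looks the same whether the elements were split correctly or not, which is why the inspect form is the one to check.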

Also, you could use chomp to get rid of the newline at the end of the
last entry in your array of tokens.
So, a shorter piece of code that may be useful is:

irb(main):025:0> IO.readlines("input.txt").map do |line|
irb(main):026:1* line.chomp.gsub(/\\n/, "\t")
irb(main):027:1> end
=> ["car\tplane\ttrain \t boat", "", "second line, first token \t second token"]

Now there are the <TAB> delimiters that you wanted between the tokens
in the resulting output.
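
The same steps as a small script, for anyone who prefers that to irb. This is only a sketch: the lines array stands in for the result of IO.readlines("input.txt"), and dropping blank lines is an assumption about what is wanted, not something asked for in the thread.

```ruby
# Sample lines as IO.readlines would return them for the test file above.
lines = [
  "car\\nplane\\ntrain \\n boat\n",
  "\n",
  "second line, first token \\n second token\n"
]

# chomp removes the trailing newline, gsub turns backslash + n into a tab.
rows = lines.map { |line| line.chomp.gsub(/\\n/, "\t") }

# Assumption: lines that are empty after chomp are not wanted.
rows = rows.reject(&:empty?)

# Each row can now be split into its tokens on the tab character.
token_lists = rows.map { |row| row.split("\t") }

p rows
p token_lists
```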

HTH,

Peter
 
Josh Cheek

"\n" is a newline
"\\n" is a backslash, letter n
'\n' is the same as "\\n" but you can ignore that if it is confusing,
because it only counts when you enter it as a literal.

You say you want to see whether "\n" is being replaced by a tab, but you are
replacing /\\n/ (by the way, you could use a string here instead of a regexp).
You say the output still has \n in the text. By that, I assume you mean it has
a newline, which you are misinterpreting as the "\\n" that you replaced. If
this is accurate, you should decide whether you wish to replace "\n" or "\\n".
As Peter said, using inspect (i.e. puts line.inspect) is a good way to see
your String data.

Also, if you don't already have tabs that you also wish to split on, then
you don't need the gsub step, you can just split on the "\\n". Here are a
couple of examples to hopefully make it a little easier to see.
"a\nb\\nc".split("\\n") # => ["a\nb", "c"]
"a\nb\\nc".split("\n") # => ["a", "b\\nc"]
"a\nb\\nc\td".gsub("\\n","\t").split("\t") # => ["a\nb", "c", "d"]
"a\nb\\nc\td".gsub("\n","\t").split("\t") # => ["a", "b\\nc", "d"]
 
Jester Mania

Peter/Josh,

Thanks once again for the helpful posts. I am learning quite a bit
which is good. However, I just tried to replicate Peter's example and
when I attempted to use the .inspect method, the output was not what I
expected:

INPUT FILE <input.txt>
--------------------------
car\nplane\ntrain \n boat

second line, first token \n second token
--------------------------

OUTPUT <windows cmd console>
--------------------------
["\377\376c\000a\000r\000\\\000n\000p\000l\000a\000n\000e\000\\\000n\000t\000r\0
00a\000i\000n\000 \000\\\000n\000 \000b\000o\000a\000t\000\r\000\n"]
["\000\r\000\n"]
["\000s\000e\000c\000o\000n\000d\000 \000l\000i\000n\000e\000,\000
\000f\000i\00
0r\000s\000t\000 \000t\000o\000k\000e\000n\000 \000\\\000n\000
\000s\000e\000c\0
00o\000n\000d\000 \000t\000o\000k\000e\000n\000"]
 
Jester Mania

Ah hah! I figured it out, the txt file had the wrong encoding. I
encoded it with UTF-8 in Notepad++ and everything works as expected. I
thank everyone for writing these meaningful replies.
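
For anyone who finds this later: the \377\376 at the start of that output is the UTF-16 little-endian byte order mark, and the \000 after every character is the second byte of each UTF-16 code unit, so the backslash and the n were never next to each other and /\\n/ could not match. Re-saving the file as UTF-8 is one fix. Another is to convert in Ruby, as in the sketch below. It assumes Ruby 2.0 or later (the 1.8.7 used earlier in the thread has no String#encode), and it builds the bytes in memory instead of reading a real file.

```ruby
# Bytes as they would sit in a UTF-16LE file with a byte order mark (BOM).
# Built in memory here; for a real file, File.binread would supply them.
raw = "\uFEFFcar\\nplane\r\n".encode("UTF-16LE").b

p raw.include?("\\n")   # false: a zero byte sits between the backslash and the n

# Reinterpret the bytes as UTF-16LE and transcode them to UTF-8.
text = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")
text = text.sub(/\A\uFEFF/, "")   # drop the BOM character if present

converted = text.chomp.gsub(/\\n/, "\t")
p converted             # "car\tplane"
```

chomp also removes the Windows "\r\n" line ending, so no stray carriage return is left on the last token.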
 
