Removing whitespace using regexp

Arun Kumar · May 6, 2009

Hi,
Previously I posted a topic on how to strip all HTML tags and get
the remaining text using a regexp. Luckily I got one. This is the regexp:

/([^>]*)(?=<[^>]*?>)/im

With it I'm able to get all the data between the HTML tags, but there is
one small problem. I'm getting output like this:

Example Web Page

You have reached this web page by typing "example.com",
"example.net",
or "example.org" into your web browser.
These domain names are reserved for use in documentation and are not
available
for registration. See RFC
2606, Section 3.

This is the output I get when I parse the HTML content of
example.com using the above regexp. Here you can see some whitespace
between the data (i.e. between 'Example Web Page' and 'You have
reached...'). This whitespace is generated in place of the HTML tags
which I skipped using the above regexp. I want to remove that
whitespace. I think that modifying the above regexp will give me the
right output without whitespace. Can somebody please help me?

Thanks
Arun

Simon Krahnke · May 6, 2009

* Arun Kumar said:
Hi,
Previously I posted a topic on how to strip all HTML tags and get
the remaining text using a regexp. Luckily I got one. This is the regexp:

/([^>]*)(?=<[^>]*?>)/im

And what do you do with this regexp?

With it I'm able to get all the data between the HTML tags, but there is
one small problem.

Hasn't everybody told you that there are problems with parsing HTML with regexps?

This is the output I get when I parse the HTML content of
example.com using the above regexp. Here you can see some whitespace
between the data (i.e. between 'Example Web Page' and 'You have
reached...'). This whitespace is generated in place of the HTML tags
which I skipped using the above regexp.

Really? Aren't they just from all the meaningless whitespace that's in
a typical HTML document?
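
A minimal sketch of how one could check this, assuming the regexp is applied with String#scan (the thread never says how it is used). The sample HTML is hypothetical, not the real example.com page, and the variable names are made up for illustration. It suggests that the whitespace-only pieces are the line breaks already present between tags in the source:

```ruby
# Hypothetical sample input; the thread's real input was the HTML of example.com.
html = "<h1>Example Web Page</h1>\n<p>You have reached this web page.</p>\n"

# The regexp quoted in the thread. Using it with String#scan is an assumption.
tag_text = /([^>]*)(?=<[^>]*?>)/im

# scan returns one single-element array per match, so flatten the result.
raw = html.scan(tag_text).flatten

# Some captures are empty or whitespace-only. They come from the line breaks
# sitting between tags in the source, not from the tags themselves.
whitespace_only = raw.select { |piece| piece.strip.empty? }

# Trimming every piece and dropping the empty ones leaves the visible text.
text = raw.map(&:strip).reject(&:empty?)

puts text
```

Note that this cleans up the captures after the match rather than changing the regexp, and it does not use gsub.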

I want to remove that
whitespace. I think that modifying the above regexp will give me the
right output without whitespace. Can somebody please help me?

There are easy ways to strip all the whitespace, which is certainly not
what you want, and there is a simple way to replace every run of whitespace
with a single space (gsub(/\s+/, ' ')), which is probably also not what you
want.

Selectively removing some of the whitespace isn't easy at all, but it is
probably a lot easier with a real HTML parser.
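
To make the difference between these options concrete, here is a small sketch on a hypothetical sample string shaped like the output quoted above. The third variant is only one possible "selective" policy (trim each line, drop blank lines), chosen here as an assumption; it is not something the thread settles on:

```ruby
# Hypothetical sample text with blank lines and stray indentation.
text = "Example Web Page\n\n   You have reached\n this page.\n\n"

# 1. Remove all whitespace: the words run together.
no_whitespace = text.gsub(/\s+/, "")

# 2. Replace every run of whitespace with one space: line breaks are lost too.
single_spaces = text.gsub(/\s+/, " ").strip

# 3. One selective policy (an assumption): trim every line, drop the blank
#    ones, and keep the remaining line breaks.
trimmed_lines = text.lines.map(&:strip).reject(&:empty?).join("\n")

puts no_whitespace
puts single_spaces
puts trimmed_lines
```

Which of the three is "right" depends on what the output is for. A real HTML parser would still be needed to know which whitespace is significant in the page.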

mfg, simon .... l

Sriram Varahan · May 6, 2009

Hey Arun,

How about doing a gsub on the output to remove the whitespace?

For example:

"Example Web Page".gsub(" ","")

This would remove the white spaces.

Hope this helps.

Regards
Sriram.

Srijayanth Sridhar · May 6, 2009

I know your boss, or whoever it is who is dangling your carrots, won't let
you use Hpricot, but tell him you will use Hpricot to get properly formatted
HTML and then write a parser to parse the properly formatted HTML. Even he
can't be opposed to that (seeing as how he wants you to reinvent wheels).
That way you can get rid of your whitespace problem and deal with the cosmos
at large.

Jayanth

Robert Klemme · May 6, 2009

2009/5/6 Sriram Varahan said:
How about doing a gsub on the output to remove the whitespace?

For example:

"Example Web Page".gsub(" ","")

This would remove the white spaces.

I would rather do

s.gsub(/\s+/, ' ')

because your statement removes *all* spaces, including the ones between words:

irb(main):002:0> "Example Web Page".gsub(" ","")
=> "ExampleWebPage"

This is usually not what you want.

Hope this helps.

Ditto.

Cheers

robert

Mark Thomas · May 7, 2009

Srijayanth Sridhar said:
I know your boss, or whoever it is who is dangling your carrots, won't let
you use Hpricot, but tell him you will use Hpricot to get properly formatted
HTML and then write a parser to parse the properly formatted HTML. Even he
can't be opposed to that (seeing as how he wants you to reinvent wheels).
That way you can get rid of your whitespace problem and deal with the cosmos
at large.

Here's what we know about Arun from his previous posts:
* He is a "trainee" doing assignments.
* He is learning Ruby.
* "Nobody else" around him knows Ruby.
* His "boss"/teacher is giving him specific assignments that seem to
be purely academic exercises, because the constraints (e.g. don't use
gsub, parse "example.com") would otherwise be completely ridiculous.
* He doesn't have the authority to re-scope the assignment or offer
alternate solutions.

I believe he is asking us to do his homework.

Srijayanth Sridhar · May 7, 2009

How many places do you know that have extensive Ruby training programs that
expect you to write HTML parsers armed with nothing but regular expressions?

I live in Bangalore, and I don't know one. Either he truly has a sadistic
boss, or his truth is stranger than fiction. I don't doubt that it is
homework of some sort.

Jayanth
