whitespace string only

David A. Black

Hi --


svg% ruby -rjj -e '/[^\s]/.dump'
Regexp /[^\s]/
0 charset_not \011-\015 (0)
1 end
svg%

D> self !~ /\S/

svg% ruby -rjj -e '/\S/.dump'
Regexp /\S/
0 charset_not \011-\012\014-\015 (0)
1 end

Ugh. So \013 (vertical tab) is defined as whitespace:

irb(main):008:0> /[\s]/.match("\013")
=> #<MatchData:0x401d75a0>

and non-whitespace:

irb(main):007:0> /\S/.match("\013")
=> #<MatchData:0x401d9c38>

Rather hard to deduce....


David
 
YANAGAWA Kazuhisa

In Message-Id: <[email protected]>
Henrik Horneber said:
> What's the best way to test if a string only consists of whitespaces
> and newlines?

What about this?

string !~ /\S/

where "\S" means the complement of "\s". If your notion of whitespace
differs from "\s", you can use an appropriate character class instead, say
"[^ \n]", which matches any character except a space or a line feed.
 
Ara.T.Howard

> In Message-Id: <[email protected]>
> Henrik Horneber said:
> > What's the best way to test if a string only consists of whitespaces
> > and newlines?
>
> What about this?
>
> string !~ /\S/
>
> where "\S" means the complement of "\s". If your notion of whitespace
> differs from "\s", you can use an appropriate character class instead, say
> "[^ \n]", which matches any character except a space or a line feed.


i use this a lot:


if s.strip.empty?
  # the string is whitespace only
end

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
ts

A> if s.strip.empty?
A> # the string is whitespace only

svg% ruby -e 'a = " \000\000"; p "OK" if a.strip.empty?'
"OK"
svg%

svg% ruby -e 'a = " \000\000 "; p "OK" if a.strip.empty?'
svg%


Guy Decoux
 
Robert Klemme

ts said:
> A> if s.strip.empty?
> A> # the string is whitespace only
>
> svg% ruby -e 'a = " \000\000"; p "OK" if a.strip.empty?'
> "OK"
> svg%
>
> svg% ruby -e 'a = " \000\000 "; p "OK" if a.strip.empty?'
> svg%

Also I'd say the disadvantage of "a.strip.empty?" is that it creates a
copy of the string (=> a new instance) which is generally slower than a
simple regexp check.

Kind regards

robert
 
Ara.T.Howard

> A> if s.strip.empty?
> A> # the string is whitespace only
>
> svg% ruby -e 'a = " \000\000"; p "OK" if a.strip.empty?'
> "OK"
> svg%
>
> svg% ruby -e 'a = " \000\000 "; p "OK" if a.strip.empty?'
> svg%

ahh! that's terrible - i didn't know String#strip did that! the docs say

"Returns a copy of str with leading and trailing whitespace removed."

since when is NUL whitespace!? definitely against POLS.

thanks for the pointer.

-a
 
Ara.T.Howard

> Also I'd say the disadvantage of "a.strip.empty?" is that it creates a copy
> of the string (=> a new instance) which is generally slower than a simple
> regexp check.

i assumed you were correct - but this is surprising:

harp:~ > ruby b.rb
-
small string strip-empty:
elapsed : 0.0081329345703125
-
small string re:
elapsed : 0.005950927734375
-
small string re-precompiled:
elapsed : 0.00719404220581055
-
big string strip-empty:
elapsed : 0.263929843902588
-
big string re:
elapsed : 5.26733493804932
-
big string re-precompiled:
elapsed : 5.51002883911133

harp:~ > cat b.rb
$VERBOSE = nil
STDOUT.sync = true

# Run the block in a forked child with GC disabled and print the elapsed
# wall-clock time, so every measurement starts from a fresh process.
def time label
  fork do
    GC.disable
    puts "-\n#{ label }:\n"
    a = Time::now.to_f
    yield
    b = Time::now.to_f
    puts " elapsed : #{ b - a }"
  end
  Process::wait
end

s = "42"            # small string; note that it contains no whitespace
bs = s * 8192       # big string: 16384 characters
rep = %r/^\s*$/o    # pattern built once, outside the timed loops


time('small string strip-empty') do
  8192.times{ s.strip.empty? }
end
time('small string re') do
  8192.times{ s =~ %r/^\s*$/ }
end
time('small string re-precompiled') do
  8192.times{ s =~ rep }
end
time('big string strip-empty') do
  8192.times{ bs.strip.empty? }
end
time('big string re') do
  8192.times{ bs =~ %r/^\s*$/ }
end
time('big string re-precompiled') do
  8192.times{ bs =~ rep }
end


at least it surprised me!
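
One possible reading of these numbers, offered as a guess rather than a measurement: "42" * 8192 contains no whitespace, so /^\s*$/ can never succeed and the engine may end up ruling out every position in the string, while strip only has to look at the two ends. The sketch below uses the standard Benchmark library to time the same two tests plus the `!~ /\S/` form suggested earlier in the thread. The helper names and the iteration count `n` are assumptions of this sketch, and timings will differ between Ruby versions and machines.

```ruby
require 'benchmark'

# Hypothetical helper names; each wraps one of the tests under discussion.
def strip_empty?(str)
  str.strip.empty?
end

def anchored_blank?(str)
  !(str =~ /^\s*$/).nil?
end

def no_non_space?(str)
  str !~ /\S/
end

small = "42"          # same inputs as b.rb: no whitespace at all
big   = small * 8192
n     = 200           # assumed iteration count, kept small on purpose

Benchmark.bm(24) do |bm|
  { 'small' => small, 'big' => big }.each do |size, str|
    bm.report(size + ' strip.empty?') { n.times { strip_empty?(str) } }
    bm.report(size + ' =~ /^\s*$/')   { n.times { anchored_blank?(str) } }
    bm.report(size + ' !~ /\S/')      { n.times { no_non_space?(str) } }
  end
end
```

All three helpers return false for both inputs here, so the comparison is about how quickly each one can reject a non-blank string.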

regards.


-a
 
ts

A> rep = %r/^\s*$/o

/o is useless and you make something too complex for the regexp engine


Guy Decoux
 
Robert Klemme

> i assumed you were correct - but this is surprising:

I have different results:

                     user     system      total        real
rx =~ s          0.031000   0.000000   0.031000 (  0.023000)
rx =~ bs         0.016000   0.000000   0.016000 (  0.022000)
rx !~ s          0.031000   0.000000   0.031000 (  0.025000)
rx !~ bs         0.016000   0.000000   0.016000 (  0.027000)
RX1 =~ s         0.031000   0.000000   0.031000 (  0.030000)
RX1 =~ bs        0.031000   0.000000   0.031000 (  0.030000)
RX2 !~ s         0.047000   0.000000   0.047000 (  0.039000)
RX2 !~ bs        0.032000   0.000000   0.032000 (  0.033000)
s =~ rx          0.031000   0.000000   0.031000 (  0.024000)
bs =~ rx         0.015000   0.000000   0.015000 (  0.024000)
s !~ rx          0.032000   0.000000   0.032000 (  0.026000)
bs !~ rx         0.031000   0.000000   0.031000 (  0.026000)
s =~ RX1         0.031000   0.000000   0.031000 (  0.031000)
bs =~ RX1        0.031000   0.000000   0.031000 (  0.031000)
s !~ RX2         0.032000   0.000000   0.032000 (  0.030000)
bs !~ RX2        0.031000   0.000000   0.031000 (  0.034000)
s.strip.empty?   0.062000   0.000000   0.062000 (  0.054000)
bs.strip.empty?  0.047000   0.000000   0.047000 (  0.050000)
                     user     system      total        real
rx =~ s          0.032000   0.000000   0.032000 (  0.022000)
rx =~ bs         0.015000   0.000000   0.015000 (  0.023000)
rx !~ s          0.031000   0.000000   0.031000 (  0.024000)
rx !~ bs         0.016000   0.000000   0.016000 (  0.025000)
RX1 =~ s         0.031000   0.000000   0.031000 (  0.030000)
RX1 =~ bs        0.032000   0.000000   0.032000 (  0.031000)
RX2 !~ s         0.031000   0.000000   0.031000 (  0.031000)
RX2 !~ bs        0.031000   0.000000   0.031000 (  0.032000)
s =~ rx          0.031000   0.000000   0.031000 (  0.025000)
bs =~ rx         0.032000   0.000000   0.032000 (  0.024000)
s !~ rx          0.015000   0.000000   0.015000 (  0.028000)
bs !~ rx         0.031000   0.000000   0.031000 (  0.026000)
s =~ RX1         0.032000   0.000000   0.032000 (  0.032000)
bs =~ RX1        0.015000   0.000000   0.015000 (  0.031000)
s !~ RX2         0.032000   0.000000   0.032000 (  0.031000)
bs !~ RX2        0.031000   0.000000   0.031000 (  0.033000)
s.strip.empty?   0.062000   0.000000   0.062000 (  0.051000)
bs.strip.empty?  0.047000   0.000000   0.047000 (  0.048000)
18:05:36 [ruby]:


Regards

robert
 
Ara.T.Howard

> A> rep = %r/^\s*$/o
>
> /o is useless and you make something too complex for the regexp engine

why useless? you mean because there's nothing to interpolate here - that's
true in this case...

what do you mean by complex? seems very simple?

-a
 
Markus

> since when is NUL whitespace!? definitely against POLS.

?????

Since when _isn't_ NUL whitespace? Despite the fact that it is
sometimes used as a delimiter (which is true for all the other
whitespace characters as well), it has no meaning, no glyph, does not
show up when printed--it doesn't even move the cursor/printhead. How
much more "whitespace" can you get?

-- MarkusQ
 
ts

A> what do you mean by complex? seems very simple?

For you, not for this "poor" regexp engine :)

svg% ruby -rjj -e '" ".match(/^\s*$/)'
Regexp /^\s*$/
0 begline
1 on_failure_jump ==> 4
2 charset \011-\012\014-\015 (0)
3 maybe_finalize_jump ==> 1
4 endline
5 end
Fastmap supplied : \011-\012\014-\015

String << >> pos=0

0 begline | |
1 on_failure_jump | | >4[0]
2 charset | |
3 maybe_finalize_jump | |
1 on_failure_jump | | >4[1]
2 charset | |
3 jump | |
1 on_failure_jump | | >4[2]
2 charset | |
3 jump | |
1 on_failure_jump | | >4[3]
2 charset | | F4[3]
4 endline | | SUCCESS
svg%

it really prefers to do this

svg% ruby -rjj -e '" ".match(/[^\S]/)'
Regexp /[^\S]/
0 charset_not \000-\010\016-\037!-\377 (0)
1 end
Fastmap supplied : \011-\015

String << >> pos=0

0 charset_not | | SUCCESS
svg%
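
A side note on the patterns compared in this thread, as a hedged sketch: in Ruby, ^ and $ anchor at line boundaries rather than at the ends of the string, so /^\s*$/ also succeeds on a string that merely contains an empty line. The whole-string anchors \A and \z, or the single negated class from earlier in the thread, avoid that. The helper names below are hypothetical.

```ruby
# Hypothetical helper names for illustration only.

def any_blank_line?(str)       # what /^\s*$/ actually tests
  !(str =~ /^\s*$/).nil?
end

def blank_anchored?(str)       # whole-string anchors \A and \z
  !(str =~ /\A\s*\z/).nil?
end

def blank_negated?(str)        # single character class, no repetition
  str !~ /\S/
end

text = "foo\n\nbar"            # not blank, but it has an empty line
p any_blank_line?(text)        # true
p blank_anchored?(text)        # false
p blank_negated?(text)         # false

p blank_anchored?(" \t\n")     # true
p blank_negated?(" \t\n")      # true
```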



Guy Decoux
 
Ara.T.Howard

> ?????
>
> Since when _isn't_ NUL whitespace? Despite the fact that it is sometimes
> used as a delimiter (which is true for all the other whitespace characters
> as well), it has no meaning, no glyph, does not show up when printed--it
> doesn't even move the cursor/printhead. How much more "whitespace" can you
> get?

from man

man isspace
...
isspace()
checks for white-space characters. In the "C" and "POSIX"
locales, these are: space, form-feed ('\f'), newline ('\n'),
carriage return ('\r'), horizontal tab ('\t'), and vertical tab
('\v').
...

from wikipedia

In computer science, a whitespace (or a whitespace character) is any
character which does not display itself but does take up space. For example,
the character symbol " ", which is a blank space. Whitespaces are generated by
the space bar or the Tab key; depending on context, a line-break generated by
the Return key (Enter key) may be considered whitespace as well.

Whitespace can also refer to a series of whitespace characters. Within source
code, the size of whitespace is generally ignored by free-form languages. In
the Python programming language whitespace and indentation are used for
syntactical purposes.

In many programming languages abundant use of whitespace, especially trailing
whitespace at the end of lines, is considered a nuisance.

[ \t]+ is a regular expression that matches whitespace.

The term whitespace is based on the assumption that the background color used
for text is white, and is thus confusing if it is not.


there is a long standing precedent for the meaning of whitespace. it does not
include non-printables since they do not take up any __space__. for that there
is isgraph(3)
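
For code that wants exactly the isspace(3) set quoted above, independent of what \s or String#strip happen to treat as whitespace, the six characters can be spelled out. This is a sketch only; `c_whitespace_only?` is a hypothetical name, and the vertical tab is written as \x0B inside the pattern.

```ruby
# Hypothetical helper: true when every character is one of the six listed
# for the "C" and "POSIX" locales: space, \f, \n, \r, \t and \v (0x0B).
def c_whitespace_only?(str)
  str !~ /[^ \f\n\r\t\x0B]/
end

p c_whitespace_only?(" \f\n\r\t\v")   # true
p c_whitespace_only?(" \000 ")        # false: NUL is not in the set
```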

-a
 
Ara.T.Howard

> A> what do you mean by complex? seems very simple?
>
> For you, not for this "poor" regexp engine :)
>
> [regexp dumps snipped]

fascinating!

-a
 
Markus

[abundant online rebuttal snipped]

Gosh. So I guess your answer to my question is "since the late
1970's or so", which, coincidentally, is most likely the last time I
checked. Back in the RS-232 paper tape and teletype days (ASCII)
whitespace was the same as non-printing, e.g. everything <= 040 and
sometimes 0177, while the EBCDIC definition was a little murkier.

*laugh* I wonder if anything else has changed in the last thirty
years?

Thanks,

-- Markus
 
Henrik Horneber

... ask one simple question and you end up being involved in regexp
blackmagic and a computer history discussion :)

What is this 'jj' file that is being used to analyze these regexps?

I wasn't able to find it and it looks very interesting. Any hints?

regards,

Henrik
 
Ara.T.Howard

> [abundant online rebuttal snipped]
>
> Gosh. So I guess your answer to my question is "since the late
> 1970's or so", which, coincidentally, is most likely the last time I
> checked. Back in the RS-232 paper tape and teletype days (ASCII)
> whitespace was the same as non-printing, e.g. everything <= 040 and
> sometimes 0177, while the EBCDIC definition was a little murkier.
>
> *laugh* I wonder if anything else has changed in the last thirty
> years?

lol.

perhaps i was a bit too aggressive in that reply - your sarcasm gave me a good
laugh at myself. no harm intended.

kind regards.

-a
 
Robert Klemme

ts said:
> A> what do you mean by complex? seems very simple?
>
> For you, not for this "poor" regexp engine :)
>
> svg% ruby -rjj -e '" ".match(/^\s*$/)'
> Regexp /^\s*$/
> 0 begline
> 1 on_failure_jump ==> 4
> 2 charset \011-\012\014-\015 (0)
> 3 maybe_finalize_jump ==> 1
> 4 endline
> 5 end
> Fastmap supplied : \011-\012\014-\015
>
> String << >> pos=0
>
> 0 begline | |
> 1 on_failure_jump | | >4[0]
> 2 charset | |
> 3 maybe_finalize_jump | |
> 1 on_failure_jump | | >4[1]
> 2 charset | |
> 3 jump | |
> 1 on_failure_jump | | >4[2]
> 2 charset | |
> 3 jump | |
> 1 on_failure_jump | | >4[3]
> 2 charset | | F4[3]
> 4 endline | | SUCCESS
> svg%
>
> it really prefers to do this
>
> svg% ruby -rjj -e '" ".match(/[^\S]/)'

Isn't this the same as /\s/? I mean /[^\S]/ means not not whitespace,
doesn't it?

> Regexp /[^\S]/
> 0 charset_not \000-\010\016-\037!-\377 (0)
> 1 end
> Fastmap supplied : \011-\015
>
> String << >> pos=0
>
> 0 charset_not | | SUCCESS
> svg%

Err... Those two regexps you present do not yield an equivalent result.
Or did I miss something? Or did you mean to use /\S/ for the second one?

Regards

robert
 
ts

R> Err... Those two regexps you present do not yield an equivalent result.
R> Or did I miss something? Or did you mean to use /\S/ for the second one?

In this case the result is not important : this is just to show that
internally it can make something completely different.


Guy Decoux
 
