Une bévue
I have a regexp:
SCRIPT_RE=Regexp.new('<script[^>]*>((.|\n)(?!/script))*</script>',
Regexp::EXTENDED, 'N')
which is supposed to strip out everything inside:
<script ...>(part suppressed...)</script>
It works well for some HTML files but crashes on others with the following
error message:
RegexpError: Stack overflow in regexp matcher:
/<script[^>]*>((.|\n)(?!\/script))*<\/script>/xn
method gsub
in check_files.rb at line 38
method stripHTML
in check_files.rb at line 38
[...]
Line 38 is:
self.gsub(SCRIPT_RE, '').gsub(TAGS_RE, '').gsub(/\s+/, ' ').gsub(NBSP_RE, '')
with:
SCRIPT_RE=Regexp.new('<script[^>]*>((.|\n)(?!/script))*</script>',
Regexp::EXTENDED, 'N')
What I want to do:
strip out the entire contents of scripts and all the HTML tags with their
attributes. I also still have to add stripping out any CSS declarations (not
done yet).
The program fails on a file containing the following script sections:
<script type="text/javascript"
src="Mac-roman-utf-8_fichiers/wikibits.js"><!-- wikibits js --></script>
<script type="text/javascript"
src="Mac-roman-utf-8_fichiers/index.php"><!-- site js --></script>
<style type="text/css">/*<![CDATA[*/
@import
"/w/index.php?title=MediaWiki:Common.css&action=raw&ctype=text/css&smaxa
ge=2678400";
@import
"/w/index.php?title=MediaWiki:Monobook.css&action=raw&ctype=text/css&sma
xage=2678400";
@import "/w/index.php?title=-&action=raw&gen=css&maxage=2678400";
/*]]>*/</style></head><body class="ns-0 ltr">
and which also has some scripts inside divs in the body:
<script type="text/javascript"> if (window.isMSIE55) fixalpha();
</script>
[...]
<script type="text/javascript"> if (window.runOnloadHook)
runOnloadHook();</script>
I can't use tidy for this, because the reason for stripping out every
kind of HTML and keeping only the text is to help another program
detect the encoding of the file.
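
For reference, here is a minimal sketch of the stripping described above. It assumes the stack overflow comes from the group ((.|\n)(?!/script))* being repeated once per character, and swaps it for a lazy .*? under the /m flag (in Ruby, /m lets . match newlines). STYLE_RE and strip_html are hypothetical names, and the TAGS_RE and NBSP_RE shown here are stand-ins, since their real definitions do not appear in this post.

```ruby
# Lazy match up to the first closing tag; /m lets "." match newlines,
# /i accepts <SCRIPT> as well as <script>.
SCRIPT_RE = %r{<script\b[^>]*>.*?</script>}mi

# Hypothetical: the same idea for CSS blocks (the part "not done yet").
STYLE_RE = %r{<style\b[^>]*>.*?</style>}mi

# Stand-ins only: the original TAGS_RE and NBSP_RE are not shown in the post.
TAGS_RE = /<[^>]+>/
NBSP_RE = /&nbsp;/

# Hypothetical helper mirroring the gsub chain on line 38.
def strip_html(html)
  html.gsub(SCRIPT_RE, '')   # drop scripts together with their contents
      .gsub(STYLE_RE, '')    # drop style blocks together with their contents
      .gsub(TAGS_RE, '')     # drop the remaining tags and their attributes
      .gsub(/\s+/, ' ')      # collapse runs of whitespace
      .gsub(NBSP_RE, '')     # drop non-breaking space entities
      .strip
end
```

Called on a string holding the file contents, strip_html(html) should return only the text. This is a sketch under the assumptions above, not a full HTML parser: a "</script>" appearing inside a JavaScript string literal, for instance, would end the match early.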