Whitespace removal in html generated by cgi

Gregory Toomey

A few weeks ago a question was asked in this group about removing whitespace from html, in particular from html generated by cgi.
Here's a simple technique I developed for Linux:


1. A sample cgi. Bash uses the <<'delimiter' construct to pass the input verbatim to Perl. The output of the cgi is piped to delspace.pl, our whitespace munger.

#!/bin/bash
/usr/bin/perl <<'EOFPERL' | ./delspace.pl
#your cgi goes here
use strict;
$|++;
print "Content-type:text/html\n\n";
print " <h1> This is a test </h1> \n";
print " some more text\n";

EOFPERL


2. Now here's delspace.pl, the whitespace remover. It may be a little buggy, but it seems to work for my simple html.

#!/usr/bin/perl
my $count=0;
while(<>){
# remove leading whitespace
s/^\s+//;

# remove trailing whitespace
s/\s+$//;

# change internal whitespace to single space
s/\s+/ /g;

# remove simple one line comments
s/<!--.*?-->//;

# another simple whitespace removal
s/> </></g;

#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
}



gtoomey
 
Ben Morrow

[please limit your line lengths to 72 characters]
[please make sure your blank lines are *actually* blank]

Gregory Toomey said:
> A few weeks ago a question was asked in this group about removing
> whitespace from html, in particular from html generated by cgi.
> Here's a simple technique I developed for Linux:
>
> 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
> input verbatim to Perl. The output of the cgi is piped to
> delspace.pl, our whitespace munger.
>
> #!/bin/bash

There is absolutely no need to use bash. If nothing better, use the
techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
thing without superfluous whitespace in the first place.
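
The tied-filehandle suggestion could look roughly like the sketch below. It is not code from the thread: the package name `SqueezeOut` is hypothetical, and the filtering rule (collapse runs of spaces and tabs, leave newlines alone) is deliberately crude and only there to show where a filter would plug in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tie class: everything printed to the tied handle is
# filtered and then forwarded to the real handle given at tie time.
package SqueezeOut;

sub TIEHANDLE {
    my ($class, $fh) = @_;
    return bless { fh => $fh }, $class;
}

sub PRINT {
    my ($self, @args) = @_;
    my $text = join '', @args;
    $text =~ s/[ \t]+/ /g;    # collapse runs of spaces/tabs; newlines untouched
    return print { $self->{fh} } $text;
}

sub PRINTF {
    my ($self, $format, @args) = @_;
    return $self->PRINT(sprintf $format, @args);
}

package main;

# Keep a duplicate of the real STDOUT, then tie STDOUT to the filter.
open my $real_stdout, '>&', \*STDOUT or die "Cannot dup STDOUT: $!";
tie *STDOUT, 'SqueezeOut', $real_stdout;

# The existing cgi body would follow here unchanged.
print "Content-type: text/html\n\n";
print "   <h1>   This is a test   </h1>\n";

untie *STDOUT;
```

With this approach the cgi stays a single Perl process, and no bash wrapper or pipe is involved.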

> 2. Now here's delspace.pl, the whitespace remover. It may be a
> little buggy, but it seems to work for my simple html.
>
> #!/usr/bin/perl
> my $count=0;
> while(<>){
> # remove leading whitespace
> s/^\s+//;
>
> # remove trailing whitespace
> s/\s+$//;
>
> # change internal whitespace to single space
> s/\s+/ /g;
>
> # remove simple one line comments
> s/<!--.*?-->//;
>
> # another simple whitespace removal
> s/> </></g;

You realise this changes the presentation of the HTML?

> #newlines are not needed
> #except for Content-type-text/html\n\n
> # which occurs at the start
> print;
> print "\n" if $count++<4;

Why 4?

'A little buggy'? The whole idea's fundamentally flawed: you need to
start by separating the HTTP from the HTML from the data, which means
using an HTML parsing module. For instance, what about this:

<link
rel=stylesheet
type="text/css"
href="..."/>

Or this:

Status: 302 Found
Location: ...
Content-encoding: ...
Content-type: text/html
Content-length: ...

<html>...

Or this:

<pre>
#!/usr/bin/perl

use warnings;
use strict;

print "Hello world\n";
</pre>

Ben
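
One way to address the header objection (and the "Why 4?" question) without a full HTML parser is to split the output at the first blank line instead of counting lines. The following is a minimal sketch under that assumption; the sub names `split_cgi_output`, `content_type`, `squeeze_body` and `filter_cgi_output` are hypothetical, and the body filter still does nothing to protect <pre> sections.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split raw CGI output at the first blank line, which ends the header
# block however many header lines there are.
sub split_cgi_output {
    my ($raw) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $raw, 2;
    return (defined $headers ? $headers : '', defined $body ? $body : '');
}

# Return the Content-type value from a header block, lowercased and
# without parameters, or '' if the header is absent.
sub content_type {
    my ($headers) = @_;
    return $headers =~ /^Content-type:\s*([^;\s]+)/mi ? lc $1 : '';
}

# Crude body filter: trim each line, drop empty lines, and keep one
# newline per remaining line so adjacent words and attributes stay apart.
sub squeeze_body {
    my ($body) = @_;
    my @lines = grep { length } map { s/^\s+//; s/\s+$//; $_ } split /\n/, $body;
    return join("\n", @lines) . "\n";
}

# Headers pass through untouched; the body is filtered only for text/html.
sub filter_cgi_output {
    my ($raw) = @_;
    my ($headers, $body) = split_cgi_output($raw);
    $body = squeeze_body($body) if content_type($headers) eq 'text/html';
    return "$headers\n\n$body";
}

my $raw = "Status: 302 Found\nLocation: ...\nContent-type: text/html\n\n"
        . "  <html>  \n\n   <p>hello</p>\n";
print filter_cgi_output($raw);
```

Because the body is only touched when the Content-type is text/html, output such as image data passes through unchanged.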
 
Gregory Toomey

It was a dark and stormy night, and Ben Morrow managed to scribble:
> [please limit your line lengths to 72 characters]
> [please make sure your blank lines are *actually* blank]
>
> Gregory Toomey said:
>> A few weeks ago a question was asked in this group about removing
>> whitespace from html, in particular from html generated by cgi.
>> Here's a simple technique I developed for Linux:
>>
>> 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
>> input verbatim to Perl. The output of the cgi is piped to
>> delspace.pl, our whitespace munger.
>>
>> #!/bin/bash
>
> There is absolutely no need to use bash. If nothing better, use the
> techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
> a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
> thing without superfluous whitespace in the first place.

The technique I described lets you take an existing cgi and change two lines at the top and one at the bottom.
What you described will work, but it's more complicated.


> You realise this changes the presentation of the HTML?

> Why 4?

> 'A little buggy'? The whole idea's fundamentally flawed: you need to
> start by separating the HTTP from the HTML from the data, which means
> using an HTML parsing module. For instance, what about this:

It worked with all the cgis I've created.
It's just a simple, pragmatic way to solve a real-world problem.


gtoomey
 
Eric J. Roode

> A few weeks ago a question was asked in this group about removing
> whitespace from html, in particular from html generated by cgi. Here's
> a simple technique I developed for Linux:

What is the goal of this? Reducing the amount of data that is
transmitted to the client browser? If so, you would probably be better
off compressing the output with gzip -- all major browsers support gzip
compressed data.

> [...]
> #newlines are not needed
> #except for Content-type-text/html\n\n
> # which occurs at the start
> print;
> print "\n" if $count++<4;

Newlines are needed in <pre>...</pre> sections, and sometimes in
<textarea>...</textarea> sections.
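
A line-oriented filter can at least avoid the <pre> and <textarea> cases by tracking whether it is inside one of those elements. This is a rough sketch only: the sub name `squeeze_html_lines` is hypothetical, and it assumes the elements are not nested and that their tags can be spotted with simple regexes, which a real HTML parser would not need to assume.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Squeeze whitespace line by line, but pass lines through untouched
# while inside <pre> or <textarea>.
sub squeeze_html_lines {
    my (@lines) = @_;
    my $verbatim = 0;    # true while inside <pre> or <textarea>
    my @out;
    for my $line (@lines) {
        my $opens  = $line =~ /<(?:pre|textarea)\b/i;
        my $closes = $line =~ m{</(?:pre|textarea)\s*>}i;
        if ($verbatim || $opens) {
            push @out, $line;          # keep the line exactly as written
            $verbatim = 1 if $opens;
            $verbatim = 0 if $closes;
            next;
        }
        (my $copy = $line) =~ s/^\s+|\s+$//g;   # trim both ends
        $copy =~ s/\s+/ /g;                     # collapse internal runs
        push @out, "$copy\n" if length $copy;   # keep one newline per line
    }
    return @out;
}

print squeeze_html_lines(
    "   <p>  some   text </p>  \n",
    "<pre>\n",
    "  print \"Hello world\\n\";\n",
    "</pre>\n",
);
```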

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Jeff 'japhy' Pinyan

> Newlines are needed in <pre>...</pre> sections, and sometimes in
> <textarea>...</textarea> sections.

Not to mention that, although most HTML renders multiple whitespace as a
SINGLE space, a SINGLE newline IS needed, because the browser will render
it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
string like "foo \n bar" is also just rendered as "foo bar".
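
The difference is easy to demonstrate. In the sketch below, `join_dropping_newlines` imitates what delspace.pl does once its first few lines have passed (the newline is stripped and nothing replaces it), while `collapse_whitespace` turns every whitespace run, newlines included, into one space. Both sub names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Imitates delspace.pl after its first few lines: each line is trimmed,
# the newline goes with the trailing whitespace, and nothing replaces it.
sub join_dropping_newlines {
    my ($html) = @_;
    return join '', map { s/^\s+//; s/\s+$//; $_ } split /\n/, $html;
}

# Safer variant: any run of whitespace, newlines included, becomes one
# space, which is how a browser treats it in normal flow text.
sub collapse_whitespace {
    my ($html) = @_;
    (my $out = $html) =~ s/\s+/ /g;
    $out =~ s/^ | $//g;    # drop the single space left at either end
    return $out;
}

my $html = "foo\nbar  \n   baz";
print join_dropping_newlines($html), "\n";
print collapse_whitespace($html), "\n";
```

The first line printed is `foobarbaz`, with the words run together; the second is `foo bar baz`.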
 
Gregory Toomey

It was a dark and stormy night, and Eric J. Roode managed to scribble:
> What is the goal of this? Reducing the amount of data that is
> transmitted to the client browser?

Yes.

> If so, you would probably be better
> off compressing the output with gzip -- all major browsers support gzip
> compressed data.

Yes I use Apache with gzip so that's another level of compression.

People hate waiting for pages to load, especially those on dialup.

> [...]
> #newlines are not needed
> #except for Content-type-text/html\n\n
> # which occurs at the start
> print;
> print "\n" if $count++<4;
>
> Newlines are needed in <pre>...</pre> sections, and sometimes in
> <textarea>...</textarea> sections.

 
Eric J. Roode

> Not to mention that, although most HTML renders multiple whitespace as a
> SINGLE space, a SINGLE newline IS needed, because the browser will render
> it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
> string like "foo \n bar" is also just rendered as "foo bar".

Ooh, good point.

--
Eric
 
Eric J. Roode

> People hate waiting for pages to load, especially those on dialup.

Have you verified that the extra time your CGI scripts take to execute is
less than the transfer time of the spaces you are eliminating? :)

--
Eric
 
Gregory Toomey

It was a dark and stormy night, and Eric J. Roode managed to scribble:
> Have you verified that the extra time your CGI scripts take to execute is
> less than the transfer time of the spaces you are eliminating? :)

The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
Running the script to remove whitespace takes under 1 second for 1000 lines of HTML, and does not increase the load to any discernible extent.

The database-driven cgi I use is disk IO bound, not CPU bound.

gtoomey
 
Gregory Toomey

It was a dark and stormy night, and Eric J. Roode managed to scribble:
> Ooh, good point.


I tried it on a dozen cgis and it worked.

To make this foolproof you need to write an HTML parser - this is left as an exercise for the reader!

gtoomey
 
Chris Mattern

Gregory said:
> It was a dark and stormy night, and Eric J. Roode managed to scribble:
>
> The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
> Running the script to remove whitespace takes under 1 second for 1000 lines of HTML,
> and does not increase the load to any discernible extent.
>
> The database-driven cgi I use is disk IO bound, not CPU bound.

Which doesn't answer the question. The question isn't "Are you overloading the
server?", the question is "Are your users waiting longer for you to remove the
whitespace than they would wait for the whitespace to download?" Assuming there
are ten bytes of removable whitespace per line (which would be rather a lot),
then the whitespace in 1000 lines takes less than two seconds to download on
a 56K modem. It would take a small fraction of a second with broadband. It
scarcely seems worth the effort.

Chris Mattern
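
The estimate above can be checked with a few lines of arithmetic. This sketch ignores protocol overhead, modem compression and latency; the 1 Mbit/s figure standing in for "broadband" is an assumption, not a number from the thread, and the sub name `transfer_seconds` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Seconds needed to move $bytes over a link of $bits_per_second.
sub transfer_seconds {
    my ($bytes, $bits_per_second) = @_;
    return $bytes * 8 / $bits_per_second;
}

my $lines          = 1000;
my $bytes_per_line = 10;                        # the generous estimate above
my $whitespace     = $lines * $bytes_per_line;  # 10,000 bytes in total

printf "56K modem:     %.2f s\n", transfer_seconds($whitespace, 56_000);
printf "1 Mbit/s link: %.2f s\n", transfer_seconds($whitespace, 1_000_000);
```

This gives about 1.43 seconds on the modem and 0.08 seconds on the faster link, consistent with "less than two seconds" and "a small fraction of a second".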
 
Louis Erickson

: It was a dark and stormy night, and Eric J. Roode managed to scribble:
:> What is the goal of this? Reducing the amount of data that is
:> transmitted to the client browser?

: Yes.

:>If so, you would probably be better
:> off compressing the output with gzip -- all major browsers support gzip
:> compressed data.

: Yes I use Apache with gzip so that's another level of compression.

If you're gzipping the output stream, then the removal of spaces isn't likely
to change your transmission size significantly, if at all. The compressor
will flatten them right out, without risking the content of the HTML.
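
A rough way to see this is to gzip the same page with and without its indentation and compare sizes. The sketch below uses the core IO::Compress::Gzip module on a made-up page of 1000 indented table rows; the sample data and the sub name `gzipped_size` are assumptions for illustration, and real pages will compress differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Return the gzip-compressed size of a string in bytes.
sub gzipped_size {
    my ($data) = @_;
    my $compressed;
    gzip(\$data => \$compressed) or die "gzip failed: $GzipError";
    return length $compressed;
}

# Made-up sample page: 1000 table rows, each indented by ten spaces.
my $padded = join '', map { (' ' x 10) . "<tr><td>row $_</td></tr>\n" } 1 .. 1000;

# Same page with the leading spaces removed from every line.
(my $trimmed = $padded) =~ s/^ +//mg;

printf "raw:     %6d -> %6d bytes\n", length($padded), length($trimmed);
printf "gzipped: %6d -> %6d bytes\n", gzipped_size($padded), gzipped_size($trimmed);
```

Stripping the indentation saves 10,000 bytes uncompressed; the saving after gzip should be much smaller, because the compressor already encodes repeated runs of spaces cheaply.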

Also note that if you have a CGI that sends back something besides HTML,
such as image or sound data, this will completely screw it up.

--
Louis Erickson - (e-mail address removed) - http://www.rdwarf.com/~wwonko/

Andrea: Unhappy the land that has no heroes.
Galileo: No, unhappy the land that needs heroes.
-- Bertolt Brecht, "Life of Galileo"
 
