Question about scoping

B

Bob Dubery

Hi all,

Some extracts from a program that I recently worked on follow.

The program is a "listener" that waits for data coming in on a
specific port. The incoming messages will contain 2 XML records, each
of which must be validated.

This program was exhibiting the characteristics of a memory leak. When
started up it would consume a certain amount of memory, but over time
the memory in use would grow and grow.

Here's the code I found (not the whole program) then I'll show my
changes and ask the actual question...

<code snippets start>
#! /usr/bin/perl

use strict;
use IO::Socket;
use XML::DOM;
use English;
use DBI;
use lib "/usr/local/PostOffice/progs/modules";
use QueuesDatabase;

.......
.......
my $line; # will contain all the input received
my $data; # will contain the current input

# get the input from the client
my $bytesRead = sysread($new_sock, $data, 2048);

# the data received can be more than 2048 characters, so we need to
keep reading if we havn't received the
# end of transmission character
while ($bytesRead)
{
$line = $line . $data;

# if it is the end of transmission, set the variable to 0 to exit
the while loop
if ($line =~ /$/)
{
$bytesRead = 0;
}
# otherwise keep reading from the socket
else
{
$bytesRead = sysread($new_sock, $data, 2048);
}
}

# make sure all the conditions are met
validateFile($line);

......
......


sub validateFile
{
# get the value passed to this subroutine
my $dataString = shift;

# remove the newline character and all other formatting characters
chomp $dataString;
$dataString =~ s/\t//g;
$dataString =~ s/\n//g;
$dataString =~ s/\r//g;
$dataString =~ s/\f//g;

# check for and remove the starting charater for the data transfer
if ($dataString =~ /^/) # character (0x02)
{
$dataString =~ s/^//; # character (0x02)

# check for and remove the ending charater for the data transfer
if ($dataString =~ /$/) # character (0x03)
{
$dataString =~ s/$//; # character (0x03)

# make sure the very first part of the file is a valid xml
header with utf-8 encoding included
unless ($dataString =~ /^<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>/)
{
$errMsg = "Invalid xml header for the routing header.";
}
else
{
# now check for the second xml header for the body file
unless ($dataString =~ /<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>/)
{
$errMsg = "Invalid xml header for the xml body file.";
}
else
{
# separate the xml header and xml body file into 2 different
variables
$dataString =~ s/^(<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>.+)(<\?xml version=\"1\.0\" encoding=\"UTF-8\"\>.+)$/
$1$2/;
my $headerFile = $1;
my $bodyFile = $2;

# parse the xml headerFile to make sure that it is a well-
formed xml file
eval {new XML::DOM::parser->parse($headerFile)}; # create
a new parser object and load the input file

# parser will have died with error message in $EVAL_ERROR if
XML is not well-formed
if($EVAL_ERROR)
{
$errMsg = "XML header file is not well-formed and can't be
processed.";
}
else
{
# parse the xml bodyFile to make sure that it is a well-
formed xml file
eval {new XML::DOM::parser->parse($bodyFile)}; # create
a new parser object and load the input file

# parser will have died with error message in $EVAL_ERROR
if XML is not well-formed
if($EVAL_ERROR)
{
$errMsg = "XML body file is not well-formed and can't be
processed.";
}
}
}
}
}
else
{
$errMsg = "No closing flag found for the data.";
}
}
else
{
$errMsg = "No opening flag found for the data.";
}
}

<code snippets end>

So the program uses "strict" and 'my' is used in all the subroutines.

Now I made the following change....

<modified snippet starts>
{
# separate the xml header and xml body file into 2 different
variables
$dataString =~ s/^(<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>.+)(<\?xml version=\"1\.0\" encoding=\"UTF-8\"\?>.+)$/
$1$2/;
my $headerFile = $1;
my $bodyFile = $2;

# parse the xml headerFile to make sure that it is a well-
formed xml file
my $headerXML = eval {new XML::DOM::parser-
parse($headerFile)}; # create a new parser object and load the
input file

# parser will have died with error message in $EVAL_ERROR if
XML is not well-formed
if($EVAL_ERROR)
{
$errMsg = "XML header file is not well-formed and can't be
processed.";
}
else
{
# parse the xml bodyFile to make sure that it is a well-
formed xml file
my $bodyXML = eval {new XML::DOM::parser-
parse($bodyFile)}; # create a new parser object and load the input
file

# parser will have died with error message in $EVAL_ERROR
if XML is not well-formed
if($EVAL_ERROR)
{
$errMsg = "XML body file is not well-formed and can't be
processed.";
}
else
{
$bodyXML->dispose;
}
$headerXML->dispose;
}
}
<modified snippet ends>

So what I did was to open the XML::DOM::parser objects to variable
names and then call the dispose method once the XML has been parsed to
see if it's well formed. Result, memory usage - except for the brief
period when a document is being parsed - is more or less constant.

So my question is about the scoping in the original code. Within a
subroutine there is a line like
eval {new XML::DOM::parser->parse($headerFile)};

How is that parser object scoped? I imagine that the author of the
code expected the object to dissappear out of memory once the
subroutine was entered.

So why the constant growth in memory usage? I can think of only two
possibilities....

1) eval {new XML::DOM::parser->parse($headerFile)}; results in
somethong that is globally scoped

2) XML::DOM::parser creates a whole lot of other objects/variables in
memory that persist even when the actual XML::DOM::parser object
passes out of scope.

Or is there another reason?

Thanks

Bob
 
J

John W. Krahn

Bob said:
sub validateFile
{
# get the value passed to this subroutine
my $dataString = shift;

# remove the newline character and all other formatting characters
chomp $dataString;
$dataString =~ s/\t//g;
$dataString =~ s/\n//g;
$dataString =~ s/\r//g;
$dataString =~ s/\f//g;

OMG! Replace the previous five lines with:

$dataString =~ tr/\t\n\r\f//d;

# check for and remove the starting charater for the data transfer
if ($dataString =~ /^/) # character (0x02)
{
$dataString =~ s/^//; # character (0x02)

# check for and remove the ending charater for the data transfer
if ($dataString =~ /$/) # character (0x03)
{
$dataString =~ s/$//; # character (0x03)
You don't need to run the same regular expression twice, nor do you really
need the comment:

# check for and remove the starting charater for the data transfer
if ($dataString =~ s/^\x02//)
{
# check for and remove the ending charater for the data transfer
if ($dataString =~ s/\x03$//)
{



John
 
B

Bob Dubery

John's reply was informative, but it did not address my immediate
concern.

However, I think I may have moved on to a state where I remain
confused, but confused at a higher level :)

Considering this....
eval {new XML::DOM::parser->parse($headerFile)};

Is it the case that because neither the parser object nor the results
of the eval are assigned to any kind of variable that either or both
are regarded as barewords?
 
B

Ben Morrow

Quoth "Bob Dubery said:
John's reply was informative, but it did not address my immediate
concern.

However, I think I may have moved on to a state where I remain
confused, but confused at a higher level :)

Considering this....
eval {new XML::DOM::parser->parse($headerFile)};

I'm amazed this syntax works at all. It appears to be parsed as

eval { (new XML::DOM::parser)->parse($headerFile) };

or

eval { XML::DOM::parser->new->parse($headerFile) };

which would be the recommended way to write it. The indirect object
notation ('new Foo' as opposed to 'Foo->new') is considered to have been
a bad idea: it confuses the reader, and under the wrong circumstances it
can confuse Perl.
Is it the case that because neither the parser object nor the results
of the eval are assigned to any kind of variable that either or both
are regarded as barewords?

Err... no. What do you think a 'bareword' is? It's actually an
expression like XML::DOM::parser or STDOUT that Perl parses as a string
even though it isn't quoted: hence 'bare'.

Since you haven't posted a complete program, I can't tell what you are
trying to achieve; but I would guess that you don't want to throw away
the result of the parse. So you want something like

my $doc = eval {
XML::DOM::parser->new->parse($headerFile)
}
or die "parse of '$headerFile' failed: $@";

(Of course, if you are just going to die on error you don't need the
eval: just let XML::DOM::parser die for you. I am assuming you want some
more sophisticated error handling here.)

Ben
 
B

Bob Dubery

I'm amazed this syntax works at all. It appears to be parsed as

eval { (new XML::DOM::parser)->parse($headerFile) };

or

eval { XML::DOM::parser->new->parse($headerFile) };

which would be the recommended way to write it. The indirect object
notation ('new Foo' as opposed to 'Foo->new') is considered to have been
a bad idea: it confuses the reader, and under the wrong circumstances it
can confuse Perl.

Well thanks for that. I will try making that change to this program.
Err... no. What do you think a 'bareword' is? It's actually an
expression like XML::DOM::parser or STDOUT that Perl parses as a string
even though it isn't quoted: hence 'bare'.

Since you haven't posted a complete program, I can't tell what you are
trying to achieve; but I would guess that you don't want to throw away
the result of the parse.

This is a program written by somebody else. It's a "listener". It is
started up and "listens" for messages coming in on a certain port. The
incoming messages each contain two XML documents - one is a routing
header and the second is the actual payload.

I initially thought it had a memory leak as it kept on using more and
more memory until eventually it either fell over or the server just
ran out of resources.

I started trying to figure out why this was happening, and the only
thing that seemed potentially troublesome was the use of the XML
parser.

In fact the results of the parse were NOT being used. The parsing was
only done to check that both XML documents were well-formed. If either
was not well-formed then an error message had to be entered into a
log, and a signal sent back out on the socket indicating a "nack".

My thinking was that by not explicitly assigning the parser objects to
a variable, thus making it impossible to close them, some resources
used by instances of the parser were not being released.

Since I changed the statement that you're commenting on to something
like
my $bodyXML = eval {new XML::DOM::parser- >parse($bodyFile)};

and going on to perform a $bodyXML->dispose()

the listener has run more reliably and memory usage is reduced. It is
still creeping up with time, but much more slowly.

So I'm going to try the alternate notation that you offer and see if
that brings a further improvement.

Thanks

Bob
 
B

Bob Dubery

Well thanks for that. I will try making that change to this program.
Since I changed the statement that you're commenting on to something
like
my $bodyXML = eval {newXML::DOM::parser- >parse($bodyFile)};

and going on to perform a $bodyXML->dispose()

the listener has run more reliably and memory usage is reduced. It is
still creeping up with time, but much more slowly.

So I'm going to try the alternate notation that you offer and see if
that brings a further improvement.

And it DID improve things... but the memory consumed by the program
was still creeping up.

In the end I tried XML::parser::Expat instead. This means some changes
to the code as XML::parser::Expat objects cannot be reused and may
only perform ONE parse. But the leak is fixed.

Prior to this I had tried using XML::parser instead of XML::DOM, but
the problem did not go away.

Is there a memory leak in XML::parser?


The code for the validation routine now looks like this....


sub validateFile
{
# get the value passed to this subroutine
my $dataString = shift;

# eliminate tabs, line feeds etc...
$dataString =~ tr/\t\n\r\f//d;
# eliminate padding between tags...
$dataString =~ s/>\s*</></g;

# create parser objects
# these are expat parsers - cannot be reused
my $hParser = XML::parser::Expat->new(ProtocolEncoding => 'UTF-8');
my $bParser = XML::parser::Expat->new(ProtocolEncoding => 'UTF-8');

# check for and remove the starting charater for the data transfer
if ($dataString =~ /^/) # character (0x02)
{
$dataString =~ s/^//; # character (0x02)

# check for and remove the ending charater for the data transfer
if ($dataString =~ /$/) # character (0x03)
{
$dataString =~ s/$//; # character (0x03)

# make sure the very first part of the file is a valid xml header
with utf-8 encoding included
unless ($dataString =~ /^<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>/)
{
$errMsg = "Invalid xml header for the routing header.";
}
else
{
# now check for the second xml header for the body file
unless ($dataString =~ /<\?xml version=\"1\.0\" encoding=
\"UTF-8\"\?>/)
{
$errMsg = "Invalid xml header for the xml body file.";
}
else
{
# separate the xml header and xml body file into 2 different
variables
$dataString =~ s/^(<\?xml version=\"1\.0\" encoding=\"UTF-8\"\?>.
+)(<\?xml version=\"1\.0\" encoding=\"UTF-8\"\?>.+)$/$1$2/;
my $headerFile = $1;
my $bodyFile = $2;

# parse the xml headerFile to make sure that it is a well-formed
xml file
my $headerOK = eval {$hParser->parse($headerFile)};

# parser will have died and eval failed if XML is not well-formed
unless($headerOK)
{
$errMsg = "XML header file is not well-formed and can't be
processed.";
}
else
{
# parse the xml bodyFile to make sure that it is a well-formed
xml file
my $bodyOK = eval {$bParser->parse($bodyFile)};

# parser will have died with error message in $EVAL_ERROR if XML
is not well-formed
unless($bodyOK)
{
$errMsg = "XML body file is not well-formed and can't be
processed.";
}
}
}
}
}
else
{
$errMsg = "No closing flag found for the data.";
}
}
else
{
$errMsg = "No opening flag found for the data.";
}
$hParser->release;
$bParser->release;
}

Now there is one remaining oddity... even though I have called the
release methods on the parsers, the memory consumed by the parsers
only gets released when the subroutine is executed again and the
statements

my $hParser = XML::parser::Expat->new(ProtocolEncoding => 'UTF-8');
my $bParser = XML::parser::Expat->new(ProtocolEncoding => 'UTF-8');

are executed.

However, this means that the memory usage will not creep. It goes up
when a large incoming message is received, and may stay up for some
time, but the next incoming message will effectively release the
memory resources by instantiating new Expat parsers.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,815
Latest member
treekmostly22

Latest Threads

Top