Reading huge text files one line at a time....

Brock Heinz

Hello All,

I've done quite a bit of research on this one and I'm still stumped.
I have an application that reads a text file (up to 100MB in size) one
line at a time, converts the line to XML using Castor (each line is a
specific record) and then sends a JMS message for that line. After
validating the file one line at a time (never reading the entire
contents into memory), I am then confident I can perform the Castor
transformation / send operation. I'm doing something like the
following:

BufferedReader reader = new BufferedReader(new FileReader(validFile));
//for each line in the file
for (String line; (line = reader.readLine()) != null;) {
    //perform transformation and send
    IMessage message = transformer.createMessage(line, msgSelector);
    sendMessage(message);
    messageSentCount++;
    //perform cleanup / logging every 500th message
    if (messageSentCount % 500 == 0) {
        log.debug("sent message: "+messageSentCount);
        log.debug(" - Garbage collecting.");
        try {
            this.finalize();
        } catch (Throwable t) {
            log.warn("Could not finalize - keep on reading anyhow");
        }
    }
}
reader.close();


Does anyone see any problems with reading the files one line at a time
in this manner (using the readLine() method)? I seem to hit an
OutofMemoryException right around line 315,000. Is the readLine()
method internally not efficient to use?

In the archives I've seen the approach of reading chunks of the file
with a buffer, and then determining each line by searching for carriage
returns or line breaks. Anyone have any thoughts on this?

Any help would be greatly appreciated.

Thanks,
Brock
 
thirdrock

Brock said:
Hello All,
BufferedReader reader = new BufferedReader(new FileReader(validFile));
//for each line in the file
for (String line; (line = reader.readLine()) != null;) {
    //perform transformation and send
    IMessage message = transformer.createMessage(line, msgSelector);

What object type is transformer?

    sendMessage(message);
    messageSentCount++;
    //perform cleanup / logging every 500th message
    if (messageSentCount % 500 == 0) {
        log.debug("sent message: "+messageSentCount);
        log.debug(" - Garbage collecting.");
        try {
            this.finalize();

What is this?
Where is 'message' garbage collected?

        } catch (Throwable t) {
            log.warn("Could not finalize - keep on reading anyhow");
        }
    }
}
reader.close();


Does anyone see any problems with reading the files one line at a time
in this manner (using the readLine() method)? I seem to hit an
OutofMemoryException right around line 315,000.

That would tend to indicate that you are running out of memory.

Is the readLine()
method internally not efficient to use?

What makes you think it is the readLine() method that is sucking up all
of the memory?

In the archives I've seen the approach of reading chunks of the file
with a buffer, and then determining each line by searching for carriage
returns or line breaks.

That will only help once you have determined that readLine() is the
cause of the problem.

Ian
 
EricF

I don't think the problem is with readLine(). You have a memory leak.

Is the finalize call really doing anything?

Try setting any variables to null when you are through with them at the end of
the for loop. Particularly message.

Eric
 
Boudewijn Dijkstra

Brock Heinz said:
Hello All,

I've done quite a bit of research on this one and I'm still stumped.
I have an application that reads a text file (up to 100MB in size) one
line at a time, converts the line to XML using Castor (each line is a
specific record) and then sends a JMS message for that line. After
validating the file one line at a time (never reading the entire
contents into memory), I am then confident I can perform the Castor
transformation / send operation. I'm doing something like the
following:

BufferedReader reader = new BufferedReader(new FileReader(validFile));
//for each line in the file
for (String line; (line = reader.readLine()) != null;) {
    //perform transformation and send
    IMessage message = transformer.createMessage(line, msgSelector);
    sendMessage(message);
    messageSentCount++;
    //perform cleanup / logging every 500th message
    if (messageSentCount % 500 == 0) {
        log.debug("sent message: "+messageSentCount);
        log.debug(" - Garbage collecting.");
        try {
            this.finalize();
        } catch (Throwable t) {
            log.warn("Could not finalize - keep on reading anyhow");
        }
    }
}
reader.close();

What happens with the IMessage object after it is sent?
 
John C. Bollinger

Brock said:
I've done quite a bit of research on this one and I'm still stumped.
I have an application that reads a text file (up to 100MB in size) one
line at a time, converts the line to XML using Castor (each line is a
specific record) and then sends a JMS message for that line. After
validating the file one line at a time (never reading the entire
contents into memory), I am then confident I can perform the Castor
transformation / send operation. I'm doing something like the
following:

I'm not much interested in analyzing "something like" what you're doing,
as there is a reasonably good chance that the ways it differs from what
you are *actually* doing include the source of your problem. Post a
compilable example that exhibits the (mis-)behavior that is troubling you.

BufferedReader reader = new BufferedReader(new FileReader(validFile));
//for each line in the file
for (String line; (line = reader.readLine()) != null;) {
    //perform transformation and send
    IMessage message = transformer.createMessage(line, msgSelector);
    sendMessage(message);
    messageSentCount++;
    //perform cleanup / logging every 500th message
    if (messageSentCount % 500 == 0) {
        log.debug("sent message: "+messageSentCount);
        log.debug(" - Garbage collecting.");
        try {
            this.finalize();

Even though I'm not very keen to analyze your code, I can't help
commenting on this. You should _never_ invoke an object's finalize()
method from user code. It is for the use of the GC. If you have
cleanup code that you want to execute periodically then put it in its
own method; it is OK for finalize() to invoke such a method, if need be.
(It is better, however, to not rely on the finalizer for anything.)
At best, putting such code into finalize() is potentially confusing.
Overriding finalize() at all has an effect on GC of instances of the
relevant class, although how serious the implications are will depend on
a wide variety of factors.
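
To illustrate the point about finalize(), here is a minimal sketch of periodic cleanup kept in an ordinary method. It assumes a hypothetical BatchSender class with made-up names (send, releaseBatchResources); it is not the poster's actual code, and the "pending" list only stands in for whatever per-message state a real sender might hold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the poster's sender class; all names are illustrative.
final class BatchSender {
    private final List<String> pending = new ArrayList<>();
    private int sentCount = 0;
    private int cleanupRuns = 0;

    // "Send" one record; after every batchSize records, run explicit cleanup.
    void send(String record, int batchSize) {
        pending.add(record);   // pretend this is per-message state we hold on to
        sentCount++;
        if (sentCount % batchSize == 0) {
            releaseBatchResources();
        }
    }

    // Ordinary cleanup method: called directly by our own code.
    // Nothing here overrides or invokes finalize().
    void releaseBatchResources() {
        pending.clear();
        cleanupRuns++;
    }

    int pendingCount() { return pending.size(); }

    int cleanupRuns() { return cleanupRuns; }
}
```

With a batch size of 500, the cleanup runs at the same points where the original loop called this.finalize(), but it is now a normal method with a name that says what it does.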

        } catch (Throwable t) {
            log.warn("Could not finalize - keep on reading anyhow");
        }

And I have to comment on that, too. It's almost never a good idea to
write such generic catch blocks. That will catch all manner of checked
and unchecked Exceptions, as well as all Errors, and ignore them. At
the very, very least you should log the Throwable's message. Much
better, however, is to only catch the specific exceptions that you have
reason to expect may be thrown. You can be reasonably confident that
you know how to handle those appropriately, but you have no reason for
confidence that you know how to handle any other Throwable.
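
A minimal sketch of that advice, under assumptions: transform() is a hypothetical stand-in for the real transformation step, and IllegalArgumentException is assumed to be the one failure we expect and know how to handle (skip the record and log it). Anything else, including Errors, is left to propagate.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; names are illustrative only.
final class SpecificCatchDemo {
    private static final Logger LOG =
            Logger.getLogger(SpecificCatchDemo.class.getName());

    // Hypothetical transformation step: rejects blank records.
    static String transform(String line) {
        if (line.trim().isEmpty()) {
            throw new IllegalArgumentException("blank record");
        }
        return line.trim();
    }

    // Returns the number of records transformed successfully.
    static int processAll(List<String> lines) {
        int ok = 0;
        for (String line : lines) {
            try {
                transform(line);
                ok++;
            } catch (IllegalArgumentException e) {
                // Expected failure: log the message and the exception, then keep going.
                LOG.log(Level.WARNING, "Skipping bad record: " + e.getMessage(), e);
            }
            // No catch (Throwable): unexpected problems are not swallowed here.
        }
        return ok;
    }
}
```

The loop still continues past a bad record, which is what the original catch block was after, but an unexpected failure such as a NullPointerException or an OutOfMemoryError surfaces instead of being hidden.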

    }
}
reader.close();


Does anyone see any problems with reading the files one line at a time
in this manner (using the readLine() method)? I seem to hit an
OutofMemoryException right around line 315,000. Is the readLine()
method internally not efficient to use?

That would be an OutOfMemoryError. If you are getting one then it
probably means that your program is caching objects (messages, strings,
something) somehow. It might, however, mean that your input is corrupt,
and at some point contains a very long sequence of bytes without a line
delimiter -- the system could be trying to construct a multi-megabyte
String object or JMS message.
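
One way to test the corrupt-input theory is to cap the line length while reading, so a missing delimiter fails fast instead of building a huge String. The sketch below is an assumption-laden illustration, not part of the poster's code: BoundedLineReader and its maxLineLength parameter are hypothetical names, and it follows the same line-terminator rules as BufferedReader.readLine() ("\n", "\r" or "\r\n").

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper; wrap the source in a BufferedReader before passing it in,
// because this class reads one character at a time.
final class BoundedLineReader {
    private final Reader in;
    private final int maxLineLength;
    private boolean skipLf = false; // true when the previous line ended with '\r'

    BoundedLineReader(Reader in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    // Returns the next line without its terminator, or null at end of input.
    // Throws IOException if a line grows past maxLineLength characters.
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int c;
        while ((c = in.read()) != -1) {
            if (skipLf) {
                skipLf = false;
                if (c == '\n') {
                    continue; // second half of a "\r\n" pair
                }
            }
            sawAny = true;
            if (c == '\n') {
                return sb.toString();
            }
            if (c == '\r') {
                skipLf = true;
                return sb.toString();
            }
            if (sb.length() >= maxLineLength) {
                throw new IOException("Line exceeds " + maxLineLength + " characters");
            }
            sb.append((char) c);
        }
        return sawAny ? sb.toString() : null;
    }
}
```

If the limit is set comfortably above the longest legitimate record, an IOException from this reader points at bad input rather than at a leak elsewhere.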

In the archives I've seen the approach of reading chunks of the file
with a buffer, and then determining each line by searching for carriage
returns or line breaks. Anyone have any thoughts on this?

Your BufferedReader does that for you already.
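
For reference, a minimal sketch of the streaming pattern: BufferedReader fills its internal buffer in chunks and readLine() hands out one line at a time, so only the current line needs to be reachable. LineStreamer and forEachLine are hypothetical names, and the sketch assumes a newer Java than this thread's era (try-with-resources needs Java 7, Consumer needs Java 8).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper; names are illustrative only.
final class LineStreamer {
    // Passes each line of the source to the handler and returns the line count.
    // The reader is closed when the loop finishes or an exception is thrown.
    static long forEachLine(Reader source, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line); // the handler must not retain lines it no longer needs
                count++;
            }
        }
        return count;
    }
}
```

A call such as LineStreamer.forEachLine(new FileReader(validFile), line -> ...) would replace the hand-written loop; memory use then depends on what the handler keeps, not on the reader.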


John Bollinger
 
Brock Heinz

Boudewijn Dijkstra said:
log.warn("Could not finalize - keep on reading anyhow");


What happens with the IMessage object after it is sent?

The message is set to null in the sendMessage() method.

My initial thought was that readLine() was inefficient compared to
other I/O strategies, but after running the same test without sending
any messages it appears that reading is not the source of my memory
woes... I'll keep digging, and if I turn up anything interesting and
worth posting, I'll share it here.

Brock
 
Ann

My initial thought was that readLine() was inefficient compared to
other I/O strategies, but after running the same test without sending
any messages it appears as though that is not the source of my memory
woes... I'll keep digging, and if I turn up anything interesting and
worth posting - I'll share it here.

Brock

But since a String is imutible, doesn't Java have to
create a new String for 'line' each time readline() is
executed?
 
Eric Sosman

Ann said:
But since a String is imutible, doesn't Java have to
create a new String for 'line' each time readline() is
executed?

Strings are immutable (note the spelling), but
not immortal. The Strings created by readLine() are
subject to garbage collection when they are no longer
referenced, just like any other objects.
 
Brock Heinz

John C. Bollinger said:
Even though I'm not very keen to analyze your code, I can't help
commenting on this. You should _never_ invoke an object's finalize()
method from user code. It is for the use of the GC. If you have
cleanup code that you want to execute periodically then put it in its
own method; it is OK for finalize() to invoke such a method, if need be.

I had considered this, but since the app is running in a J2EE server,
I wasn't sure what the consequences of calling System.gc() would be.
Really, by programmatically triggering any type of garbage
collection, I am just placing a bandaid over a gash.

(It is better, however, to not rely on the finalizer for anything.)
At best, putting such code into finalize() is potentially confusing.
Overriding finalize() at all has an effect on GC of instances of the
relevant class, although how serious the implications are will depend on
a wide variety of factors.


And I have to comment on that, too. It's almost never a good idea to
write such generic catch blocks.

I agree, but the finalize() method throws 'Throwable' :)

This is an instance where, regardless of any exceptions thrown while
trying to 'finalize', I wanted to stay within the for block and
continue to process the messages.

That will catch all manner of checked
and unchecked Exceptions, as well as all Errors, and ignore them. At
the very, very least you should log the Throwable's message. Much
better, however, is to only catch the specific exceptions that you have
reason to expect may be thrown.

Again, I agree. I didn't send you the entire method. The try/catch
block that I had pasted into my post was nested in a larger try/catch
where I would catch specific exceptions and I could react accordingly.

You can be reasonably confident that
you know how to handle those appropriately, but you have no reason for
confidence that you know how to handle any other Throwable.


That would be an OutOfMemoryError. If you are getting one then it
probably means that your program is caching objects (messages, strings,
something) somehow. It might, however, mean that your input is corrupt,
and at some point contains a very long sequence of bytes without a line
delimiter -- the system could be trying to construct a multi-megabyte
String object or JMS message.

After more researching into the problem, I finally cornered the issue.
The true source of the problem wasn't me validating / parsing the
file. The source of the problem was in the third party messaging
framework we were using.

Your BufferedReader does that for you already.


John Bollinger

Thanks for the feedback, John!

Brock
 
