Hendrik Maryns
Hi all,
I have a little proggie that queries large linguistic corpora. To make
the data searchable, I do some preprocessing on the corpus file. I now
start getting into trouble when those files are big. Big means over 40
MB, which isn’t even that big, come to think of it.
So I am on the lookout for a memory leak; however, I can’t find it. The
preprocessing method basically does the following (suppose inFile and
treeFile are given Files):
    final BufferedReader corpus =
        new BufferedReader(new FileReader(inFile));
    final ObjectOutputStream treeOut = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(treeFile)));
    final int nbTrees = TreebankConverter.parseNegraTrees(corpus, treeOut);
    try {
        treeOut.close();
    } catch (final IOException e) {
        // if it cannot be closed, it wasn’t open
    }
    try {
        corpus.close();
    } catch (final IOException e) {
        // if it cannot be closed, it wasn’t open
    }
parseNegraTrees then does the following: it scans through the input
file, constructs the trees described in it in a text format (NEGRA),
converts those trees to a binary format, and writes them as Java
objects to the treeFile. Each of those trees consists of nodes with a
left daughter, a right daughter and a list of at most five strings.
And those are short strings: words or abbreviations. So this shouldn’t
take too much memory, I would think.
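To give an idea of the shape, a node looks roughly like this (a
simplification; the field names here are mine, not the real ones):

    // Simplified sketch of a tree node; the real class has more to it.
    class BinaryNode implements java.io.Serializable {
        BinaryNode leftDaughter;        // null at the leaves
        BinaryNode rightDaughter;       // null at the leaves
        java.util.List<String> labels;  // at most five short strings
    }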
This is also done one by one:
    TreebankConverter.skipHeader(corpus);
    String bosLine;
    while ((bosLine = corpus.readLine()) != null) {
        final StringTokenizer tokens = new StringTokenizer(bosLine);
        final String treeIdLine = tokens.nextToken();
        if (!treeIdLine.equals("%%")) {
            final String treeId = tokens.nextToken();
            final NodeSet forest = parseSentenceNodes(corpus);
            final Node root = forest.toTree();
            final BinaryNode binRoot = root.toBinaryTree(new ArrayList<Node>(), 0);
            final BinaryTree binTree = new BinaryTree(binRoot, treeId);
            treeOut.writeObject(binTree);
        }
    }
I see no reason in the above code why the GC wouldn’t discard the trees
constructed in earlier iterations.
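One thing I am not sure about: as far as I understand, an
ObjectOutputStream keeps a reference to every object it has written, so
that a later write of the same object can be encoded as a
back-reference. If that is right, the stream itself would pin all the
trees in memory until it is reset or closed. So maybe something like
this after each tree would help (untested):

    treeOut.writeObject(binTree);
    // Untested idea: clear the stream’s internal reference table so
    // that the trees written so far become eligible for collection.
    treeOut.reset();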
So the only place I see for memory problems here is the file access.
However, as I gather from the Javadocs, both FileReader and
FileOutputStream are indeed streams, which do not have to remember what
came before. Is the buffering the problem, maybe?
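For completeness, this is the kind of instrumentation I could drop into
the while loop to watch the heap grow (nbSeen is a counter I would
declare before the loop; it is not in the real code):

    // Hypothetical instrumentation inside the while loop above.
    nbSeen++;
    if (nbSeen % 1000 == 0) {
        final Runtime rt = Runtime.getRuntime();
        final long usedMB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.err.println(nbSeen + " trees written, ~" + usedMB + " MB in use");
    }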
TIA, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html