OK, so I did some tests. The results are as follows (for a part of my
data file):
1-A) Just to read lines:
while ((line = in.readLine()) != null);
takes 1.9 sec
1-B) readLine() + pattern.split(line) takes 7.0 sec
2) Just tokens (which does roughly what 1-A and 1-B do together):
while ((st.nextToken()) != StreamTokenizer.TT_EOF);
takes 6.6 sec (these loops are sketched just below)
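In case it helps, the loops behind 1-A/1-B and 2 boil down to roughly
the following ("data.txt" and the variable names are just placeholders
for my setup):
<code>
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;

public class ReadLoops {
    public static void main(String[] args) throws IOException {
        // 1-A / 1-B: read line by line, optionally split on whitespace
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\\s+"); // drop this line for 1-A
            // parsing (Integer.parseInt / Double.parseDouble) would go here
        }
        in.close();

        // 2: StreamTokenizer pulls tokens straight off the stream
        StreamTokenizer st = new StreamTokenizer(
                new BufferedReader(new FileReader("data.txt")));
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            // st.nval / st.sval hold the current token
        }
    }
}
</code>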
When I add parsing, e.g. Integer.parseInt() and Double.parseDouble(),
in both cases I end up at around 10 sec. Yes, apparently I have to do
the parsing myself in the StreamTokenizer case too: my input contains
strings with digits (like "Johny17") which otherwise get parsed into
two distinct tokens. So I had to switch off number parsing within
StreamTokenizer and do it on my own.
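For the record, there is no single switch for that; the usual recipe
(and roughly what I did, though not my exact code) is to make the
digit characters "ordinary" again and then declare them word
characters, so everything comes back as a plain TT_WORD string to
parse by hand:
<code>
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class NoNumberParsing {
    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("Johny17 42 -3.5");
        StreamTokenizer st = new StreamTokenizer(reader);
        // undo the default "digits start a number" syntax ...
        st.ordinaryChars('0', '9');
        st.ordinaryChar('-');
        st.ordinaryChar('.');
        // ... and fold those characters back into words instead
        st.wordChars('0', '9');
        st.wordChars('-', '-');
        st.wordChars('.', '.');
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                System.out.println(st.sval); // Johny17, 42, -3.5 as strings
                // numeric fields are then parsed by hand,
                // e.g. Double.parseDouble(st.sval)
            }
        }
    }
}
</code>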
Some of you have suggested that I could gain some speed by (both
tweaks are sketched below):
A) increasing the buffer size: yes, around a 10% effect
B) changing from split("\\s+") to a compiled pattern: this has almost
no effect.
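For completeness, the two tweaks look like this (64 * 1024 is an
arbitrary example size in chars; BufferedReader's default is 8192):
<code>
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class Tweaked {
    // B) compile the whitespace pattern once and reuse it
    private static final Pattern WS = Pattern.compile("\\s+");

    public static void main(String[] args) throws IOException {
        // A) ask BufferedReader for a larger buffer than the default
        BufferedReader in = new BufferedReader(
                new FileReader("data.txt"), 64 * 1024);
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = WS.split(line);
        }
        in.close();
    }
}
</code>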
Indeed, compiling such a short pattern has minimal benefit, but Eric
Sosman's hand-rolled parser suggestion may be worth the effort. I
liked Daniel Pitts' StreamTokenizer idea well enough to try it; it
might be better suited when the goal is building a Double array:
<console>
Warmup: 30
Size: 5
RegEx: 19
Compiled: 3
Parse: 5
Token: 24
Size: 50
RegEx: 28
Compiled: 29
Parse: 14
Token: 61
Size: 500
RegEx: 280
Compiled: 276
Parse: 139
Token: 591
Size: 5000
RegEx: 3042
Compiled: 3007
Parse: 2038
Token: 8000
</console>
<code>
package cli;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

/** @author JBM */
public class RCPTest {
    private static final Random random = new Random();

    public static void main(String[] args) {
        // prime the JIT before timing anything
        (new Warmup()).test(testString(1));
        System.out.println();
        // input sizes: 5, 50, 500 and 5000 numbers
        for (int i = 1; i < 5; i++) {
            int padding = (int) Math.pow(10, i) / 2;
            System.out.println("Size: " + padding);
            String s = testString(padding);
            (new RegEx()).test(s);
            (new Compiled()).test(s);
            (new Parse()).test(s);
            (new Token()).test(s);
            System.out.println();
        }
    }

    /** Builds a space-separated string of count random ints. */
    private static String testString(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(random.nextInt());
            sb.append(" ");
        }
        return sb.toString();
    }
}
abstract class Test {
    public static final int COUNT = 1000;

    public void test(String in) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < COUNT; i++) {
            split(in);
        }
        System.out.println(name()
                + (System.currentTimeMillis() - start));
    }

    public abstract String[] split(String in);

    public abstract String name();
}
class Warmup extends Test {
    public String[] split(String in) {
        return (new RegEx()).split(in);
    }

    public String name() {
        return "Warmup: ";
    }
}

class RegEx extends Test {
    // String.split() compiles the "\\s+" pattern on every call
    public String[] split(String in) {
        return in.split("\\s+");
    }

    public String name() {
        return "RegEx: ";
    }
}

class Compiled extends Test {
    private static final Pattern p = Pattern.compile("\\s+");

    public String[] split(String in) {
        return p.split(in);
    }

    public String name() {
        return "Compiled: ";
    }
}
class Parse extends Test {
    // Hand-rolled split on single spaces; no regex involved.
    public String[] split(String in) {
        List<String> list = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        int len = in.length();
        int i = 0;
        char c;
        while (i < len) {
            c = in.charAt(i++);
            if (c != ' ') {
                sb.append(c);
            }
            // token ends at a space or at the end of the input
            if (c == ' ' || i == len) {
                list.add(sb.toString());
                sb.setLength(0); // reuse the builder for the next token
            }
        }
        return list.toArray(new String[0]);
    }

    public String name() {
        return "Parse: ";
    }
}
class Token extends Test {
    public String[] split(String in) {
        Reader reader = new StringReader(in);
        StreamTokenizer tokens = new StreamTokenizer(reader);
        List<String> list = new ArrayList<String>();
        double d;
        try {
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                // nval is only meaningful for TT_NUMBER tokens,
                // which is all the test data contains
                d = tokens.nval;
                list.add(Double.toString(d));
                token = tokens.nextToken();
            }
            return list.toArray(new String[0]);
        } catch (IOException ex) {
            ex.printStackTrace(System.err);
            return new String[0];
        }
    }

    public String name() {
        return "Token: ";
    }
}
</code>