Java vs C++ speed (IO & Sorting)

Bo Persson

Razii said:
My Java version, U++ version, D version are all doing the same
thing, creating strings from bytes. What are you talking about?

In the last case we were parsing lines as strings and sorting them.
That took far less time than reading and writing the file.

In this case...

(1) We are not counting the output time.
(2) it takes time to parse each word (instead of whole line) as a
distinct string. (this takes most of the time in this benchmark)
(3) You need to save the new word/string and increment the count
each time the word is found again (i.e use some kind of map
container). (this also takes most of the time).
(4) Sort the list -- which in C++ version is already sorted due to
using sorted map.

Number 2 and number 3 take more time than reading the file itself.
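
Steps (2) to (4) above can be sketched in a few lines of standard C++. This is only an illustration, not any of the programs posted in the thread: the helper name `count_words` and the definition of a word (a run of ASCII letters, lowercased) are assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: counts how often each word occurs in `text`.
// A "word" is assumed to be a maximal run of ASCII letters, lowercased.
std::map<std::string, std::size_t> count_words(const std::string& text) {
    std::map<std::string, std::size_t> counts;  // std::map keeps keys sorted
    std::string word;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            // Step 2: build up the current word one character at a time.
            word.push_back(static_cast<char>(std::tolower(c)));
        } else if (!word.empty()) {
            ++counts[word];  // Step 3: insert the word or increment its count.
            word.clear();
        }
    }
    if (!word.empty()) ++counts[word];  // Flush a word that ends the input.
    return counts;                      // Step 4: already ordered by key.
}
```

Calling `count_words` on the contents of a file returns the words in alphabetical order with their counts, which is why the sorted-map version needs no separate sort step.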

Ok, so then why did you use a memory mapped file to improve the Java
version?

You have just proved that good Java code can be faster than bad C++
code. What's new here?


Bo Persson
 
Razii

Ok, so then why did you use a memory mapped file to improve the Java
version?

I have tried it without mapping (just buffered reader)
and it's still around 3000 ms (vs 5600 ms for C++).

You have just proved that good Java code can be faster than bad C++
code. What's new here?

You have proven that you don't know how to fix C++ version. Otherwise
you would have done it already.
 
stan

Mirek said:
Well, now this sounds like a bunch of very childish excuses to me...
Of course, Razii's posts are somewhat annoying, but IMO C++ community
should take these issues a little bit more seriously. It is way too
simple to outperform C++ nowadays with languages supposed to be much
slower. It would be ridiculous if C++ gains the "legacy language
status" just because it looks slow...

You must have missed several points. "ADULTS" respond differently to
childish taunts, other childish people respond like children. NO
competent programmer is going to waste time on nonsense such as
meaningless benchmarks.

Again, competent programmers use many tools and
choose the proper tool for the job. This thread hasn't created even a
hint at anything useful other than childish worries that one of the
tools will be called "legacy language status" by someone. C is even
older than C++ and it's clearly not going away, but even if C and C++ do
go away and programmers are forced to move to other tools to get the job
done, then so be it. They are tools.

Programmers face many complex issues every day and many that are not
under a programmers control. Projects make decisions for many reasons
and speed isn't always the ultimate factor, nor is size. They are
certainly factors in many projects but there are a lot of others. In a
situation that calls for starting a program, performing a task, and
shutting down the program then obviously the startup times of
candidates are relevant and it is inappropriate to ignore the JVM
startup time. In other situations, it is appropriate to ignore the
startup times. In every case a meaningful benchmark has a specific well
defined context.

Real programmers on real projects won't bother with simplistic attempts
at benchmarking. They profile and measure. If the standard library turns
out too slow for a specific task then another solution is sought. If
your attempt to fix the STL proves to be a useful tool the programmers
will put it in their toolbox.

None of that is really relevant here though. This group's purpose is to
focus on C++ and not C++ vs Java or in fact C++ vs "insert your favorite
here". Well done or not, comparison benchmarks are more appropriate
elsewhere. You can safely assume that programmers in this group are
using C++ for any one of many reasons but the important point is that the
choice has been made. People being paid to program won't switch to "your
favorite language" even if a couple of rude kids post toy benchmarks.
They won't switch because kids say mean or taunting things. In fact
they won't switch even if you post toy benchmarks and sweeten the pot
with giving away hats that say "Java Rules". Many here are professional
programmers using C++ as a tool and are aware of the tool's squeaky
wheels.

And, after all, I did post C++ code that Razii is completely
unable to beat... :)

How wonderful for you, and how sad at the same time. You choose to feed
the troll and act as rude as he. You choose to continue a crossposted off
topic thread even after you've been reminded of netiquette. You choose to
not only respond to childish taunts, but to join in the game. How truly
sad.

If you are truly concerned about the perception of C++, then why not
take this thread over to the newly created group alt.comp.lang.c++.misc
where I think it may be appropriate; I haven't read up on the group.
 
Razii

How wonderful for you, and how sad at the same time. You choose to feed
the troll and act as rude as he. You choose to continue a crossposted off
topic thread even after you've been reminded of netiquette. You choose to
not only respond to childish taunts, but to join in the game. How truly
sad.

I suggest, Mirek Fidler, totally ignore this troll known as stan. He has
some kind of crush on me. All his posts to this newsgroup are about me
(he has nothing else to contribute here). He obsessively reads all my
posts; he has posted to all my threads. He has responded to each and
every post of mine which was a response to him (a typical behavior of
a troll who tries his best to get the last word).

Watch. He will respond to this post too and call me a troll. The guy
is a hypocrite too. He will ask you to ignore me but he will respond
to this post (as he has done with each and every post that was
response to him). I suggest ignore him. Don't feed the trolls.
 
Mirek Fidler

competent programmer is going to waste time on nonsense such as
meaningless benchmarks.

What was so meaningless about the benchmark?

If Java supposedly performs better than C++, I sure as hell want to know
why...

Again, competent programmers use many tools and
choose the proper tool for the job. This thread hasn't created even a
hint at anything useful other than childish worries that one of the
tools will be called "legacy language status" by someone. C is even
older than C++ and it's clearly not going away, but even if C and C++ do
go away and programmers are forced to move to other tools to get the job
done, then so be it. They are tools.

Ah, I see. You are the kind of programmer that goes each morning to
the office, spends 8 hours there "solving problems using tools", takes
his salary and goes home, forgetting anything that he ran into
during the day.

Well, I am sorry. After 20 years of professional career as a C++ coder,
I am still quite passionate about IT and still want to know what, how
and why.

Real programmers on real projects won't bother with simplistic attempts
at benchmarking.

Simplistic attempts at benchmarking often lead to real results. Trust
me. Been there, done that.

They profile and measure. If the standard library turns
out too slow for a specific task then another solution is sought.

But by then, it is usually too late.

None of that is really relevant here though. This group's purpose is to
focus on C++ and not C++ vs Java or in fact C++ vs "insert your favorite

Knowing why C++ code runs slower than Java is IMO a very relevant
question.

If you are truly concerned about the perception of C++, then why not
take this thread over to the newly created group alt.comp.lang.c++.misc
where I think it may be appropriate; I haven't read up on the group.

But you still seem to read this thread :) You can ignore it after all.
There is much more spam in the group that you have to ignore anyway.

Meanwhile, I will study and exchange ideas, making most of
my code run faster again. Worth the time spent, IMO.

Mirek
 
Razii

But you still seem to read this thread :) You can ignore it after all.
There is much more spam in the group that you have to ignore anyway.

Exactly. There is a zillion spam posts on the newsgroup but stan only has
an obsession with me personally (all his posts to this newsgroup are in
my threads or in response to me. His posts are not even on topic; they
are ad hominem insults and diatribe directed at me, personally).

Anyway, ignore stan. Where are your numbers that you were going to
post? The final java version is this:

http://pastebin.com/f2af3a5c8
 
stan

Razii said:
Just as I predicted, in most situations using hash map is much faster
than using tree map:

This was C++ version that used map

http://www.pastebin.ca/965243

after changing it to tr1/unordered_map and a few other changes

http://pastebin.com/f64eaec59

that's 2 times faster.

Conclusion: don't use <map>. Use <unordered_map> (i.e HashMap).
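
The swap being described can be sketched as follows. The thread used `tr1/unordered_map`; this sketch assumes a C++11 compiler and the standard `<unordered_map>` header instead, and the names `count_tokens`, `TreeCounts` and `HashCounts` are invented for the example. Whether the hash version really is faster depends on the data, the hash function and the library implementation, so it is worth timing both on your own input.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: counts whitespace-separated tokens into any
// map-like container that supports operator[] on std::string keys.
template <typename Map>
Map count_tokens(const std::string& text) {
    Map counts;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        ++counts[token];  // Inserts the token with count 0 if absent, then increments.
    }
    return counts;
}

// Tree-based map: O(log n) lookups, iteration in sorted key order.
using TreeCounts = std::map<std::string, std::size_t>;
// Hash-based map: average O(1) lookups, iteration order unspecified.
using HashCounts = std::unordered_map<std::string, std::size_t>;
```

`count_tokens<TreeCounts>(text)` and `count_tokens<HashCounts>(text)` produce the same counts; only the hash version needs an explicit sort afterwards if ordered output is required.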

With the credibility you've earned, I'm sure your ideas will receive
their due worth.

I was serious when I suggested you might want to take a class or
something to learn about algorithms. When you talk to people
who program for a living about their craft you really do want to have
some real knowledge when you come to the table.

Of course if your purpose is to simply troll then you already meet the
qualifications, education would be wasted.
 
stan

Razii said:
Another post filled with ad hominem insults and personal attacks by
the clown "stan" (who has no history in this newsgroup and only
started posting here regularly, apparently, because of me). This clown
has some sort of crush on me and mostly posts in the thread started by
me.

You might want to look up ad hominem before you try to use the phrase.
You seem to be disturbed by having your bad behavior pointed out. Most
trolls ignore when people correct them. You're not very good at
trolling behavior. Is it new to you?

I will add a hint for you. Referring to your behavior is very different
from calling names as you have done. In the quoted post above you note
my observation and then proceed to call me names. Your reply is an ad
hominem attack. My observation would be at worst an opinion regarding
your observed behavior and by many an accurate statement of fact. It's
an inescapable fact that behavior affects credibility. There may be
people who think you have earned good credibility, my statement doesn't
discount them. I'll add that as far as personal attacks go, the above is
pretty juvenile.

I'm also not surprised that you don't know enough about Usenet to know
how limited Google is. I have several accounts where I post; at home, at
school, at several work sites, etc... Most of those sites have different
user names for various reasons dictated by others. Not that I care what
you think, but there may be others as naive as you who believe Google's
statistics are accurate or complete.

I am about to PLONK him though -- good reason to test filters that
come with news clients.

I stated several times that I don't have high hopes that you will "see
the light" and that I hope there's a possibility that others might be
prevented from acting like you. The only effect your plonk would have is
to stop your ad hominem attacks. While they don't bother me, I guess in
some way the group is a better place without name calling.
 
asterisc

Ok, if I get this you are using a memory mapped file in your Java
version.

Well, I said ignore that version since it was missing words (due to a
stupid typo). This is the one that is working

http://pastebin.com/d48680a60

Time for 40 meg file

Time: 2281 ms

for C++..

C:\>wc1 bible2.txt
Time: 5421 ms

In fact

C:\>java -server WordCount3 bible2.txt bible2.txt
Time: 4344 ms

even with two bible2.txt, it's still faster than C++ with one
bible2.txt

Ok, if I get this you are using a memory mapped file in your Java
version.

Well then change the following in C++ version to mapping the file...
std::ifstream input_file( argv[1] );
std::ostringstream buffer;
buffer << input_file.rdbuf();
std::string input( buffer.str() );


Here are two links:
test2 -> http://pastebin.com/f6a851e86
test3 -> http://pastebin.com/f68a0b61c
Give them a try on your machine and publish the results.
 
asterisc

The first one was about the same and second one didn't compile.
Anyway, the C++ version was fixed already by using unsorted map and
some other changes

try this version: http://pastebin.com/f64eaec59 (which is slightly
faster than java version by about 100 ms).

Also, someone posted trie c++ version that didn't use any kind of map.
It was fastest with only around 1200 ms on my comp.

test3 -> http://pastebin.com/m52fc493e (it worked on my gcc 4.1.2 before,
but now it should compile on VC++ as well; I tried it on vc++ 7.1
(2003) and vc++ 2008 express edition).
It has some warnings but I don't have time right now to fix them. It
should work smoothly as it is now.

Can you please post the times on your machine?

I also see that the last version you posted for C++ (the one which is
faster than java) uses std::string as well. I'm wondering what would
happen if it used char * instead? (Only by changing std::string to
char * I sped up my program from 660ms to 330ms, on my machine).
 
asterisc

It only compiled with VC++ not GCC

(snip)
1, zeal+?2
1, zeal+?2
1, zeal+?2
1, zeal+?2
1, zeal-?2
1, zeala?2
1, zealF?2
1, zealF?2
Time: 6062ms

what the heck? What kind of output is that?


It *was* faster than java version but I already have a new version
that is faster but I am wondering how to improve it further before I
post it.


330 ms? Are you using 40MB test file?

No, 330ms was for the bible.txt (3.9MB AFAIK). But I'm using a laptop
with a 4000rpm HDD so the HDD I/O is not that great, that's why I
asked for the times on your machine. It's strange that you say it
isn't working correctly. On my Gentoo box it works just fine, no
warnings and a correct output. I will give it a try tomorrow and
correct it for Windows, if necessary.
 
Mirek Fidler

Well, Mirek Fidler, what do you say now? :)

Hehe, nice trick :)

Should I extend the problem redefining the definition of word by
allowing all win-1252 letters in them? (basically, you will have 2*26
+ 128 letters :)

Still, you have about 50% of performance to go.

Also, you should change the testing file; the same file pasted together
10x nicely avoids the memory allocation bottleneck here. Put together
10 different files to get a 40MB one, preferably in different
languages, and my gut feeling is that you are likely to get somewhat
different results. (With your testing file and the code used, only the
first occurrence of the file is really relevant, so the whole benchmark
becomes too irrelevant to real world code).

Anyway, it is all fun, is not it? :)

Mirek
 
Bo Persson

asterisc said:
No, 330ms was for the bible.txt (3.9MB AFAIK). But I'm using a
laptop with a 4000rpm HDD so the HDD I/O is not that great, that's
why I asked for the times on your machine. It's strange that you
say it isn't working correctly. On my Gentoo box it works just
fine, no warnings and a correct output. I will give it a try
tomorrow and correct it for Windows, if necessary.

The part that doesn't compile portably is this

file_handle->pubseekpos( 0, ios::in );

It should be

file_handle->pubseekoff( 0, ios::beg, ios::in );


Also note that strncpy does not append a nul character to the copied
string, so whether it works or not depends on the contents of the
allocated memory. Could be cleared by the system, or not.
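
The strncpy pitfall is easy to reproduce: when the source holds at least as many characters as the requested count, `std::strncpy` copies exactly that many and writes no terminator, so whatever bytes follow in the buffer become part of the "string". A minimal sketch of the usual fix is below; the helper name `copy_word` is made up for the example and is not taken from the posted code.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies up to `len` characters starting at `src` into
// a fresh std::string by way of a char buffer, terminating it explicitly.
std::string copy_word(const char* src, std::size_t len) {
    char* buf = new char[len + 1];  // One extra byte for the terminator.
    std::strncpy(buf, src, len);    // Writes no '\0' if src has at least len chars.
    buf[len] = '\0';                // So terminate by hand.
    std::string result(buf);
    delete[] buf;
    return result;
}
```

For example, `copy_word("zealous", 4)` yields `"zeal"`. Without the explicit `buf[len] = '\0'` the result would depend on the uninitialized byte after the copied characters, which would be consistent with trailing garbage such as that seen in the `zeal+?2` output.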


Bo Persson
 
asterisc

The part that doesn't compile portably is this

file_handle->pubseekpos( 0, ios::in );

It should be

file_handle->pubseekoff( 0, ios::beg, ios::in );

Also note that strncpy does not append a nul character to the copied
string, so whether it works or not depends on the contents of the
allocated memory. Could be cleared by the system, or not.

Bo Persson

This is the new link. Razii please give it a run and post the results.
http://pastebin.com/m3e7842c

No, pubseekpos is ok, I think it is an alternative to pubseekoff. I have
never used them before and I only read a bit of the help.

Indeed, that was a problem: strncpy doesn't append \0 if your copy
doesn't catch the end of the source string.

I fixed the warnings as well, some casts.
 
Mirek Fidler

With JET (6.4 beta) it's around 1012 ms vs 828 ms...

Yeah, OTOH, you are not using the most optimal C++ setup either.
MSC/Win32 is not king of the hill anymore.

If you want to see optimal C++/U++ performance, install Ubuntu 64.

It would be hard to find a 40 MB text file. Anyway, I created
a 23 MB file from http://www.gutenberg.org/wiki/Main_Page (with far
more words than 40x bible). Well, the difference increased: 609 ms vs
1011 ms.

Using JET?

Another thing you should try is to comment out output and run with
bible 10 times in a loop, counting words just for a single file - it will
be interesting to see what GC really does here :).

And of course, any serious benchmark should be concerned about memory
consumption too...

And there is more bad news. The customer that ordered 'wc' now
requires the result to be sorted by frequency, starting with the most
often used words first. In the U++ version, this is trivial:

Vector<int> order = GetSortOrder(map.GetValues());

How about your Java version? :)
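
For comparison, here is a sketch of the same "sort by frequency" requirement using only the standard C++ library rather than U++'s `GetSortOrder`. The names `WordCount` and `by_frequency`, and the choice to break ties alphabetically, are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::size_t>;

// Hypothetical helper: returns the entries of `counts` ordered by
// descending count; ties are broken alphabetically.
std::vector<WordCount> by_frequency(const std::map<std::string, std::size_t>& counts) {
    // Copy the (word, count) pairs out of the map so they can be reordered.
    std::vector<WordCount> order(counts.begin(), counts.end());
    std::sort(order.begin(), order.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return order;
}
```

The extra cost is one copy of the entries plus an O(n log n) sort over the distinct words, which is typically small next to parsing the input when the number of distinct words is far below the total word count.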

Mirek
 
Mirek Fidler

I got the IDE from UPP site and whatever came with the package
(included g++).

By the way, UPP wc version is slower with g++

(with 20x 80mb bible3.txt)

1921 ms (g++ -- MinGW)
1650 ms (MSC9)

Yeah, but that is MinGW. You should test with gcc in its native
environment... MinGW is far less optimal.

BTW, I suggest retesting your original "sort the bible" in Linux too.
I have tested your C++ implementation on both MSC/Win32 and in Linux
and in Linux it is about 3 times faster. The reason is that std:: of
MSC sucks big time and streams suck even more.

I don't have to loop. I can enter files in command line (each file is
80 Mb 20x bible).

Well, but that obviously is not the same thing - in that case, all
words will be combined in single output. Basically, after consuming
the first file, there are no more memory allocations for the rest.

the java version uses about 3.5 more memory.

3.5 of what? 3.5x more?

The most time consuming part is reading, parsing, checking for repeat
words, and incrementing counts. Once all the words and counts are in
memory, it will take no time to transfer them to proper container and
sort them whatever way.

Well, it is my gut feeling again, but I am afraid that reconstructing
words back from your nice tree structure will not be as simple a task as
you think. I might be wrong, but I think it can very well take much
longer for the "single bible case" than the parsing/mapping phase.

Mirek
 
Mirek Fidler

It took only 120 ms ( for bible.txt) to fill HashMap with words and
counts...

//time
final long start = System.currentTimeMillis();

HashMap<String, Integer > map = new HashMap<String, Integer>(1600);

Well, 120ms is quite good, but still about 15% of time... BTW, filling
HashMap will not likely help you to finish the task (sorting by
number), you should rather fill something more reasonable to be able
to sort that later... (but I guess, you could do so in even less
time).

So, what about the loop?

Mirek
 
asterisc

C:\>g++ -O2 -fomit-frame-pointer "wc4.cpp" -o "wc4.exe"
wc4.cpp: In member function 'bool ltstr::operator()(char*, char*) const':
wc4.cpp:13: error: 'strcmp' was not declared in this scope
wc4.cpp: In function 'int main(int, char**)':
wc4.cpp:72: error: 'strncpy' was not declared in this scope

#include <cstring> instead of <string>
 
asterisc

Finding a word in the tree is even quicker, probably much quicker than
UPP VectorMap. Searching a word would only take O(m) time where m is
the length of word. In the case of VectorMap, you will need to hash it
and there could be hashing collisions which can affect search time.
Searching the tree takes the same time even if it has a million words
(though the memory needed would be large for a tree of a million words).
I am pretty sure Java version will be faster than UPP.

This is not Java vs C++ anymore. This is a comparison of algorithms.
You should be able to compare it only in Java, using a hash-table vs a
char tree.

It doesn't make any sense to compare two different algorithms in two
different languages.
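
To make the "char tree" side of that comparison concrete, here is a minimal trie sketch in C++. It is not the code from any of the pastebin links; the names `TrieNode`, `trie_add` and `trie_collect` are invented, and it assumes words contain only the lowercase letters 'a' to 'z'. Insertion costs O(m) for a word of length m regardless of how many words are stored, and a depth-first walk rebuilds the words in alphabetical order.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A minimal character tree (trie) for lowercase ASCII words.
struct TrieNode {
    std::size_t count = 0;                // Times a word ended at this node.
    std::unique_ptr<TrieNode> child[26];  // One slot per letter 'a'..'z'.
};

// Adds one occurrence of `word`; cost is O(m) for a word of length m.
// Assumes `word` contains only 'a'..'z'.
void trie_add(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& next = node->child[c - 'a'];
        if (!next) next.reset(new TrieNode());  // Allocate nodes on demand.
        node = next.get();
    }
    ++node->count;
}

// Walks the tree depth-first and rebuilds every stored word with its count.
// Children are visited 'a' to 'z', so output is in alphabetical order.
void trie_collect(const TrieNode& node, std::string& prefix,
                  std::vector<std::pair<std::string, std::size_t>>& out) {
    if (node.count > 0) out.emplace_back(prefix, node.count);
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            trie_collect(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}
```

The trade-off raised in the thread shows up directly: each node carries 26 pointers, so memory grows quickly with the number of distinct prefixes, and recovering the word list requires the full traversal in `trie_collect` rather than a simple iteration over a map.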
 
asterisc

it's about writing fastest word counter. U++ uses non standard
strings, map, and non std IO. How is that fair? Mirek Fidler asked
that I write java version of wc. He never said I must do it in one
way. Also, it shows that if you have better algorithm, java program
can be 6 times faster than poorly written c++ version (the original
c++ version is 5600 ms -- this java version is 1000 ms for 40 MB
file).

A better algorithm always wins, no matter which language it is
implemented in.

A quicksort in Java will always beat a bubble sort in C++ or even ASM
on a large amount of elements.

And by the way, what does non-std mean? Is it valid C++ code? It's
not in the STL?
 
