Matt said:
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:
If this is the original file:
---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC
I appreciate any hints.
Firstly you want to query your spec. Usually by "compression" we mean a
reversible process by which the original text is represented in a more
compact format. What you have been given is more like a concordance program.
What you have been given is a typical general assignment.
Firstly you need a definition of what is a word. Probably any sequence of
alphabetical characters or apostrophes separated by whitespace or
non-apostrophe punctuation will do.
You need to go through the input file, inputting lines and pulling out each
word. Beware that some text files contain extremely long lines, because
newline is used only as a paragraph marker.
As you pull out each word, you need to check whether it is in the
dictionary. If it is, the count is incremented, if not the word is added.
The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.
You need to ask "what running time is acceptable?" to know whether or not
they are looking for optimisation.