An interview question.

Matt

I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC



I appreciate any hints.



Matt
 
Emmanuel Delahaye

Matt wrote on 27/07/04:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

A classic word-count program. What exactly is your C question?
 
Malcolm

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
First, you want to query your spec. Usually by "compression" we mean a
reversible process by which the original text is represented in a more
compact format. What you have been given is more like a concordance
program.

Still, it is a typical general assignment.

Next, you need a definition of what counts as a word. Probably any
sequence of alphabetic characters or apostrophes, separated by
whitespace or non-apostrophe punctuation, will do.
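
For instance, a minimal sketch of that definition in C (treating only
letters and apostrophes as word characters):

#include <ctype.h>

/* A word character is a letter or an apostrophe; anything else
   (whitespace, digits, other punctuation) separates words. */
static int is_word_char(int ch)
{
    return isalpha((unsigned char)ch) || ch == '\'';
}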

You need to go through the input file, reading the text and pulling out
each word. Beware that some text files contain extremely long lines,
because newline may be used only as a paragraph marker, so a fixed-size
line buffer may not be safe; reading character by character avoids that.

As you pull out each word, you need to check whether it is in the
dictionary. If it is, increment its count; if not, add the word with a
count of one.
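
A naive dictionary is just a linked list. A sketch, assuming each word
has already been pulled out into a NUL-terminated string:

#include <stdlib.h>
#include <string.h>

struct entry {
    char *word;
    unsigned long count;
    struct entry *next;
};

/* Bump the word's count if it is already in the list, otherwise
   prepend a new entry with a count of one. Returns 0 only if an
   allocation fails. */
static int add_word(struct entry **dict, const char *word)
{
    struct entry *e;

    for (e = *dict; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count++;
            return 1;
        }
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = *dict;
    *dict = e;
    return 1;
}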

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.
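
If you go the hash table route, any decent string hash will do to pick
a bucket; each bucket is then a short linked list, so a routine like
add_word above can be reused on table[hash(word)]. A sketch using the
well-known djb2 string hash:

#define NBUCKETS 4096u

/* djb2: a simple, widely used string hash, reduced here to a
   bucket index. */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}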

You need to ask "what running time is acceptable?" to know whether or not
they are looking for optimisation.
 
Julian V. Noble

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt

You might want to read the chapter on compression in "The Turing Omnibus"
by A.K. Dewdney, for a popular introduction to the gist of this algorithm.

--
Julian V. Noble
Professor Emeritus of Physics
(e-mail address removed)
http://galileo.phys.virginia.edu/~jvn/

"For there was never yet philosopher that could endure the toothache
patiently." -- Wm. Shakespeare, Much Ado about Nothing. Act v. Sc. 1.
 
Mabden

Malcolm said:
Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
[snip]
As you pull out each word, you need to check whether it is in the
dictionary. If it is, increment its count; if not, add the word with a
count of one.

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

Or just use a series of linked lists keyed by first letter plus
character count, so AAAAA goes in the a5 bucket and BBB goes in the b3
bucket. Worry about comparing and counting after reading in all the
words. There is no need to compare, say, AAA with AAAA, which might
save time, and you already know each word's length from parsing the
input file for whitespace.
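
A sketch of that bucket index in C, assuming every word starts with a
letter and is no longer than some assumed maximum length:

#include <ctype.h>
#include <string.h>

#define MAXLEN 32  /* assumed maximum word length */

/* Map a word to one of 26 * MAXLEN buckets by first letter and
   length, e.g. "BBB" lands in the bucket for ('b', 3). */
static int bucket_index(const char *word)
{
    size_t len = strlen(word);

    if (len == 0 || len > MAXLEN)
        return -1;  /* caller must deal with overlong words */
    return (tolower((unsigned char)word[0]) - 'a') * MAXLEN
           + (int)(len - 1);
}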

I'm sure it's a short list of words that would fit in memory, as
homework usually is. ;-)
 
tigervamp

"> > I was asked to compress a text file in a way that evey word be
written
The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

If it's taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

Rob Gamble
 
Malcolm

tigervamp said:
If it's taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.
I should have been a bit clearer. The text file is n bytes in size, so
we need to search the dictionary O(n) times. However, the dictionary
gets bigger as we add words, and the bigger the text file, the more
unique words there will be. For medium-sized English-language text
files the relationship is roughly linear, so we have O(n) words in the
dictionary. A naive program is thus O(n^2).
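
To make the arithmetic concrete: if the i-th word is checked against a
dictionary already holding roughly i entries, the total work is about
1 + 2 + ... + n = n(n+1)/2 comparisons, which grows as n^2.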

If you use the first-letter index method, you are still technically O(n^2),
but you have cut the search time by a factor of 26, and for practical
purposes this might well be as good as a red-black tree implementation.
 
kal

If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.

Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.

Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( n(n-1) / 4 ), since (n-1)/2 + (n-2)/2 + ... + 1/2
sums to n(n-1)/4. But it is customary to refer to this as O(n^2) for
sufficiently large values of n.

OTOH, if the stream consists solely of 1 million occurrences
of the word "genius"...

OTOOH, you ought not to have been writing your name so many times.
 
Alex Fraser

kal said:
(e-mail address removed) (tigervamp) wrote in message


Shouldn't it be O(n/2) on average?

O(n/2) makes as much sense as O(n(n-1)/4).

[snip]
Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Alex
 
Thomas Matthews

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC



I appreciate any hints.



Matt

One of the popular compression algorithms is based on this
technique. I think it might be Huffman, or one that starts with an 'L'.
Perhaps somebody else will chime in. :)

Essentially, the algorithm creates a table of patterns; then, when a
pattern is repeated, it writes out the index of that pattern.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
 
Arthur J. O'Dwyer

I was asked to compress a text file so that every word is written
with the number of times it is repeated. [...]

One of the popular compression algorithms is based on this
technique. I think it might be Huffman, or one that starts with an 'L'.
Perhaps somebody else will chime in. :)

Essentially, the algorithm creates a table of patterns; then, when a
pattern is repeated, it writes out the index of that pattern.

This is the way /all/ lossless compression algorithms work!
The simplest of these is ROT-26, in which the index corresponding
to a pattern P is the value of P itself. In Huffman compression,
the size of the index value of P gets shorter as P gets more
common, and longer as P gets less common. In LZW compression,
the number of patterns "tracked" with index values changes dynamically
with the contents of the input data stream.

Remember: The OP's problem is /not/ lossless compression! His
output lets us reconstruct that the input contained, say, two
"AAA"s and one "AAB", but not in which order they occurred.

F-up-to: comp.programming

-Arthur
 
tigervamp

Could that "something" be that it works?


Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Not quite sure how you come up with this. The act of searching such a
list should be O(n): the time it takes to find an item is directly
proportional to the number of items in the list. Even if on average
you only search half the list before finding the target, you will
still spend twice as long looking in a list that's twice the size.
What am I missing here?

Rob Gamble
 
Keith Thompson

(e-mail address removed) (tigervamp) wrote in message


Could that "something" be that it works?


Shouldn't it be O(n/2) on average?

n/2 *is* O(n); there's no difference between O(n/2) and O(n).
Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Similarly, n(n-1)/4 is O(n^2).

A Google search turns up <http://en.wikipedia.org/wiki/Big_O_notation>.
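
(Informally, f(n) is O(g(n)) if f(n) <= c*g(n) for some fixed constant
c once n is large enough, which is why constant factors like the 1/2
and the 1/4 above make no difference.)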

BTW, this really has nothing to do with C. You might try
comp.programming.
 
