An interview question.

Matt

I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC



I appreciate any hints.



Matt
 
Emmanuel Delahaye

Matt wrote on 27/07/04:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

A classic word-count program. What exactly is your C question?
 
Malcolm

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
First, you want to query your spec. Usually by "compression" we mean a
reversible process by which the original text is represented in a more
compact format. What you have been given is more like a concordance
program.

Still, it is a typical general assignment.

Next, you need a definition of what counts as a word. Probably any
sequence of alphabetic characters or apostrophes, separated by
whitespace or non-apostrophe punctuation, will do.
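
For instance, a minimal sketch of that definition in C (treating only
letters and apostrophes as word characters):

#include <ctype.h>

/* A word character is a letter or an apostrophe; anything else
   (whitespace, digits, other punctuation) separates words. */
static int is_word_char(int ch)
{
    return isalpha((unsigned char)ch) || ch == '\'';
}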

You need to go through the input file, reading the text and pulling out
each word. Beware that some text files contain extremely long lines,
because newline may be used only as a paragraph marker, so a fixed-size
line buffer may not be safe; reading character by character avoids that.

As you pull out each word, you need to check whether it is in the
dictionary. If it is, increment its count; if not, add the word with a
count of one.
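
A naive dictionary is just a linked list. A sketch, assuming each word
has already been pulled out into a NUL-terminated string:

#include <stdlib.h>
#include <string.h>

struct entry {
    char *word;
    unsigned long count;
    struct entry *next;
};

/* Bump the word's count if it is already in the list, otherwise
   prepend a new entry with a count of one. Returns 0 only if an
   allocation fails. */
static int add_word(struct entry **dict, const char *word)
{
    struct entry *e;

    for (e = *dict; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count++;
            return 1;
        }
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = *dict;
    *dict = e;
    return 1;
}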

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.
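
If you go the hash table route, any decent string hash will do to pick
a bucket; each bucket is then a short linked list, so a routine like
add_word above can be reused on table[hash(word)]. A sketch using the
well-known djb2 string hash:

#define NBUCKETS 4096u

/* djb2: a simple, widely used string hash, reduced here to a
   bucket index. */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}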

You need to ask "what running time is acceptable?" to know whether or not
they are looking for optimisation.
 
Julian V. Noble

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt

You might want to read the chapter on compression in "The Turing Omnibus"
by A.K. Dewdney, for a popular introduction to the gist of this algorithm.

--
Julian V. Noble
Professor Emeritus of Physics
(e-mail address removed)
http://galileo.phys.virginia.edu/~jvn/

"For there was never yet philosopher that could endure the toothache
patiently." -- Wm. Shakespeare, Much Ado about Nothing. Act v. Sc. 1.
 
Mabden

Malcolm said:
Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
[snip]
As you pull out each word, you need to check whether it is in the
dictionary. If it is, increment its count; if not, add the word with a
count of one.

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

Or just use a series of linked lists keyed by first letter plus
character count, so AAAAA goes in the a5 bucket and BBB goes in the b3
bucket. Worry about comparing and counting after reading in all the
words. There is no need to compare, say, AAA with AAAA, which might
save time, and you already know each word's length from parsing the
input file for whitespace.
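
A sketch of that bucket index in C, assuming every word starts with a
letter and is no longer than some assumed maximum length:

#include <ctype.h>
#include <string.h>

#define MAXLEN 32  /* assumed maximum word length */

/* Map a word to one of 26 * MAXLEN buckets by first letter and
   length, e.g. "BBB" lands in the bucket for ('b', 3). */
static int bucket_index(const char *word)
{
    size_t len = strlen(word);

    if (len == 0 || len > MAXLEN)
        return -1;  /* caller must deal with overlong words */
    return (tolower((unsigned char)word[0]) - 'a') * MAXLEN
           + (int)(len - 1);
}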

I'm sure it's a short list of words that would fit in memory, as
homework usually is. ;-)
 
tigervamp

"> > I was asked to compress a text file in a way that evey word be
written
The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

If it's taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

Rob Gamble
 
Malcolm

tigervamp said:
If it's taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.
I should have been a bit clearer. The text file is n bytes in size, so
we need to search the dictionary O(n) times. However, the dictionary
gets bigger as we add words, and the bigger the text file, the more
unique words there will be. For medium-sized English-language text
files the relationship is roughly linear, so we have O(n) words in the
dictionary. A naive program is thus O(n^2).
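
To make the arithmetic concrete: if the i-th word is checked against a
dictionary already holding roughly i entries, the total work is about
1 + 2 + ... + n = n(n+1)/2 comparisons, which grows as n^2.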

If you use the first-letter index method, you are still technically O(n^2),
but you have cut the search time by a factor of 26, and for practical
purposes this might well be as good as a red-black tree implementation.
 
kal

If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.

Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.

Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( n(n-1) / 4 ), since (n-1)/2 + (n-2)/2 + ... + 1/2
sums to n(n-1)/4. But it is customary to refer to this as O(n^2) for
sufficiently large values of n.

OTOH, if the stream consists solely of 1 million occurrences
of the word "genius"...

OTOOH, you ought not to have been writing your name so many times.
 
Alex Fraser

kal said:
(e-mail address removed) (tigervamp) wrote in message


Shouldn't it be O(n/2) on average?

O(n/2) makes as much sense as O(n(n-1)/4).

[snip]
Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Alex
 
Thomas Matthews

Matt said:
I was asked to compress a text file so that every word is written
with the number of times it is repeated. Something like this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........


the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC



I appreciate any hints.



Matt

One of the popular compression algorithms is based on this
technique. I think it might be Huffman, or one that starts with an 'L'.
Perhaps somebody else will chime in. :)

Essentially, the algorithm creates a table of patterns; then, when a
pattern is repeated, it writes out the index of that pattern.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
 
Arthur J. O'Dwyer

I was asked to compress a text file so that every word is written
with the number of times it is repeated. [...]

One of the popular compression algorithms is based on this
technique. I think it might be Huffman, or one that starts with an 'L'.
Perhaps somebody else will chime in. :)

Essentially, the algorithm creates a table of patterns; then, when a
pattern is repeated, it writes out the index of that pattern.

This is the way /all/ lossless compression algorithms work!
The simplest of these is ROT-26, in which the index corresponding
to a pattern P is the value of P itself. In Huffman compression,
the size of the index value of P gets shorter as P gets more
common, and longer as P gets less common. In LZW compression,
the number of patterns "tracked" with index values changes dynamically
with the contents of the input data stream.

Remember: The OP's problem is /not/ lossless compression! His
output lets us reconstruct that the input contained, say, two
"AAA"s and one "AAB", but not in which order they occurred.

F-up-to: comp.programming

-Arthur
 
tigervamp

Could that "something" be that it works?


Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Not quite sure how you come up with this. The act of searching such a
list should be O(n): the time it takes to find an item is directly
proportional to the number of items in the list. Even if on average
you only search half the list before finding the target, you will
still spend twice as long looking in a list that's twice the size.
What am I missing here?

Rob Gamble
 
Keith Thompson

(e-mail address removed) (tigervamp) wrote in message


Could that "something" be that it works?


Shouldn't it be O(n/2) on average?

n/2 *is* O(n); there's no difference between O(n/2) and O(n).
Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Similarly, n(n-1)/4 is O(n^2).

A Google search turns up <http://en.wikipedia.org/wiki/Big_O_notation>.
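
(Informally, f(n) is O(g(n)) if f(n) <= c*g(n) for some fixed constant
c once n is large enough, which is why constant factors like the 1/2
and the 1/4 above make no difference.)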

BTW, this really has nothing to do with C. You might try
comp.programming.
 
