frequency list

J

jason_box

I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.


% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab. I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet. Thank you time.
 
R

Richard Heathfield

jason_box said:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

If you just want to know how many words there are in all, it's pretty easy,
yes. If you wish to record the frequency of each distinct word, it's not as
simple as it sounds. I don't see anyone doing it in fifteen lines, without
using either extremely long lines or, perhaps, non-standard language
extensions (such as, say, C++).

I recommend that you do /not/ try, just yet, to try the latter. Instead,
just write a program to count words. This is easy to do. Start a counter at
0, and record the current "state" of your input - which should be "not in a
word" - i.e. 0. Whenever you hit a non-whitespace character, check the
state. If it's 0 (i.e. not in a word), change it to 1 (i.e. in a word) and
add 1 to the counter. Otherwise, do nothing. Whenever you hit a whitespace
character, change the state to 0.

#include <ctype.h> to get the prototype for isspace().

Once you've dealt with all the input, print the counter.
 
A

Al Balmer

I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.


% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab.

Your first line says "frequency of words", but your example looks like
frequency of characters. Which is it?
I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet.

Do try. Then you can come back for help and critique.
 
R

Richard Heathfield

jason_box said:
Oh sorry about that, yes it is a frequency of characters in the input
file.

#include <limits.h>
#include <stdio.h>
int main(void){
unsigned long c[UCHAR_MAX + 1]={0};int ch;while((ch=getchar())!=EOF)++c[ch];
for(ch=0;ch<256;ch++)if(c[ch]>0)printf("%d:%lu\n",ch,c[ch]);return 0;}
 
J

jason_box

What is the lenght of an unsigned long versus using a standard long for
the character array?

Richard said:
jason_box said:
Oh sorry about that, yes it is a frequency of characters in the input
file.

#include <limits.h>
#include <stdio.h>
int main(void){
unsigned long c[UCHAR_MAX + 1]={0};int ch;while((ch=getchar())!=EOF)++c[ch];
for(ch=0;ch<256;ch++)if(c[ch]>0)printf("%d:%lu\n",ch,c[ch]);return 0;}


--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
 
R

Richard Heathfield

jason_box said:
What is the lenght of an unsigned long versus using a standard long

An standard long is sizeof(long) bytes in size, whereas an unsigned long,
which is just as standard as a standard long, is sizeof(unsigned long)
bytes in size.
for the character array?

What character array?
 
J

jason_box

Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list. I read that there can be many ways of doing this and even
as simple as not using any crazy data structures to get the encoding. I
know for the huffman tree, it uses a tree structure and generates a
node and eventually combines it with a root and left traversal would
denote a '0' and right traversal would denote '1' and with the
traversal to a specific node, aka a character is your encoding. Guess
now I have to find a smart way of encoding the list that is given into
the huffman codes.
 
A

Andrew Poelstra

jason_box said:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.


% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab. I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet. Thank you time.

Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.
 
R

Richard Heathfield

jason_box said:
Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list.

That's pretty trivial; and having seen you display such mastery over the
frequency list program, I don't think you'll have any trouble doing it
yourself.

Just as a brief hint, though, Huffman encoding works like this: if you have
n bytes in your data and a particular character c occurs f times, then
character c should be represented in the Huffman tree by a bit sequence
log2(n / f) bits long (except that you might want to reserve one relatively
short bit sequence as a "quote" sequence for characters that are very rare
but nevertheless present).
 
J

jason_box

I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.
 
J

jason_box

I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.
 
R

Richard Heathfield

jason_box said:
I guess the generation of the huffman code is where it'll get a little
complicated.

No, not really.
Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier.

Look up "trie".
 
J

jason_box

I've worked with Trie's before and I didn't have a good time with them,
though that is one good structure to use since you can traverse down
the tree and each node holds a character and there can be extra param
to encode 0 or 1. I'm going to look for a more basic structure, maybe
priority queue would work. Thanks for your help.
 
K

Keith Thompson

jason_box said:
I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.

You posted two copies of the same article, less than a minute apart.
Apparently there's some flaw in Google Groups that encourages this
kind of mistake.

Also, please don't top-post. Your response goes after, or mixed with,
any quoted text, and the quoted text should be trimmed to just what's
relevant to your response. In particular, don't quote the signature
unless you're commenting on it. See most of the articles in this
newsgroup for examples; see also <http://www.caliburn.nl/topposting.html>.
 
A

Al Balmer

Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.

Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.
 
A

Andrew Poelstra

Al said:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.

Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.
That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.
 
J

jason_box

From reading some more in our book, I think a min-heap would be the
best to use? or would ti be a max heap since at the end of the
encoding, the max value will be computed from the two previous
children? Also I had a question on how i could store the values
generated by the frequency list that is read in through stdin. I have
the first column as the characters in ascii value adn the second column
thhe frequency and I would liek to store it somehow that they would
always link to each other. What would be teh best method to do the
follwing?

Thank you.

Andrew said:
Al said:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.

Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.
That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.
 
J

jason_box

From reading some more in our book, I think a min-heap would be the
best to use? or would ti be a max heap since at the end of the
encoding, the max value will be computed from the two previous
children? Also I had a question on how i could store the values
generated by the frequency list that is read in through stdin. I have
the first column as the characters in ascii value adn the second column
thhe frequency and I would liek to store it somehow that they would
always link to each other. What would be teh best method to do the
follwing?

Thank you.

Andrew said:
Al said:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.

Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.
That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,183
Messages
2,570,965
Members
47,511
Latest member
svareza

Latest Threads

Top