need help ...

N

Nikola

I need a function that reads from a txt file and randomly chooses a line
from which retrieves a string (with spaces) and returns it to main function.

thx
 
O

osmium

Nikola said:
I need a function that reads from a txt file and randomly chooses a line
from which retrieves a string (with spaces) and returns it to main
function.

I assume this is a student problem. Try breaking the problem down. You can
not choose a random line until you know how many lines there are. So you
must start by counting the lines. To count the lines you will have to read
the lines. To read the lines you will have to open the file. To open the
file you must know the file name. And so on. Writing things down in some
form can be a great help. The preferred method is called pseudocode but any
form at all is better than just fussing and flailing.
 
T

those who know me have no need of my name

in comp.lang.c i read:
I need a function that reads from a txt file and randomly chooses a line
from which retrieves a string (with spaces) and returns it to main function.

i bet you do. when is your homework due?
 
E

Eric Sosman

osmium said:
Nikola writes:



function.

I assume this is a student problem. Try breaking the problem down. You can
not choose a random line until you know how many lines there are.

There are perfectly good ways to choose a line at random
without knowing in advance how many there are. In fact, you can
choose a random sample of N lines without advance knowledge; the
case N==1 is particularly easy, but N>1 isn't difficult. See
Knuth "The Art of Computer Programming, Volume II: Seminumerical
Algorithms," section 3.4.2, Algorithm R.

(Well, I guess you do in fact need *some* advance knowledge.
Specifically, you need to know that the source of random numbers
can deliver a different value for every line in the file, without
repetitions. A PRNG that will deliver two billion distinct
"good" numbers is easy to write in portable C; four billion is
also easy at some small sacrifice of "goodness." So the advance
knowledge amounts to knowing that the file holds fewer than four
billion lines; I think this may be considered axiomatic in the
context of a simple homework assignment.)
 
N

Nikola

No, it's not a home work and yes, I am a begginer (weren't we all??) and
trying to do a little code of my own - was told that ambition is a virtue
:)
 
D

Default User

Nikola said:
No, it's not a home work and yes, I am a begginer (weren't we all??) and
trying to do a little code of my own - was told that ambition is a virtue
:)

Then you need to show us this "little code". Just giving us the problem
spec and requesting an answer doesn't really display much in the way of
ambition.

What have you figured out so far? What is your analysis of the problem?
Do you know how to open a file? Read out a line? How would you choose a
random line?





Brian Rodenborn
 
A

Arthur J. O'Dwyer

There are perfectly good ways to choose a line at random
without knowing in advance how many there are.
(Well, I guess you do in fact need *some* advance knowledge.
Specifically, you need to know that the source of random numbers
can deliver a different value for every line in the file, without
repetitions.

...Or, I think more generally, you can be satisfied with a RNG
that will return an unbiased random number in the range 1..n, for
every n in the range 1..N where N is the number of lines in the
file. That may be what you were getting at; after all, if the
RNG can deliver a number in the range 1..N, then in some sense it
must be able to deliver N different numbers "without repetition."
Of course, it doesn't need to generate its random numbers without
repeating itself --- that would be a pretty bad RNG! ;)

Okay, enough hints from me now. :)

-Arthur
 
N

Nikola

What have you figured out so far? What is your analysis of the problem?
Do you know how to open a file? Read out a line? How would you choose a
random line?

Well, nothing anyone said didn't help so I kept trying ...
If there are some noncence-mistakes in my code that no self-respectfull
programmer would make
keep in mind that I'm just 19, meaning, a begginer
this code should read the random line of an file
commented parts were just a unsuccessfull toughts....
it's not in english this is the translation of some words:

znak-character
br_redaka-number of lines
brojac-counter
pojam-randomly chosen line (string)

this is ment to be a function that returns a string ...

FILE *dat;
char pojam[50],c,znak;
int br_redaka=1,brojac=1,rand_br;
dat=fopen("film.txt","r");
while(c!=char(EOF))
{
c=getc(dat);
if(c==char(10)) //10 - ascii for new line
br_redaka=br_redaka+1;
}
//generiranje nasumicnog broja od 0 do broja redaka //moguci BUG zbog
nule!
srand(time(NULL));
rand_br=rand() % br_redaka;
c='b';
while(c!=EOF)
{
/*c=getc(dat);
if(c==char(10)&&brojac==rand_br) //10-ascii kod za novi redak
{
brojac=brojac+1; //rand_br od 0 do br_redaka
fgets(pojam,50,dat);
}*/
c=getc(dat);
if(c==char(10)&&brojac==rand_br)
{
brojac=brojac+1;
while(znak!='\\')
//strcat(pojam,znak);
pojam[brojac-2]=c;

}
}
printf("%s",pojam);
 
E

Eric Sosman

Arthur said:
...Or, I think more generally, you can be satisfied with a RNG
that will return an unbiased random number in the range 1..n, for
every n in the range 1..N where N is the number of lines in the
file. That may be what you were getting at; after all, if the
RNG can deliver a number in the range 1..N, then in some sense it
must be able to deliver N different numbers "without repetition."

The algorithm I'm thinking of actually needs N distinct
values. Simply knowing that the range of generated numbers
contains more than N values is not good enough, nor is knowing
that the period of the PRNG exceeds N. You actually need a
"no duplicates among the first N values" guarantee.

At the risk of becoming almost topical: rand() itself lacks
such a guarantee, but it would be possible to write a wrapper
around rand() to get this behavior.
Of course, it doesn't need to generate its random numbers without
repeating itself --- that would be a pretty bad RNG! ;)

I don't understand what this means.
 
N

Nick Austin

Well, nothing anyone said didn't help so I kept trying ...
If there are some noncence-mistakes in my code that no self-respectfull
programmer would make
keep in mind that I'm just 19, meaning, a begginer
this code should read the random line of an file
commented parts were just a unsuccessfull toughts....
it's not in english this is the translation of some words:

znak-character
br_redaka-number of lines
brojac-counter
pojam-randomly chosen line (string)

this is ment to be a function that returns a string ...

FILE *dat;
char pojam[50],c,znak;
int br_redaka=1,brojac=1,rand_br;
dat=fopen("film.txt","r");
while(c!=char(EOF))

I see three problems.

Firstly char(EOF) is a syntax error.
Secondly getc() returns an int, not char.
Thirdly object c is used before it has been initialised.

Also testing for EOF before calling getc() is bad style. It
results in the loop executing one too many times, although
in your case the result of this is harmless. It is better
to write this loop as:

while (c=getc(dat), c!=EOF)
{
/* ... */
}

This also ensures that object c is initialised before the
comparision.
{
c=getc(dat);
if(c==char(10)) //10 - ascii for new line

This is also bad style. It needlessly restricts your program to
only working on ASCII implementations. Since the file is open
in text mode you can write:

if (c=='\n')
br_redaka=br_redaka+1;
}
//generiranje nasumicnog broja od 0 do broja redaka //moguci BUG zbog
nule!
srand(time(NULL));
rand_br=rand() % br_redaka;

See Q13.16 in the FAQ:

http://www.eskimo.com/~scs/C-faq/q13.16.html

You also need to reset the file pointer here:

fseek( dat, 0, SEEK_SET );
c='b';
while(c!=EOF)
{
/*c=getc(dat);
if(c==char(10)&&brojac==rand_br) //10-ascii kod za novi redak
{
brojac=brojac+1; //rand_br od 0 do br_redaka
fgets(pojam,50,dat);
}*/
c=getc(dat);
if(c==char(10)&&brojac==rand_br)
{
brojac=brojac+1;
while(znak!='\\')
//strcat(pojam,znak);
pojam[brojac-2]=c;

This while loop would appear to be infinite. I don't think it's
even needed.

You need to null terminate the char array to turn it into a valid
string for display:

projam[brojac-1] = '\0';
printf("%s",pojam);

Nick.
 
M

Michael Wojcik

The algorithm I'm thinking of actually needs N distinct
values. Simply knowing that the range of generated numbers
contains more than N values is not good enough, nor is knowing
that the period of the PRNG exceeds N. You actually need a
"no duplicates among the first N values" guarantee.

The "reservoir sampling" algorithm, on the other hand, doesn't need
this guarantee. It just needs an unbiased PRNG (and since one can
be constructed from a biased PRNG, that's pretty easy to do). There
was a nice writeup on RS in _Dr Dobb's Journal_ a while back, but
it's not available free at the DDJ site. No doubt it could be found
on Google with little trouble.

I don't have Knuth v2 here to compare the algorithm you mention with
RS.
 
E

Eric Sosman

Michael said:
The "reservoir sampling" algorithm, on the other hand, doesn't need
this guarantee. It just needs an unbiased PRNG (and since one can
be constructed from a biased PRNG, that's pretty easy to do). There
was a nice writeup on RS in _Dr Dobb's Journal_ a while back, but
it's not available free at the DDJ site. No doubt it could be found
on Google with little trouble.

I don't have Knuth v2 here to compare the algorithm you mention with
RS.

Knuth's description of RS requires distinct values.
Each item enters the reservoir iff its associated random
value is greater than the smallest already present (and
bumps the smallest out). If equality occurs an ambiguity
arises, and it's not clear how to resolve the ambiguity
without introducing a bias.

I don't have DDJ here to compare the algorithm you
mention with RS ;-)
 
A

Arthur J. O'Dwyer

The algorithm I'm thinking of actually needs N distinct
values. [...]

What's wrong with:

N := 0
Line := ""
for each line L in file
if (Rand in [0..N] == 0) then
Line := L
print Line

Simple, easy to code, requires an unbiased RNG that can do random
numbers in the range [0..N] for all N up to the number of lines
in the file.

I don't understand what this means.

Given an RNG with that property, generating numbers between 0 and
65535, and being able to see the first 60000 outputs or so, I'd have
a pretty good edge on predicting the next 5000 outputs. :) It's good
for whatever application you're thinking of, but it would not be a good
idea to make a 'rand()' implementation with this property. ;-)

-Arthur
 
E

Eric Sosman

Arthur said:
The algorithm I'm thinking of actually needs N distinct
values. [...]


What's wrong with:

N := 0
Line := ""
for each line L in file
if (Rand in [0..N] == 0) then
Line := L
print Line

Simple, easy to code, requires an unbiased RNG that can do random
numbers in the range [0..N] for all N up to the number of lines
in the file.

I'm either failing to understand your pseudocode, or
something's missing. The way I read the above, it just
prints the file's final line, because `Rand in [0..0]'
is (I'm guessing) always equal to zero. Is there supposed
to be some adjustment of `N' along the way?

It's also possible you and I are attacking slightly
different problems. I'm trying to choose an unbiased
sample of K lines (K == 1 in the O.P.'s question) from a
file containing N lines (assuming N >= K and N <= some
very large number depending on the PRNG implementation),
in one pass over the file and without foreknowledge of N.
 
M

Michael Wojcik

Knuth's description of RS requires distinct values.
Each item enters the reservoir iff its associated random
value is greater than the smallest already present (and
bumps the smallest out). If equality occurs an ambiguity
arises, and it's not clear how to resolve the ambiguity
without introducing a bias.

I figured it must be something like that, based on the requirement
you stated.
I don't have DDJ here to compare the algorithm you
mention with RS ;-)

I really should dig out the issue in question to provide a complete
description, but basically what it does is:

- Assume the domain being sampled contains at least as many
items (N) as the sample size (M). N is not known ahead of time.

- Populate the reservoir with the first M items from the domain

- Each remaining item is tested as a candidate for inclusion by
generating a random integer value between 0 and I-1, where I is the
total number of items read thus far. If the random value is < M, the
item in that slot in the reservoir is replaced by the current item.
(I hope that's right...)

Items read from the domain early on have a high probability of
entering the reservoir (it's 1 for the first M items), but they
undergo more tests and so are more likely to be bumped than items
that are added later.

The final item has exactly M in N (ie M/N) probability of being
added to the reservoir, since it undergoes only its initial test
for candidacy, and when it does I == N.

For all other items, the cumulative probability that they'll get
into the reservoir, and stay there, also works out to M/N.

For example:

Domain series {a,b,c,d}, reservoir size M=1.

1. a gets into the reservoir with probability 1.
2. a survives b's test with probability 1/2.
3. a survives c's test with probability 2/3 (c has a 1/3 chance).
4. a survives d's test with probability 3/4.

a's cumulative chance: (1)(1/2)(2/3)(3/4) = 6/24 = 1/4.

This algorithm just requires that you have an unbiased generator
that can provide intergers in the range [0,N-1].
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,142
Messages
2,570,819
Members
47,367
Latest member
mahdiharooniir

Latest Threads

Top