File Merge

  • Thread starter Michael R. Copeland

Michael R. Copeland

I'm writing an application that requires an "intelligent merge" of 2
files. That is, equal data has a "preferred source" that I want to
write out. What I have works, I believe, but it seems horribly
cumbersome (having to set the input variables to ""...). Is there a
better way? TIA

while ((!feof(wf3)) || (!feof(wf1)))
{
    if (!feof(wf1))
    {
        strcpy(WDBRec, "");
        if (fgets(WDBRec, sizeof(WDBRec), wf1) != NULL)
            nic++;
    }
    if (!feof(wf3))
    {
        strcpy(DBERec, "");
        if (fgets(DBERec, sizeof(DBERec), wf3) != NULL)
            bic++;
    }
    dbeBib = atoi(copy(DBERec, 1, 5));
    wdbBib = atoi(copy(WDBRec, 1, 5));
    if (wdbBib == dbeBib) // records match - defer to old data
    {
        if (dbeBib > 0)
        {
            writeToDBE(DBERec); buc++;
        }
    }
    if (wdbBib < dbeBib) // work file data is new - write it
    {
        if (wdbBib > 0)
        {
            writeToDBE(WDBRec); nuc++;
        }
        else
        {
            if (dbeBib > 0)
            {
                writeToDBE(DBERec); buc++;
            }
        }
    }
    if (wdbBib > dbeBib) // prevailing old data - write it out
    {
        if (dbeBib > 0)
        {
            writeToDBE(DBERec); buc++;
        }
    }
} // while
 

Martijn

Hi,
> I'm writing an application that requires an "intelligent merge" of
> 2 files. That is, equal data has a "preferred source" that I want to
> write out. What I have works, I believe, but it seems horribly
> cumbersome (having to set the input variables to ""...). Is there a
> better way? TIA
>
> while ((!feof(wf3)) || (!feof(wf1)))
> {
> if (!feof(wf1))
> {
> strcpy(WDBRec, "");
> if (fgets(WDBRec, sizeof(WDBRec), wf1) != NULL)
> nic++;
> }

[snipped]

Firstly: your indentation has some room for improvement. Secondly, you are
inconsistent with your syntax of single-line if bodies. But here is my
actual reply:

Assuming you are not working with UNICODE, you can simplify

strcpy(WDBRec, "");

to

WDBRec[0] = '\0';

The code doesn't look that cumbersome. You could rewrite it, which may make
it a little bit more straightforward:

    if (fgets(WDBRec, sizeof(WDBRec), wf1) != NULL)
    {
        nic++;
        wdbBib = atoi(copy(WDBRec, 1, 5));
    }
    else
    {
        wdbBib = 0;
    }

Somewhere further down in your code you use:

    if (wdbBib > 0)
    {
        writeToDBE(WDBRec); nuc++;
    }

This will take care of not writing out an invalid string (because WDBRec
will still contain the previous value).

I might have left some loose ties, but this should help you along.

Good luck!
 

CBFalconer

Martijn said:
.... snip ...

Firstly: your indentation has some room for improvement.
Secondly, you are inconsistent with your syntax of single-line
if bodies. But here is my actual reply:

His indentation was fine here. This indicates that indentation
swallowing is taking place somewhere on the path to you, but not to
me. It could be your newsreader, which appears to be a Microsoft
excrescence.

You neglected to attribute the portion you quoted. Please don't do
that.
 

Chris Croughton

His indentation was fine here. This indicates that indentation
swallowing is taking place somewhere on the path to you, but not to
me. It could be your newsreader, which appears to be a Microsoft
excrescence.

The original used tabs. Nasty things, they can expand to any number of
spaces including none depending on the system (for me they expanded to 8
spaces, which is excessive but just bearable in that example).

Chris C
 

Eric Sosman

Michael said:
I'm writing an application that requires an "intelligent merge" of 2
files. That is, equal data has a "preferred source" that I want to
write out. What I have works, I believe, but it seems horribly
cumbersome (having to set the input variables to ""...). Is there a
better way? TIA

while ((!feof(wf3)) || (!feof(wf1)))

This is the wrong way to check for end-of-file. Please
see Question 12.2 in the comp.lang.c Frequently Asked Questions
(FAQ) list at

http://www.eskimo.com/~scs/C-faq/top.html
{
if (!feof(wf1))

Ditto.
{
strcpy(WDBRec, "");

This looks pointless. You're about to overwrite the contents
of WDBRec by using fgets() on it, so why do you care what's in it
beforehand? Perhaps this is an attempt to rescue the situation
after the unreliable end-of-file test -- if so, once you fix the
test you won't need this any more.

By the way, you didn't show us what WDBRec is. From the way
you're using it, it should be an array of char; a pointer to a
malloc'ed area would not work here.
if (fgets(WDBRec, sizeof(WDBRec), wf1) != NULL)
nic++;
}
if (!feof(wf3))
{
strcpy(DBERec, "");
if (fgets(DBERec, sizeof(DBERec), wf3) != NULL)
bic++;
}
dbeBib = atoi(copy(DBERec, 1, 5));

You haven't shown us what copy() is. I'm going to assume
that it copies the second through sixth characters (that is,
array elements [1] through [5]) into a six-char array somewhere
and appends a '\0'. Whether this works depends a lot on the
location and nature of that intermediate six-char array; see
Question 7.5 for a description of one all-too-frequent error.
(By the way, if fgets() didn't read anything, the second through
sixth characters will be the leftovers from the record prior to
the current one, if any.)

Despite its suggestive name, atoi() is not a very good way
to convert decimal strings to integers, not unless you're very
trusting of the source. The problem is that it will happily
convert "123x5" to 123 and give no indication that the input
is in any way strange. It won't even detect "xyzzy" as in any
way peculiar (indeed, its behavior on "xyzzy" is completely
unpredictable). So unless you are very, very sure that the
input is valid, atoi() is a poor way to convert it. There are
at least three superior ways to proceed:

  - Use strtol(), because it will do the conversion *and*
    report any oddities it finds, in a predictable way.

  - Use sscanf(). It's a little bit trickier than it looks,
    but allows you to do without the copy() stuff:

        if (sscanf(DBERec+1, "%5d%n", &dbeBib, &len) == 1
            && len == 5) { /* all's well */ } else { /* bad input */ }

    The "%5d" converts no more than five digits (in case
    additional digits follow the field of interest). The
    "%n" tells you how many characters were actually consumed
    (it will set len to 3 if the input was "123xy"). And
    if the "%5d" finds no digits at all ("xyzzy") sscanf()
    will stop and return zero.

  - If you're really sure the respective fields contain digits,
    you can compare them as characters without converting at
    all by using memcmp(DBERec+1, WDBRec+1, 5). However, this
    may cause some surprises with non-digits: for example,
    "01234" and " 1234" will be treated as unequal, and "-1234"
    will be treated as less than "-9999". You'll have to decide
    whether this is appropriate for your application.
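
A minimal sketch of the strtol() option, for illustration only. The helper name parse_field() and its parameters are hypothetical (nothing like it appears in the posted code), and it assumes the field must consist of exactly `width` decimal digits with no sign or padding:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse exactly `width` decimal digits starting at
 * rec[start]. Returns 1 and stores the value in *out on success;
 * returns 0 and leaves *out untouched on any bad input. */
static int parse_field(const char *rec, size_t start, size_t width, long *out)
{
    char buf[16];
    char *end;
    long val;

    assert(rec != NULL && out != NULL);
    if (width == 0 || width >= sizeof buf || strlen(rec) < start + width)
        return 0;                       /* field missing or too wide */
    memcpy(buf, rec + start, width);
    buf[width] = '\0';

    if (!isdigit((unsigned char)buf[0]))
        return 0;                       /* reject leading space or sign */
    errno = 0;
    val = strtol(buf, &end, 10);
    if (errno != 0 || end != buf + width)
        return 0;                       /* overflow, or junk inside the field */
    *out = val;
    return 1;
}
```

With this, parse_field(DBERec, 1, 5, &bib) would accept "12345" but reject both "123x5" and "xyzzy", which atoi() cannot do.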
wdbBib = atoi(copy(WDBRec, 1, 5));
if (wdbBib == dbeBib) // records match - defer to old data
{
if (dbeBib > 0)

I'm not sure what this test is for, unless perhaps it's
part of the rescue attempt for the incorrect end-of-file test.
If there really are actual non-positive numbers in the input,
it looks like this will eliminate them from the output. But
if you've got purely digit fields that can't be negative (though
"00000" would, of course, be zero), I think this test and the
others like it can simply go away once you fix the EOF handling.
{
writeToDBE(DBERec); buc++;
}
}
if (wdbBib < dbeBib) // work file data is new - write it
{
if (wdbBib > 0)
{
writeToDBE(WDBRec); nuc++;
}
else
{
if (dbeBib > 0)
{
writeToDBE(DBERec); buc++;
}
}
}
if (wdbBib > dbeBib)// prevailing old data - write it out
{
if (dbeBib > 0)
{
writeToDBE(DBERec); buc++;
}
}
} // while

You say you believe this works, but one thing that strikes
me as strange is that you read new input from *both* files every
time through the loop. (Until the botched EOF detection kicks
in, of course.) That doesn't seem right at all: If you get the
sequence "11111" "33333" "55555" from WDB while DBE provides
"22222" "44444", I'd expect you'd want to see all five of these
in the output -- but that's not what you're doing, and I'm not
sure whether it's accidental or intentional. Take another look.
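
For comparison, here is a rough sketch of the conventional two-way merge, which reads a new record only from the file whose record was just written. It is not the poster's code: merge_files(), next_rec(), bib_of() and RECLEN are hypothetical names, and it assumes both inputs are sorted by ascending bib number, every record fits in RECLEN characters, and the key sits in array elements [1] through [5].

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECLEN 256   /* assumed maximum record length, including '\n' */

/* Hypothetical key extractor: bib number assumed to be in rec[1]..rec[5]. */
static long bib_of(const char *rec)
{
    char key[6] = "";
    if (strlen(rec) >= 6)
        memcpy(key, rec + 1, 5);
    return strtol(key, NULL, 10);
}

/* Read the next record; 1 on success, 0 on end-of-file or read error. */
static int next_rec(FILE *fp, char *rec)
{
    return fgets(rec, RECLEN, fp) != NULL;
}

/* Merge `work` and `old` into `out`. On equal keys the record from
 * `old` is written and the one from `work` is dropped.
 * Returns the number of records written. */
static long merge_files(FILE *work, FILE *old, FILE *out)
{
    char wrec[RECLEN], orec[RECLEN];
    int have_w, have_o;
    long written = 0;

    assert(work != NULL && old != NULL && out != NULL);
    have_w = next_rec(work, wrec);
    have_o = next_rec(old, orec);

    while (have_w || have_o) {
        long wk = have_w ? bib_of(wrec) : 0;
        long ok = have_o ? bib_of(orec) : 0;

        if (have_w && (!have_o || wk < ok)) {
            fputs(wrec, out);                   /* only in the work file */
            have_w = next_rec(work, wrec);
        } else {
            fputs(orec, out);                   /* old data prevails */
            if (have_w && wk == ok)
                have_w = next_rec(work, wrec);  /* skip the duplicate */
            have_o = next_rec(old, orec);
        }
        written++;
    }
    return written;
}
```

Because the have_w and have_o flags record whether each buffer holds a live record, there is no need to clear the buffers or to test the key against zero; a real program would also check ferror() on both inputs after the loop.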
 

Alan Balmer

His indentation was fine here. This indicates that indentation
swallowing is taking place somewhere on the path to you, but not to
me. It could be your newsreader, which appears to be a Microsoft
excrescence.

The OP used tabs instead of spaces. To the OP: Don't do that.
 

Martijn

>> Firstly: your indentation has some room for improvement.
>
> His indentation was fine here. This indicates that indentation
> swallowing is taking place somewhere on the path to you, but not to
> me. It could be your newsreader, which appears to be a Microsoft
> excrescence.

You make it sound like it's a bad thing ;) Given your experience you should
have no problem confirming that fact by looking at the headers.

> You neglected to attribute the portion you quoted. Please don't do
> that.

I am aware of your pedantic approach towards other people's posting habits
(whether that's a good thing or a bad thing is not up to me), but you lost
me here. Could you rephrase your comment for a non-native English speaking
individual like myself?

But what did you think of the post content-wise?
 

Alan Balmer

> You make it sound like it's a bad thing ;)

Yes.

> Given your experience you should
> have no problem confirming that fact by looking at the headers.

I suppose that's how he knew.

> I am aware of your pedantic approach towards other people's posting habits
> (whether that's a good thing or a bad thing is not up to me), but you lost
> me here. Could you rephrase your comment for a non-native English speaking
> individual like myself?

You quoted parts of other people's writings without any indication of
who the writer was (attribution.) You did it again in this post. Look
at the top of this reply for an example of an attribution (to you).
Look at nearly any other post in this forum for other examples.

A better newsreader would help. Failing that, you should look for a
program called OE-Quotefix, which reportedly makes Outlook Express act
somewhat like a real newsreader.
 

Flash Gordon

Martijn wrote:

I am aware of your pedantic approach towards other people's posting habits
(whether that's a good thing or a bad thing is not up to me), but you lost
me here. Could you rephrase your comment for a non-native English speaking
individual like myself?

<snip>

See at the top where it says "Martijn wrote:"? That's called an
attribution. It tells you who wrote the quoted text. You are deleting
all the attributions so I don't know who wrote, "You neglected to..."

So please leave in the bits saying who wrote what (the attributions) for
all the text still quoted.
 

Martijn

[snipped]

> You quoted parts of other people's writings without any indication of
> who the writer was (attribution.) You did it again in this post. Look
> at the top of this reply for an example of an attribution (to you).

Duly noted, thanks for the clarification. I did it again because such is
(or was) my way of doing it, no harm intended. I'll change it, np.

> Look at nearly any other post in this forum for other examples.

I'll take your word for it.

> A better newsreader would help. Failing that, you should look for a
> program called OE-Quotefix, which reportedly makes Outlook Express act
> somewhat like a real newsreader.

I did, I use it, and it does.

Still, everyone (except Eric and me) is avoiding the subject of the OP's
message. But then again, all these OT threads are much more fun, right ? :p
Unless netiquette or posting conventions all of a sudden have become
on-subject in this group.
 

Michael R. Copeland

>> while ((!feof(wf3)) || (!feof(wf1)))
>
> This is the wrong way to check for end-of-file. Please
> see Question 12.2 in the comp.lang.c Frequently Asked Questions
> (FAQ) list at
>
> http://www.eskimo.com/~scs/C-faq/top.html

Indeed, but that's why I wanted suggestions on how I can improve this
kludgey logic. Regardless of the normal method to read a file, I
couldn't work out a clean and simple way to loop through both files
while performing the merge. I guess I'd need to see how the
conventional logic applies to my particular problem (and I couldn't find
anything on google on this...).

> This looks pointless. You're about to overwrite the contents
> of WDBRec by using fgets() on it, so why do you care what's in it
> beforehand? Perhaps this is an attempt to rescue the situation
> after the unreliable end-of-file test -- if so, once you fix the
> test you won't need this any more.

The point of this is to ensure that the "atoi" conversion that
follows doesn't produce a valid number from data still sitting in the
buffer after the file has been read to the end. Without it, I was getting
the final record from one of the files written to the output twice! It's
gotten to the point that adding logic/tweaking was getting me more and
more away from a clean solution.

> By the way, you didn't show us what WDBRec is. From the way
> you're using it, it should be an array of char; a pointer to a
> malloc'ed area would not work here.

It's a character array (as it should be)...

>> if (fgets(WDBRec, sizeof(WDBRec), wf1) != NULL)
>> nic++;
>> }
>> if (!feof(wf3))
>> {
>> strcpy(DBERec, "");
>> if (fgets(DBERec, sizeof(DBERec), wf3) != NULL)
>> bic++;
>> }
>> dbeBib = atoi(copy(DBERec, 1, 5));
>
> You haven't shown us what copy() is. I'm going to assume
> that it copies the second through sixth characters (that is,
> array elements [1] through [5]) into a six-char array somewhere
> and appends a '\0'. Whether this works depends a lot on the
> location and nature of that intermediate six-char array; see
> Question 7.5 for a description of one all-too-frequent error.

No, it is a "substr" function that copies characters 1-5 of the input.
I didn't include some stuff I felt was extraneous to the logic, sorry...
 

Eric Sosman

Michael said:
Indeed, but that's why I wanted suggestions how I can improve this
kludgey logic. Regardless of the normal method to read a file, I
couldn't work out a clean and simple way to loop through both files
while performing the merge. I guess I'd need to see how the
conventional logic applies to my particular problem (and I couldn't find
anything on google on this...).

Hmmm. The FAQ seems pretty clear -- but then again, I
have the advantage of already knowing the answer. "It's
elementary," as Sherlock always says *after* explaining how
he figured it out.

Okay: End-of-input is only detected by attempting a read
and having it fail. There's no way to ask "Is there any input
left?" without actually attempting an input operation. This
may seem a silly restriction in connection with disk files,
but consider the situation when input is coming from a keyboard
or a network socket or something of the kind: There is no way
to predict that the user is about to strike ^D or ^Z or whatever
the local end-of-input key sequence is, nor is there any way to
predict that the other end of your socket connection is about
to hang up the phone on you. You cannot know what happened
until it actually happens -- so the only way to know that you
have reached end-of-input is to try to read something and get
a failure.

Now, end-of-file is only one of the reasons an input attempt
might fail: for example, input from a disk could fail in the
event of a head crash, or input from a keyboard could fail if
you spilled Coke Classic into the mechanism and shorted it out
with caramelized sugar. Most of C's input functions report a
kind of "generalized failure" no matter what the cause -- and
the *only* reason feof() exists is to let you figure out that
cause. If the function could read no more input because it
detected end-of-file, feof() will be true; if feof() is false,
the failure was something like a bad disk sector (and ferror()
will be true).

Putting this all together, you should write code that looks
something like

    if (fgets(WDBRec, sizeof(WDBRec), wf1) == NULL) {
        /* Woops! Couldn't get any more input. Why not? */
        if (feof(wf1)) {
            /* Aha! We've reached end-of-file, and should
             * remember the fact and not try to read any more.
             */
        }
        else {
            /* Oh, woe! Oh, woe! The Nazgul have eaten
             * the disk controller, and are even now taking
             * the Token Ring back to Sauron! If I were to
             * test ferror(wf1) at this point it would return
             * true, confirming my worst fears.
             */
        }
    }
    else {
        /* Whoopee! I got some input data! */
    }

I hope you understand by now that it is pointless to call
feof() or ferror() *before* an input operation; they only make
sense after an input operation has already failed.
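
Applied to a whole file, the same rule gives the usual read loop: the fgets() call itself is the loop test, and ferror() is consulted only once the loop has ended. The sketch below is illustrative; count_records() is a hypothetical name, and it assumes every record fits in the 256-character buffer (a longer line would be counted more than once).

```c
#include <assert.h>
#include <stdio.h>

/* Count the records in fp. Returns the count, or -1 if the loop was
 * stopped by a read error rather than by end-of-file. */
static long count_records(FILE *fp)
{
    char buf[256];
    long n = 0;

    assert(fp != NULL);
    while (fgets(buf, sizeof buf, fp) != NULL)   /* the read is the test */
        n++;
    if (ferror(fp))                              /* why did fgets() fail? */
        return -1;
    return n;                                    /* plain end-of-file */
}
```

Note that neither feof() nor ferror() is called before a read has failed.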
>> You haven't shown us what copy() is. I'm going to assume
>> that it copies the second through sixth characters (that is,
>> array elements [1] through [5]) into a six-char array somewhere
>> and appends a '\0'. Whether this works depends a lot on the
>> location and nature of that intermediate six-char array; see
>> Question 7.5 for a description of one all-too-frequent error.
>
> No, it is a "substr" function that copies character 1-5 of
> the input. I didn't include some stuff I felt was extraneous
> to the logic, sorry...

Um, er, that's exactly how I described it, is it not? And
all the things I said about it (including the Q7.5 reference and
the stuff about repeating the final record indefinitely after
EOF) still hold.
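
On the Question 7.5 point: one way to sidestep the "pointer to a local array" trap is to have the caller supply the destination buffer. The poster's copy() was never shown, so copy_field() below is only a hypothetical stand-in with an assumed zero-based start index, not a reconstruction of the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for copy(): copies up to `len` characters
 * starting at src[start] into dst (capacity dstsize), always
 * terminating the result with '\0'. Returns dst. */
static char *copy_field(char *dst, size_t dstsize,
                        const char *src, size_t start, size_t len)
{
    size_t srclen;

    assert(dst != NULL && src != NULL && dstsize > 0);
    srclen = strlen(src);
    if (start > srclen)
        start = srclen;                 /* nothing to copy */
    if (len > srclen - start)
        len = srclen - start;           /* stop at end of source */
    if (len > dstsize - 1)
        len = dstsize - 1;              /* leave room for '\0' */
    memcpy(dst, src + start, len);
    dst[len] = '\0';
    return dst;
}
```

A caller would declare something like char key[6]; and pass it in, so the storage outlives the call. This does nothing about stale data left in the record buffer after a failed fgets(); that still has to be handled by checking the return value of the read.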
 
