Help: String search in Windows 2000 doesn't find text in Windows

Barry Millman

Hi:

I am using Perl 5 (I believe both machines are using ActivePERL 5) on
two machines with the same data files. One machine is Win 2000 the
other is Win XP. The files are MS Word 2000 documents e-mailed
(manually) from the Win 2000 machine to the XP machine.

The program searches the MS Word Files (both created with MS Word 2000)
for the word HYPERLINK. The format for the HYPERLINK that I am
searching for in the document is:

HYPERLINK "mydoc.doc"

(I checked this on the XP machine in Notepad and it is OK.)

PROBLEM: The program works on the Windows 2000 machine, but does not
find the files on the Win XP machine.

The code that is not finding the text on the Win XP machine (same as
the Win 2000 machine which does find the text) is:

----------- start actual code segment --------------------
while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple matches

{
$fndxx = $1;

$fndxx =~ s/\"//; # remove leading quote
$fndxx =~ s/\s+//; # remove leading spaces
$dir="C:\\IGINproducts\\UserDocuments\\";

$fullname = ($dir . $fndxx);
$date_string = "Cannot Find";
if (-e $fullname) { $date_string = ctime(stat($dir . $fndxx)->mtime); } #last update date of that file
print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file), "\n") ;
$matches += 1; # count matches

} #end while HYPERLINK
----------- end actual code segment --------------------

The output for a found HYPERLINK should look like this (it does on the
Win 2000 machine):

mydoc.doc,(date of last update), in: otherdoc.doc

On Win XP, the program cannot even find the word HYPERLINK (if I modify
the code to just search for that). The directories are valid, I can
have the program print a list of all files as it processes them.

If I try this with a test program (the string to test is in the program
itself ) it works fine on the XP machine.

There are no encryption issues, nor any file or directory problems.

I would really appreciate any comments or suggestions about what I am
doing wrong.

Thanks,

Barry Millman
 
Barry Millman

Just some added info:

The search works fine if I save the MS Word files as RTF.

Also I wanted to mention that I have this around the hyperlink search code:
#open the file
open(INFILE,"< $file") or die "Couldn't open file ",$file;


while(<INFILE>)
{
# the hyperlink code I posted earlier
} # end while infile

Barry
 
Tad McClellan

Barry Millman said:
The format for the HYPERLINK that I am
searching for in the document is:

HYPERLINK "mydoc.doc"

PROBLEM: The program works on the Windows 2000 machine, but does not
find the files on the Win XP machine.


I don't think I can help with that part, but the code is too hokey
to just let it pass...

----------- start actual code segment --------------------
while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple matches


The //m does not do anything, so why is it there?

It changes the meaning of ^ and $, but you don't use those
anchors in your pattern, so you don't need //m.

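.{1,80}?

is not the same as

.{0,80}

The trailing ? makes the quantifier non-greedy: it still needs at
least one character, but takes as few as it can, so it will not
match ' .doc'.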
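
A small sketch for comparing the two quantifiers directly. The sample string is made up for illustration; the only point is that a `?` after `{1,80}` makes the quantifier non-greedy rather than optional, so the two forms can capture very different text on a line holding more than one link.

```perl
use strict;
use warnings;

# Hypothetical input line containing two links.
my $line = 'HYPERLINK "a.doc" and HYPERLINK "b.doc"';

# Non-greedy: 1 to 80 characters, as few as possible before ".doc".
my @lazy = $line =~ /HYPERLINK(\s+.{1,80}?\.doc)/g;

# Greedy: 0 to 80 characters, as many as possible before ".doc".
my @greedy = $line =~ /HYPERLINK(\s+.{0,80}\.doc)/g;

print scalar(@lazy),   " non-greedy match(es)\n";
print scalar(@greedy), " greedy match(es)\n";
print "[$_]\n" for @lazy, @greedy;
```

On this sample the non-greedy form stops at each `.doc` in turn, while the greedy form runs on to the last `.doc` on the line.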


We can't help you analyse why the match is failing because we
need two things to do that: the pattern and the string that
the pattern is to be matched against.

We have only one of those two things...

{
$fndxx = $1;

$fndxx =~ s/\"//; # remove leading quote
$fndxx =~ s/\s+//; # remove leading spaces


Why capture them only to strip them out of the captured string?

Why not just leave them out of the capture in the first place?


while (/HYPERLINK\s+"(.{1,78}\.doc)"/gi)

or, probably better:

while (/HYPERLINK\s+"([^"]{1,78}\.doc)"/gi)
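
A minimal sketch of the capture-only-what-you-need idea, assuming the closing quote sits outside the parentheses so it does not end up in `$1`. The input string is hypothetical sample data, not taken from a real Word file.

```perl
use strict;
use warnings;

# Hypothetical text with two links, one of them with extra spacing.
my $text = 'see HYPERLINK "mydoc.doc" and HYPERLINK   "other file.doc" here';

my @names;
while ($text =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
    push @names, $1;    # $1 holds only the file name, no quotes or spaces
}

print "$_\n" for @names;
```

With the whitespace and quotes matched outside the capture group, the two `s///` clean-up steps are no longer needed.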

$dir="C:\\IGINproducts\\UserDocuments\\";


Use single quotes unless you want to make use of one of the two
extra things that double quotes give you (interpolation
and backslash escapes).

Use forward slashes instead of silly slashes unless the path
is going to be fed to the "command interpreter".


$dir='C:/IGINproducts/UserDocuments/';
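
A short illustration of the quoting advice, using the directory name from the posted code. It only compares strings and does not touch the file system, so whether the directory exists is not assumed.

```perl
use strict;
use warnings;

# Double quotes process backslash escapes, so every backslash is doubled.
my $double = "C:\\IGINproducts\\UserDocuments\\";

# Single quotes leave backslashes alone, except that a backslash
# directly before the closing quote still has to be doubled.
my $single = 'C:\IGINproducts\UserDocuments\\';

# Forward slashes need no escaping in either kind of string.
my $slash = 'C:/IGINproducts/UserDocuments/';

print "single and double forms are equal\n" if $single eq $double;
print $slash . 'mydoc.doc', "\n";
```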

print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file), "\n") ;


Gak!

Use double quoted strings to concatenate your output string:

print(OUTFILE "$fndxx,$date_string, in: ", basename($file), "\n") ;

If I try this with a test program (the string to test is in the program
itself ) it works fine on the XP machine.


If you had shown us your complete test program, then we could
have helped you debug it.

But you didn't, so we can't. (hint)

I would really appreciate any comments or suggestions about what I am
doing wrong.


Not posting a short and complete program that we can run that
illustrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?
 
Barry Millman

Hi:

I tried your suggestions, but no luck. I did move that directory
assignment outside the loop. Stupid of me!

There is something really odd in MS Word storage in Win XP. If I save
the document to RTF it finds the stuff in the RTF file.

I looked at both the MS Word and RTF files with the XVI32 Hex editor.
They both showed the same hex values for the string HYPERLINK.

Barry
 
Barry Millman

OK. Sorry about the bad code. However, let's reduce this to the
minimum, removing the search for the text. All we will do is read
chunks of data, with this program:

-------------------- start of program --------------------------
open (TEST, "c:\\PERL\\Barry\\Starthere.rtf") || die "File Open Failed: $!";

while (<TEST>)
{

print( "Chunk length: ", length($_),"\n");
$chunks += 1;
}

close (TEST) || die "File Close Failed $!";

print( $chunks, " Chunks\n");
-------------------- end of program --------------------------

Now, if I run this using Starthere.rtf, I get 1544 Chunks and they have
all sorts of different lengths. Some of the first chunks are of length:
103, 218, 250,1,230,63, 255.

However, if I run this using Starthere.doc, I get only ONE chunk, and it
is of length 6 bytes.

If I examine the MS Word file using a Hex editor, I get the following
values for bytes 5 through 7 (counting the first byte as zero):
B1 1A E1

The 1A is the seventh byte of the file.

The PERL program (above) seems to stop at this character.

So forgetting about the search, does this yield any clues?

Thank you,

Barry




 
Bob Walton

Barry said:
Hi:

I am using Perl 5 (I believe both machines are using ActivePERL 5)
on two machines with the same data files. One machine is Win 2000 the
other is Win XP. The files are MS Word 2000 documents e-mailed
(manually) from the Win 2000 machine to the XP machine.

The program searches the MS Word Files (both created with MS Word
2000) for the word HYPERLINK. The format for the HYPERLINK that I am
searching for in the document is:

HYPERLINK "mydoc.doc"

(I checked this on the XP machine in Notepad and it is OK.)

Note that MS Word documents are stored in a proprietary binary
gibberish format. To assume that a given word in a document will
actually always be stored in an ASCII string in the .doc file is
assuming too much. For example, perhaps it is stored in Unicode?
And maybe newer Notepad versions understand enough to present
Unicode strings? Try looking at your files with an editor that
you *know* won't munge the contents. I suggest VIM.
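
One way to test the Unicode guess is to search the raw bytes for both an 8-bit and a UTF-16LE form of the word. This is only a sketch: the byte string below is made-up sample data standing in for a file's contents, and that a given .doc file uses UTF-16LE at all is an assumption to check, not a fact about the OP's files.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical bytes standing in for part of a .doc file.
my $bytes = "\x00\x01junk"
          . encode('UTF-16LE', 'HYPERLINK "mydoc.doc"')
          . "\x1A\x00";

my $ascii_needle = 'HYPERLINK';
my $utf16_needle = encode('UTF-16LE', 'HYPERLINK');   # H\0Y\0P\0...

my $ascii_pos = index($bytes, $ascii_needle);
my $utf16_pos = index($bytes, $utf16_needle);

printf "8-bit search:  %s\n", $ascii_pos >= 0 ? "found at $ascii_pos" : 'not found';
printf "UTF-16 search: %s\n", $utf16_pos >= 0 ? "found at $utf16_pos" : 'not found';
```

If the 8-bit search fails on a real file while the UTF-16 search succeeds, the regex would need to match the wide form instead.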

It is a mystery why a document would get changed while emailing
it from one system to another. Or did you perhaps open the
document with Word after emailing it, and then save it? You
don't say. Is it the same version of Word? And what email
system are you using on each of the computers? Does the same
thing happen if you zip the file, email the zipped version, and
unzip it on the other system?

PROBLEM: The program works on the Windows 2000 machine, but does not
find the files on the Win XP machine.

The code that is not finding the text on the Win XP machine (same as
the Win 2000 machine which does find the text) is:

----------- start actual code segment --------------------
while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple matches

As others have mentioned, the /m modifier does nothing. Note that
.{1,80}? is a non-greedy match of 1 to 80 characters, so it is not
equivalent to .{0,80}.

{
$fndxx = $1;

$fndxx =~ s/\"//; # remove leading quote

Your comment doesn't match the regex -- it will remove the first
quote, not a leading quote.

$fndxx =~ s/\s+//; # remove leading spaces

Again, this will remove the first run of whitespace from the
string, not leading whitespace.
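
The difference between "first" and "leading" can be seen with a sample string that has no leading quote or space. The string is hypothetical; the anchored form shown is one possible way to strip only a leading quote and whitespace, not necessarily what the OP intended.

```perl
use strict;
use warnings;

# Hypothetical captured text with no leading quote or whitespace.
my $sample = 'report "final".doc';

# Unanchored, as in the posted code: removes the first quote and the
# first run of whitespace, wherever they occur in the string.
(my $unanchored = $sample) =~ s/\"//;
$unanchored =~ s/\s+//;

# Anchored with ^: only touches whitespace and a quote at the very start.
(my $anchored = $sample) =~ s/^\s*"//;

print "unanchored: [$unanchored]\n";
print "anchored:   [$anchored]\n";
```

The unanchored substitutions alter the middle of the name, while the anchored one leaves this string unchanged.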

$dir="C:\\IGINproducts\\UserDocuments\\";

$fullname = ($dir . $fndxx);
$date_string = "Cannot Find";
if (-e $fullname) { $date_string = ctime(stat($dir . $fndxx)->mtime); } #last update date of that file
print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file), "\n") ;
$matches += 1; # count matches

} #end while HYPERLINK
----------- end actual code segment --------------------

The output for a found HYPERLINK should look like this (it does on the
Win 2000 machine):

mydoc.doc,(date of last update), in: otherdoc.doc

On Win XP, the program cannot even find the word HYPERLINK (if I modify
the code to just search for that). The directories are valid, I can
have the program print a list of all files as it processes them.

If I try this with a test program (the string to test is in the program
itself ) it works fine on the XP machine.

There are no encryption issues, nor any file or directory problems.

How exactly do you know this? Using a piece of garbage like
Notepad won't definitively tell you this. I would trust Perl
much further than Notepad.
....
 
foo bar baz qux

Purl said:
Purl Gurl wrote:

Isn't talking to yourself the first sign?

I have looked over Word Perfect and MS Word but not RTF formats, on a
9.x machine, a 2K machine and an XP machine.

Somewhat irrelevant because the OP wrote " The files are MS Word 2000
documents e-mailed (manually) from the Win 2000 machine to the XP
machine."


A hex editor will display plaintext format, if in a binary file. I use
Hex Workshop v. 2.2x for this. Very old program but works with
excellence. You could simply open your Word document with a
hex editor, then search for http: from there.

Pay attention Kira, the OP already wrote "I looked at both the MS Word
and RTF files with the XVI32 Hex editor. They both showed the same hex
values for the string HYPERLINK."


It's so sad to see an old rusty V8 that's only running on three
cylinders.
 
foo bar baz qux

Purl said:
I have. You have not.

The OP wrote about MS Word and you entertained him with a pointless and
inconclusive story about an unrelated product: WordPerfect. After he
wrote about using a hex editor you advised him to use a hex editor.
 
foo bar baz qux

Purl said:
Barry Millman wrote:

(snipped)




Possible false end of file (eof) signal

"Possible"? Don't be such an unassertive wimp Kira, it is well known
that control-Z (hex 1A) *is* the end of file marker for text files on
MS-DOS and hence (for compatibility reasons) on Win32.

Perl opens files in text mode by default, and on Windows a text-mode
read stops at the first control-Z, so your binary file is inevitably
cut short unless you tell Perl to use binary mode (binmode).
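
A minimal sketch of reading a file in binary mode before searching it. To stay self-contained it writes its own temporary file: the first eight bytes are sample data chosen so that hex 1A lands at offset 6, as in the OP's hex dump, and the rest is made up. The search pattern is one possible form, not the OP's exact code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small stand-in for a binary document: hex 1A comes before the text.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "junk\n",
             'HYPERLINK "mydoc.doc"', "\n";
close($tmp) or die "Close failed: $!";

open(my $in, '<', $path) or die "Couldn't open $path: $!";
binmode($in);    # read raw bytes: no text-mode translation, no Ctrl-Z handling

my $matches = 0;
while (my $line = <$in>) {
    while ($line =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
        print "found: $1\n";
        $matches++;
    }
}
close($in) or die "Close failed: $!";

print "$matches match(es)\n";
```

The `binmode` call has to come after `open` and before the first read from the handle.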
 
Barry Millman

Well Purl Gurl you are the BEST!!!!!

The binmode solved the problem.

Thank you all for your help. Please don't fight!

It still seems strange that the same file, created by the same word
processor (Word 2000) would behave differently on two different versions
of the same OS.

Thanks to Bill Gates and his team for a wonderful morning.

All the best,

Barry
 
Tad McClellan

Barry Millman said:
OK. Sorry about the bad code.


Please do not send stealth Cc's.

That is considered a rude practice, so I'm moving on to
someone else's post...
 
A. Sinan Unur

Please do not send stealth Cc's.

That is considered a rude practice, so I'm moving on to
someone else's post...

Well, he seems to have found a good match (see elsethread) ;-)

Sinan
 
robic0

Note that MS Word documents are stored in a proprietary binary
gibberish format. To assume that a given word in a document will
actually always be stored in an ASCII string in the .doc file is
assuming too much. For example, perhaps it is stored in Unicode?
And maybe newer Notepad versions understand enough to present
Unicode strings? Try looking at your files with an editor that
you *know* won't munge the contents. I suggest VIM.

It is a mystery why a document would get changed while emailing
it from one system to another. Or did you perhaps open the
document with Word after emailing it, and then save it? You
don't say. Is it the same version of Word? And what email
system are you using on each of the computers? Does the same
thing happen if you zip the file, email the zipped version, and
unzip it on the other system?
[--snip--]

Yeah, "proprietary binary", that's a phrase you don't hear much.
Comparing MD5s or even checksums should resolve transmission
or open/save issues between versions/machines. Email? Maybe the AV
firewall did some elective stripping somewhere en route.
You wasted your time on this; you should have tried to code
to discern the "difference" between saves. In reality
that's what you are trying to do. Just because you can "see" some
discernible text sometimes doesn't mean it's a text stream.
You can type out a ".exe" file too. What are the odds it reads
everything to the eof sequence? Pretty good. What are the odds
it's got thousands of them in the file? Pretty good. Why?
You can't reliably code for strings in a binary stream unless
you already know the format and read the entire thing into
waiting structures. By that time you're past stream processing.
Why do you think xml was invented, or yenc or uucp? Control
codes munge up stream processing. The binary file data are
sometimes control codes when read by consoles, editors and the like.
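
The checksum comparison mentioned above could be sketched roughly as follows, using the core Digest::MD5 module. The helper name `md5_of_file` and the two temporary files are hypothetical stand-ins for the copies on the two machines.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Return the MD5 hex digest of a file, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Couldn't open $path: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Two hypothetical copies of one document, e.g. before and after e-mailing.
my @paths;
for my $content ("same\x1Abytes", "same\x1Abytes") {
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh);
    print {$fh} $content;
    close($fh) or die "Close failed: $!";
    push @paths, $path;
}

my ($sent, $received) = map { md5_of_file($_) } @paths;
print $sent eq $received ? "identical\n" : "files differ\n";
```

Matching digests would rule out changes in transit or on re-save, leaving the reading code as the place to look.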

There is no solution to the OP's problem, there is none.
The approach is wrong. He made what engineers call a "conceptual error".
"It worked once" is not proof of concept! Given binary structured data
files, it is absolutely, positively, impossible to treat them as
streaming text in ANY search capacity, unless controls can be
discerned from data at the search core api routines, and that's
not what it does. You can't monitor or change fast enough the api
concept of control codes. The attempt is a bridge to nowhere.
It's a good bridge but the traffic drives off the end.
 
