counting number of occurrences of every possible substring in multiple files

C3

I am trying to write a program that reads multiple files and prints out the
number of occurrences of n-length byte sequences across these files. The
value of n must be specified on the command line.

Since I'll be dealing with binary files, I want the ASCII codes of the
characters printed out.

e.g. for n=2 and the following 3 files, contents shown as integers,

f1 = {33, 84, 55}, f2 = {84, 55, 12}, f3 = {33, 84, 55}

I want output like this:
3 84 55
2 33 84

I'll be dealing with files up to about one megabyte in size. Efficiency is
not critical, and it does not matter, say, if a length-2 sequence is a
substring of a length-3, or a more frequently occurring sequence. Values of
n will not go above 10.
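
A minimal sketch of the counting described above, not a definitive implementation. It assumes that sequences never span two files, that each file (about one megabyte) can be slurped into memory, and that sequences seen only once are omitted, which is what the sample output suggests. The names count_ngrams and report_lines and the script name ngrams.pl are made up for illustration.

```perl
use strict;
use warnings;

# Count every n-byte window in one file, adding to the shared %$counts hash.
# Each file is scanned separately, so no window spans two files.
sub count_ngrams {
    my ($path, $n, $counts) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                        # raw bytes, no newline translation
    my $data = do { local $/; <$fh> };   # slurp the whole file
    close($fh);
    $data = '' unless defined $data;
    for my $i (0 .. length($data) - $n) {
        $counts->{ substr($data, $i, $n) }++;
    }
    return;
}

# Build lines of the form "count b1 b2 ..." for sequences seen at least
# $min times (assumption: $min = 2 matches the sample output), most
# frequent first; ties are broken by byte order to keep output stable.
sub report_lines {
    my ($counts, $min) = @_;
    my @keys = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
               grep { $counts->{$_} >= $min } keys %$counts;
    return map { join(' ', $counts->{$_}, unpack('C*', $_)) } @keys;
}

# Usage (hypothetical script name): perl ngrams.pl N file1 file2 ...
if (@ARGV) {
    my ($n, @files) = @ARGV;
    die "n must be a positive integer\n" unless $n =~ /^[1-9]\d*$/;
    my %counts;
    count_ngrams($_, $n, \%counts) for @files;
    print "$_\n" for report_lines(\%counts, 2);
}
```

The hash keys are the raw n-byte strings, and unpack('C*', ...) turns each key back into byte values for printing. With n up to 10 and files around one megabyte, the hash can grow to roughly one entry per byte of input, which should still fit comfortably in memory.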
 
Tad McClellan

C3 said:
I am trying to write a program that reads multiple files and prints out the
number of occurrences of n-length byte sequences across these files. The
value of n must be specified on the command line.

Since I'll be dealing with binary files,


perldoc -f binmode

I want the ASCII codes of the
characters printed out.


Huh?

If it is a text file, then it contains ASCII codes.

If it is a binary file, then it may contain some other encoding.

Anyway,

perldoc -f chr
perldoc -f ord

e.g. for n=2 and the following 3 files, contents shown as integers,

f1 = {33, 84, 55}, f2 = {84, 55, 12}, f3 = {33, 84, 55}

I want output like this:
3 84 55
2 33 84

I'll be dealing with files up to about one megabyte in size. Efficiency is
not critical, and it does not matter, say, if a length-2 sequence is a
substring of a length-3, or a more frequently occurring sequence. Values of
n will not go above 10.


Did you mean to ask a question?

What is it that you need help with?

Are you asking for someone to write a program to your specification
for you? It kind of sounds that way...
 
Paul Lalli

C3 said:
I am trying to write a program that reads multiple files and prints out the
number of occurrences of n-length byte sequences across these files. The
value of n must be specified on the command line.

Since I'll be dealing with binary files, I want the ASCII codes of the
characters printed out.

e.g. for n=2 and the following 3 files, contents shown as integers,

f1 = {33, 84, 55}, f2 = {84, 55, 12}, f3 = {33, 84, 55}

I want output like this:
3 84 55
2 33 84

I'll be dealing with files up to about one megabyte in size. Efficiency is
not critical, and it does not matter, say, if a length-2 sequence is a
substring of a length-3, or a more frequently occurring sequence. Values of
n will not go above 10.

Do you realize that nowhere in here did you ask a question? What is it
you need help with? What part are you stuck on? What have you tried so
far, and how did your attempt fail to work correctly?

Paul Lalli
 
Brian McCauley

Tad said:
perldoc -f binmode

Huh?

If it is a text file, then it contains ASCII codes.

Er, that's not the general case. If it's a text file, it contains
sequences of bytes that encode code points in some character set encoding.

With any luck there's a mapping from that encoding onto Unicode.

If you are even luckier you'll be able to simply tell Perl about the
encoding and it will read the file as a series of Unicode code points.
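
One way to tell Perl about the encoding is an :encoding(...) layer on open. The following is a small sketch under the assumption that you already know what the file's encoding is; the helper name read_code_points is made up for illustration.

```perl
use strict;
use warnings;

# Read a text file through a decoding layer and return its code points.
# $encoding must name the encoding the file really uses, for example
# 'UTF-8' or 'ISO-8859-1'; Perl does not guess it.
sub read_code_points {
    my ($path, $encoding) = @_;
    open(my $fh, "<:encoding($encoding)", $path)
        or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };   # slurp the decoded text
    close($fh);
    $text = '' unless defined $text;
    return map { ord($_) } split //, $text;
}

# Usage (hypothetical script name): perl codepoints.pl file [encoding]
if (@ARGV) {
    my ($path, $encoding) = @ARGV;
    printf "U+%04X\n", $_ for read_code_points($path, $encoding || 'UTF-8');
}
```

The same three bytes 0xC3 0xA9 0x41 come back as two code points (U+00E9, U+0041) when read as UTF-8, but as three (U+00C3, U+00A9, U+0041) when read as ISO-8859-1, which is why the encoding has to be stated.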

In the simplest case an ASCII text file contains only bytes with values
0x00-0x7F, which directly encode the corresponding code points in ASCII.

But isn't it the default for Perl to assume that text files are
UTF-8? (Of course, any well-formed ASCII text file is also a well-formed
UTF-8 text file.)

If it is a binary file, then it may contain some other encoding.

If it's a binary file then it's a series of bytes. These bytes may or
may not encode characters.
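
To make that concrete, here is a rough sketch that reads a file as plain bytes and then merely tests whether those bytes happen to be well-formed UTF-8. The helper names read_bytes and try_utf8 are made up, and UTF-8 is just one encoding picked for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Read a file as raw bytes. binmode turns off newline translation, so
# every "character" of the returned string is one byte (0..255).
sub read_bytes {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    my $bytes = do { local $/; <$fh> };
    close($fh);
    return defined $bytes ? $bytes : '';
}

# Return the decoded string if $bytes is well-formed UTF-8, else undef.
# Works on a copy because decode() with a CHECK argument may modify
# its input.
sub try_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
}

# Usage (hypothetical script name): perl bytes.pl file
if (@ARGV) {
    my $bytes = read_bytes($ARGV[0]);
    print join(' ', unpack('C*', $bytes)), "\n";   # the byte values
    my $verdict = defined(try_utf8($bytes)) ? 'valid UTF-8' : 'not valid UTF-8';
    print "$verdict\n";
}
```

The byte values printed by unpack('C*', ...) are the same whether or not the decode succeeds. Passing the UTF-8 test only shows that the bytes could be read as text, not that they were meant to be.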
 
