how to parse an executable in C and find out if there is any return(RET in assembly) or not

W

Walter Roberson

Could you elaborate on what is "The Halting Problem". I'm not able to
remember it, but I did hear of it some time back.

Suppose you have program source X available as input ("source" can
include machine language in this context), and you want to determine
whether that program will eventually terminate for all possible inputs
to X, or if X instead might infinite loop on -some- possible input.
Can you write a program H which will read the source X
and tell you whether X will always terminate (or to use another term,
whether X will always halt)?

If you are given X ahead of time, you might be able to write your
halting test program H to determine whether that one program X will
ever halt, and that same program H might work for some other programs
that are very similar to X. With a really clever program H and a big
enough computer, you might even be able to handle a lot of different
programs -- you might even be able to properly answer the question
for trillions of different possible input programs. It turns out,
though, that it is IMPOSSIBLE to write your halting test program H
to handle *all* possible input programs.


If you had an executable image and you knew the exact format of the
image, and if the format told you exactly where the instruction
sequence started and stopped, and if the machine/OS strictly disallowed
executing data, and if you could *reliably* tell what all the
instructions were (i.e., it was somehow impossible to branch into the
"middle" of an instruction), then you could do a mechanical search to
determine whether there was a RETURN instruction somewhere in the
executable stream... but figuring out whether some RETURN will always
be -reached- is *much* harder.

If the machine/OS does allow data to be executed, then again the
problem becomes hard, because you would need to determine whether the
program ever -writes- a RETURN instruction into a location that it will
definitely execute.

If the machine allows you to branch into the middle of an instruction
byte sequence and the machine allows branch addresses to be calculated
then the problem is difficult again, because you would have to figure
out whether a byte sequence that happens to have the same value as
the RETURN instruction might ever be reached by an unusual branch.
For example, if the RETURN instruction is 83 and if "left shift
address register #8 by 3 bits" happens to be 27 83, then the 83
is just "random chance" if you execute in a straight forward manner,
but if there happens to be a branch to the byte past the 27 then
suddenly there's your RETURN instruction. (And it's not as simple
as that, even -- the branch might be to an unexpected boundary somewhere
hundreds of bytes before and all of the reinterpreted instructions
just happen to come out such that the 27 is part of a previous
instruction in the new execution, eventually leading to the 83 RETURN...)


To expound a bit further: your original question was wide open
and did not place any restrictions as to specific executable format,
machine language, amount of machine memory, available hard disk space,
and so on. That made the question equivilent to asking, "Is it
always possible to do this?", and as I discussed briefly above, the
answer to "always possible?" is NO, even though you might be able to
answer properly for trillions of different executables.

If you pin down to one *particular* set of machine specifications, and
place a fixed pre-determined limit on the amount of memory available to
the program that is to be tested, then the answer reverses itself and
becomes YES: for any pre-specified fixed-resource deterministic
computer system, you *can* always answer the question of whether a
given input program {that is within the pre-specified resource limits}
will always halt or not.

The difficulty, though, is that if the program to be examined is
allowed to use M resource slots (e.g., M bytes of memory including
machine registers), and if each of the resource slots can have S
different states (e.g., each 8-bit byte can represent any of 256
different values), then the resources needed for the testing program
would be at least S**(M-1) resource slots where ** indicates
exponentiation. For example, if the programs to be tested were
restricted to a mere 64 Kbyte of memory, then the testing program would
need access to at least 8**65535 bytes of memory... which is a number
more than 59000 digits long. And there isn't that much memory in the
universe :( [If you were to use every elementary particle in the
universe and were able to extract 2 bits of information from each
of them, then you would be able to run halting tests on programs
that had access to about 34 bytes of memory. With a gigabyte of
memory, you'd be able to run halting tests on programs that
had access to about 12 bytes of memory.]
 
P

priyanka

Hi All,

Thanks for your reply.
I was just asking how do people dissassemble it ? I know that after the
binary is dissambled, we can get the objdump and I can find out from
there whether my program has RET(or any other instruction) or not. But
my question is how do people disassemble it ? Now, I know the commands
to dissamble the binary but I was asking for the logic ? Like, how can
they get the dissambled code with all the right instructions by only
looking at the binary ? Because for me all the binaries look like same
:). I wanted to know the overall algorithm or logic behind it.

Thank you,
Priyanka
 
K

Keith Thompson

priyanka said:
I was just asking how do people dissassemble it ? I know that after the
binary is dissambled, we can get the objdump and I can find out from
there whether my program has RET(or any other instruction) or not. But
my question is how do people disassemble it ? Now, I know the commands
to dissamble the binary but I was asking for the logic ? Like, how can
they get the dissambled code with all the right instructions by only
looking at the binary ? Because for me all the binaries look like same
:). I wanted to know the overall algorithm or logic behind it.

Please don't top-post. Read <http://www.caliburn.nl/topposting.html>.

Your question has nothing at all to do with the C programming
language. I'm not sure what would be an appropriate place to ask it
(probably in a newsgroup for your platform), but this isn't it.
 
W

Walter Roberson

I was just asking how do people dissassemble it ? I know that after the
binary is dissambled, we can get the objdump and I can find out from
there whether my program has RET(or any other instruction) or not. But
my question is how do people disassemble it ? Now, I know the commands
to dissamble the binary but I was asking for the logic ? Like, how can
they get the dissambled code with all the right instructions by only
looking at the binary ? Because for me all the binaries look like same
:). I wanted to know the overall algorithm or logic behind it.

On most systems, an executable program is a very structured binary
file. One part of the information will be either an offset to
where the code begins, or else a marker saying "This next bit is code".

Once the location of the code is known, the disassembler program
just essentially emulates the way the CPU itself would decode
instructions, except that it does not actually take the action
specified by the instruction: instead it just prints out
a representation of the instruction and moves on to the next one.

The actual decoding of instructions can often be handled by lookup
tables that consist of masks (to be AND'd with the instruction)
and comparison values -- when the AND'd value compares equal then
you have found the right decoding class (i.e., which instruction
-format- is being used) and you can then use more specific logic
to get the rest of the information.
 
S

santosh

Please don't top-post. Your reply should follow, or be interspersed
with, the material you quote.
Hi All,

Thanks for your reply.
I was just asking how do people dissassemble it ? I know that after the
binary is dissambled, we can get the objdump and I can find out from
there whether my program has RET(or any other instruction) or not. But
my question is how do people disassemble it ? Now, I know the commands
to dissamble the binary but I was asking for the logic ? Like, how can
they get the dissambled code with all the right instructions by only
looking at the binary ? Because for me all the binaries look like same
:). I wanted to know the overall algorithm or logic behind it.

Why don't you read Walter Roberson's excellent article on the halting
problem elsethread. Essentially, disassembly reduces to the halting
problem. Now while you can disassemble an object file to a set of
mnemonics easily enough, deciding whether the result is "correct", i.e.
is semantically equivalent to the original object code, is a *far*
harder proposition.

Moreover, disaasembly is a general programming question. Unless you
have a C related question, please post in a more appropriate newsgroup
like comp.programming or alt.lang.asm. If you'll be disassembling x86
object files, comp.lang.asm.x86 is also a better place. Over the years,
there has been a lot of discussion on this issue in alt.lang.asm.

The bottom line is, unless you're very well versed with your target's
instruction set architechture, assembly language, code flow analysis
and so on and so on, (which you don't appear to be), it would be
preferable to use an existing disassembler like IDAPro etc.

Use Google to search for "Disassembly Engines", "Disassemblers",
"Halting problem" etc.
 
G

Gordon Burditt

On most systems, an executable program is a very structured binary
file. One part of the information will be either an offset to
where the code begins, or else a marker saying "This next bit is code".

It's often not that simple. The (read-only) code section may contain
a bunch of other stuff that is not code but is still in the code
section: string literals, branch tables, floating-point constants
on machines that don't have a "load double immediate" instrution,
etc. Some compilers put this in a separate section, (read-only
data vs. read-only code) some don't. On some machines there is
an advantage in having the other stuff "near" the code that uses
it (a short offset is used).
Once the location of the code is known, the disassembler program
just essentially emulates the way the CPU itself would decode
instructions, except that it does not actually take the action
specified by the instruction: instead it just prints out
a representation of the instruction and moves on to the next one.

This approach often generates a big confusing mess, largely due to
decoding string literals and other data as machine code.

I once wrote a Z80 multi-pass disassembler that tried to trace code
execution paths. It knew about things like conditional and
unconditional branches, and it decoded things as instructions only
that stuff that were referenced as instrutions. It also noticed
locations being referenced as 2-byte or 4-byte data values and
displayed them appropriately. It tried to identify references to
strings. Unfortunately, a lot of stuff was still not decoded because
it didn't understand well the many different variants of code
generated through C's switch construct or calling through a function
pointer, and you also had to give it "hints" about entry points to
the program that weren't standard. For other machine architectures,
you might have to track fixed values in registers loaded a few
instructions before to actually figure out where a memory reference
was going.

Then there were the odd constructs to deliberately trip up
disassemblers, for example, on the 8080/Z80:

label0: db 0xc3 ; fool disassemblers
label1: in 5

If you disassemble this starting at label0:, you get a branch instrution
which is 3 bytes long. But something branches *into* the middle
of the instruction, it's an I/O instruction. And then there's self-modifying
code. On the 8080 to do I/O to a variable port number, you *had* to
modify the instruction or make a giant switch for all possible port numbers.
The actual decoding of instructions can often be handled by lookup
tables that consist of masks (to be AND'd with the instruction)
and comparison values -- when the AND'd value compares equal then
you have found the right decoding class (i.e., which instruction
-format- is being used) and you can then use more specific logic
to get the rest of the information.

Gordon L. Burditt
 
M

Malcolm

Walter Roberson said:
If the file format and OS and language restrictions are such
that it is possible to place into execution a section marked as
data, then figuring out whether there is a machine return instruction
or not is equivilent to solving The Halting Problem. I believe it
is generally agreed that The Halting Problem is not solvable in
standard C.
You mean place into data a section marked as execution.
Once you've done that, you can run a C program on the executable file to
find out all sorts of interesting things, provided you know the format.
Maybe if you don't know the format if you are a cryptologist of professional
hacker.
The program can be as portable as makes no difference. Technically you could
run on a file system that can't handle the byte size of the original.
 
C

CBFalconer

Gordon said:
.... snip ...

On the 8080 to do I/O to a variable port number, you *had* to
modify the instruction or make a giant switch for all possible
port numbers.

No you didn't. You could write and execute a baby routine on the
stack, consisting of "in port; ret; nop". The whole thing was
completely reentrant.

You have neglected to include attributions for material you
quoted. Please do not do this again. It is tantamount to
plagiarism.

--
Some informative links:
http://www.geocities.com/nnqweb/
http://www.catb.org/~esr/faqs/smart-questions.html
http://www.caliburn.nl/topposting.html
http://www.netmeister.org/news/learn2quote.html
 
G

Gordon Burditt

On the 8080 to do I/O to a variable port number, you *had* to
No you didn't. You could write and execute a baby routine on the
stack, consisting of "in port; ret; nop". The whole thing was
completely reentrant.

That's still self-modifying code. It's going to be very, very
difficult to get a disassembler to trace through that.
You have neglected to include attributions for material you
quoted. Please do not do this again. It is tantamount to
plagiarism.

attribution = misattribution, as it's been made apparent recently
in this newsgroup that there is not universal agreement about what
the attributions mean.

Gordon L. Burditt
 
D

David Wade

priyanka said:
Hi,

I was wondering if we could parse or do something in the executable(
whose source language was C). How can I use some scripting language
like perl/python to find out the information about the executable ? Is
it possible ?

In general, for compiled languages, the original source language of an
executable usually has little influence on the final executable.
Its often possible to combine modules written in several languages.
Also, how does the compiler add inling to the program ?

Basically it creates a copy of the procedure...
I know that
whenever it sees"inline" in front of the procedure name, it inlines it.
But if we give the -finline options, it inline all the procedures ?

I doubt if it really inlines all procedures....
How
does it do that ? does it parse ? Is there any good book or article
that I can refer to ?

Probably all very techy,
I need to build a small compiler ? Can I build it?

It depends on what you mean by "small". To get small you have to take things
out. Probably easiest to adapt one of the many "small c" or "tiny c"
varients...
Thanks for reading all of my questions :)

Thats ok, it passes the time,
-priyanka

dave
 
F

Flash Gordon

Gordon Burditt wrote:

attribution = misattribution, as it's been made apparent recently
in this newsgroup that there is not universal agreement about what
the attributions mean.

Only you seem to think so, everyone else who expresses an opinion wants
attributions. There have been one or two cases where mistakes have been
made, yes, but none where it was not sorted out quickly.
 
C

CBFalconer

Gordon said:
That's still self-modifying code. It's going to be very, very
difficult to get a disassembler to trace through that.

No it isn't. No code is ever modified. It is written. The trick
does require that you can execute code from the stack, which is no
problem with the 8080.
attribution = misattribution, as it's been made apparent recently
in this newsgroup that there is not universal agreement about what
the attributions mean.

As I pointed out, quoting without attribution is effectively
plagiarism. Every newsreader of which I am aware will construct
the attribution lines automatically. There is no need to steal the
words of others.

--
Some informative links:
http://www.geocities.com/nnqweb/
http://www.catb.org/~esr/faqs/smart-questions.html
http://www.caliburn.nl/topposting.html
http://www.netmeister.org/news/learn2quote.html
 
K

Keith Thompson

attribution = misattribution, as it's been made apparent recently
in this newsgroup that there is not universal agreement about what
the attributions mean.

There is nearly universal agreement. Without you, there would be
universal agreement.

You're wrong, and the rest of us are right.

Last time we discussed this, you were asked several questions. You
never answered them. My tentative conclusion is that you just don't
know what you're talking about, but I'd be glad to be proven wrong.
 
W

Walter Roberson

You mean place into data a section marked as execution.

No I don't. To be able to place a memory section into execution
means that it is possible to (somehow) have the CPU use the
memory section as the source of instructions that are directly
executed. To use other phrasing, to be able to branch into the data.

Being able to branch into data is possible on -most- systems, but
the memory management controls on -some- systems are able to
designate some memory pages as non-executable (in a manner very
similar to being able to mark memory pages as non-writable.)
On such systems, if an attempt is made to branch into the data,
the system detects the attempt and faults the program.

[And there's also the Harvard architecture, in which code and
data are in seperate memory address spaces.]

Once you've done that, you can run a C program on the executable file to
find out all sorts of interesting things, provided you know the format.

If you are able to fopen() the executable itself in binary read
mode, you don't need to "place into data a section marked as
execution". I was talking about something else completely: that
if the program is able to start executing data, then you cannot
in the general case statically prove that the program will not
at -some- point synthesis a RETURN instruction to execute.
 
M

Malcolm

Walter Roberson said:
If you are able to fopen() the executable itself in binary read
mode, you don't need to "place into data a section marked as
execution". I was talking about something else completely: that
if the program is able to start executing data, then you cannot
in the general case statically prove that the program will not
at -some- point synthesis a RETURN instruction to execute.
I misunderstood.
The post is clear now.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,183
Messages
2,570,967
Members
47,520
Latest member
KrisMacono

Latest Threads

Top