Intel processors can only process machine language[...] There's no
way for a processor to understand any higher level language, even
assembler, since it is written with hexadecimal codes and basic
instructions like MOV, JMP, etc. The assembler compiler can
convert an assembler file to a binary executable, which the
processor can understand.
This may be true, but I think it's not bad to assume that machine
language and assembler are "almost the same" in this context, since
the translation between them is non-ambiguous (It's
just "recoding"; this is not the case with HLLs).
I have no problem with your explanation. It's nearly impossible to
program in machine code, which is all 1's and 0's. Assembler makes it
infinitely easier by converting the machine 1's and 0's to their
hexadecimal equivalent and assigning an opcode name to them, like
PUSH, MOV, CALL, etc.
Still, the older machine-programmable processors used switches to set
the 1's and 0's. Or, the machine code was fed in on perforated cards
or tapes that were read. The computer read the switches, cards or
tapes, and set voltages according to what it scanned.
The difference is that machine code can be read directly, whereas
assembler has to be assembled in order to convert the mnemonics to
binary data.
(Not really -- object code files are composed of header data and
different segments, data and code, and only the code segments are
really meaningful to the processor.)
I agree that the code segments, and the data, are all that's
meaningful to the processor. There are a few others, like interrupts
that affect the processor directly.
I understand what you're saying but I'm referring to an executable file
ready to be loaded into memory. It's stored on disk in a series of 1's
and 0's. As you say, there are also control codes on disk to separate
each byte along with CRC codes, timing codes, etc. However, that is
all stripped off by the hard drive electronics.
The actual file on disk is in a certain format that only the operating
system understands. But once the code is read in, it goes into memory
locations which hold individual arrays of bits. Each memory location
holds a precise number of bits corresponding to the particular code it
represents. For example, the ret instruction you mention below is
represented by hex C3 (0xC3), which stands for the bits 11000011.
That's a machine code: counting from 00000000 to 11111111, one byte
gives you 256 different codes. When those 1's and 0's are
converted to voltages, the computer can analyze them and set circuits
in action which will bring about the desired operation. Since Linux is
written in C, it must convert down to machine code, just as Windows
must.
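As a quick check of the C3 example, here is a minimal Python sketch
(Python is only my choice because it's what the thread is about) that
prints the byte as bits and counts how many one-byte codes exist. The
variable names are made up for illustration.

```python
# Show the bit pattern behind the hex byte C3 and count the one-byte codes.
ret_opcode = 0xC3

bits = format(ret_opcode, "08b")   # eight binary digits, zero padded
one_byte_codes = 2 ** 8            # 00000000 through 11111111

print(bits)            # 11000011
print(one_byte_codes)  # 256
```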
But you _do_ know that pyc files are Python byte code, and that you
could only disassemble them directly to Python byte code?
That's the part I did not understand, so thanks for pointing that out.
What I disassembled did not make sense. I was looking for assembler
code, but I do understand a little bit about how the interpreter reads
them.
For example, from os.py, here's part of the script:
# Note: more names are added to __all__ later.
__all__ = ["altsep", "curdir", "pardir", "sep", "pathsep", "linesep",
"defpath", "name", "path", "devnull"]
Here's the disassembly from os.pyc:
00000C04  06 00 00 00           dd 6
00000C08  61 6C 74 73 65 70 74  db 'altsept'
00000C0F  06 00 00 00           dd 6
00000C13  63 75 72 64 69 72 74  db 'curdirt'
00000C1A  06 00 00 00           dd 6
00000C1E  70 61 72 64 69 72 74  db 'pardirt'
00000C25  03 00 00 00           dd 3
00000C29  73 65 70              db 'sep'
00000C2C  74 07 00 00           dd 774h
00000C30  00                    db 0
00000C31  70 61 74 68 73 65 70  db 'pathsep'
00000C38  74 07 00 00           dd 774h
00000C3C  00                    db 0
00000C3D  6C 69 6E 65 73 65 70  db 'linesep'
00000C44  74 07 00 00           dd 774h
00000C48  00                    db 0
00000C49  64 65 66 70 61 74 68  db 'defpath'
00000C50  74 04 00 00           dd offset unk_474
00000C54  00                    db 0
00000C55  6E 61 6D 65           db 'name'
00000C59  74 04 00 00           dd offset unk_474
00000C5D  00                    db 0
00000C5E  70 61 74 68           db 'path'
00000C62  74 07 00 00           dd 774h
00000C66  00                    db 0
00000C67  64 65 76 6E 75 6C 6C  db 'devnull'
You can see all the ASCII names in the disassembly like altsep,
curdir, etc. I'm not clear as to why they are all terminated with 0x74
= t, or if that's my poor interpretation. Some ASCII strings don't use
a 0 terminator. The point is that all the ASCII strings have numbers
between them which mean something to the interpreter. Also, they are
at a particular address. The interpreter has to know where to find
them.
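For what it's worth, the layout in that dump looks like the string
records of a Python 2 marshal stream: a one-byte type code ('t', 0x74,
for an interned string), a 4-byte little-endian length, then the
characters, with no terminator. If that is right, each 0x74 belongs to
the record that follows it rather than ending the name before it, and
the x86 disassembler has simply grouped the bytes wrongly. The sketch
below walks the bytes from the dump under that assumption. The leading
0x74 at 00000C03 is not in the listing and is assumed, and read_strings
is a made-up helper name.

```python
import struct

# Bytes from the os.pyc dump, starting one byte earlier (0xC03).
# ASSUMPTION: the byte at 0xC03 is 0x74 ('t'); the listing starts at 0xC04.
DUMP = bytes.fromhex(
    "74 06000000 616C74736570 "      # altsep
    "74 06000000 637572646972 "      # curdir
    "74 06000000 706172646972 "      # pardir
    "74 03000000 736570 "            # sep
    "74 07000000 70617468736570"     # pathsep
)

def read_strings(data):
    """Hypothetical helper: read consecutive 't' string records."""
    names = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1] != b"t":
            break                                  # not a string record
        (length,) = struct.unpack_from("<I", data, pos + 1)
        start = pos + 5                            # skip type byte + length
        names.append(data[start:start + length].decode("ascii"))
        pos = start + length
    return names

print(read_strings(DUMP))
# ['altsep', 'curdir', 'pardir', 'sep', 'pathsep']
```

Read this way, the "intervening numbers" are just the type code and
length of each name.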
The script is essentially gone. I'd like to know how to read the pyc
files, but that's getting away from my point that there is a link
between Python scripts and assembler. At this point, I admit the code
above is NOT assembler, but sooner or later it will be converted to
machine code by the interpreter and the OS and that can be
disassembled as assembler.
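One way to look inside is the standard dis and marshal modules. A rough
sketch: the body of a pyc file is a marshalled code object, so the
snippet below marshals a compiled line modelled on os.py, loads it
back, and lists the byte code instructions. It deliberately skips the
pyc file header, whose size differs between Python versions, and the
exact instruction names printed also vary by version. The file name
os_excerpt.py and the variable names are only illustrative.

```python
import dis
import marshal

source = '__all__ = ["altsep", "curdir", "pardir", "sep"]\n'

# Compile to a code object, then round-trip it through marshal,
# the serialization a pyc file uses after its header.
# "os_excerpt.py" is a hypothetical file name used only for labelling.
code = compile(source, "os_excerpt.py", "exec")
restored = marshal.loads(marshal.dumps(code))

# Each instruction is Python byte code, not x86 machine code.
opnames = [ins.opname for ins in dis.get_instructions(restored)]
for ins in dis.get_instructions(restored):
    print(ins.offset, ins.opname, ins.argrepr)
```

The names that show up (STORE_NAME and friends) are instructions for
the Python virtual machine, which is why an x86 disassembler makes no
sense of them.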
I realize this is a complicated process and I can understand people
thinking I'm full of beans. Python needs an OS like Windows or Linux
to interface it to the processor. And all a processor can understand
is machine code.
No, assembly language source is readable text like this (gcc):
.LCFI4:
        movl    $0, %eax
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
Yes, the source is readable like that, but the compiled binary is not.
A disassembly shows both the source and the opcodes. The ret
statement above is a mnemonic for hex C3 in assembler. You have left
out the opcodes. Here's another example of assembler, which is
disassembled from python.exe:
1D001250  FF 74 24 04     push [esp+arg_0]
1D001254  E8 D1 FF FF FF  call 1D00122A
1D001259  F7 D8           neg eax
1D00125B  1B C0           sbb eax, eax
1D00125D  F7 D8           neg eax
1D00125F  59              pop ecx
1D001260  48              dec eax
1D001261  C3              retn
The first column is obviously the address in memory. The second column
holds the opcodes, and the third column holds the mnemonics, English
words attached to the codes to give them meaning. The second and third
columns mean the same thing.
Single-byte instructions like 59 = pop ecx and 48 = dec eax are
self-explanatory. 59 is hexadecimal for binary 01011001, which is a
binary code. When a processor receives that binary as voltages, it is
wired to pop the top of the stack into the ecx register.
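To make the link between opcode and mnemonic concrete, here is a small
sketch of a decoder for just the one-byte instructions in the listing.
It assumes 32-bit mode, where opcodes 48 to 4F are dec and 58 to 5F are
pop, with the low three bits selecting the register. decode_one_byte is
a made-up helper, nowhere near a real disassembler.

```python
# 32-bit register numbering used in the low 3 bits of these opcodes.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_one_byte(opcode):
    """Hypothetical helper covering only dec r32, pop r32 and ret."""
    if 0x48 <= opcode <= 0x4F:
        return "dec " + REGS[opcode - 0x48]
    if 0x58 <= opcode <= 0x5F:
        return "pop " + REGS[opcode - 0x58]
    if opcode == 0xC3:
        return "ret"
    return "unknown"

# Print hex, binary and mnemonic side by side.
for byte in (0x59, 0x48, 0xC3):
    print(format(byte, "02X"), format(byte, "08b"), decode_one_byte(byte))
```

Note that pop ecx takes the value on top of the stack and puts it in
ecx; storing ecx on the stack would be push ecx, opcode 51.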
The second instruction, call 1D00122A, is not as straightforward. It
is made up of two parts: E8 = the opcode for CALL, and the rest, 'D1 FF
FF FF', is the operand, the data which the call is
referencing. In this case it gives the location of the routine being
called, as an offset rather than an absolute address. The bytes are
stored in reverse (little-endian) order, which is the convention on
Intel processors. D1 FF FF FF actually means FF FF FF D1.
The leading F's mark the operand as a negative number in two's
complement, telling the processor to jump back. The signed number
FFFFFFD1 = -2F. A call counts from the address of the instruction that
follows it, which is 1D001259, and 1D001259 - 2F = 1D00122A, the
address being called.
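The arithmetic can be checked with a few lines of Python. This is only
a sketch, assuming the E8 form of call with a 32-bit displacement that
is measured from the address of the following instruction (1D001254 + 5
= 1D001259), which makes the displacement -2F. The variable names are
mine.

```python
import struct

# Bytes of the call instruction at address 0x1D001254.
address = 0x1D001254
instruction = bytes([0xE8, 0xD1, 0xFF, 0xFF, 0xFF])

# The 4 bytes after the E8 opcode are a signed little-endian displacement.
(displacement,) = struct.unpack("<i", instruction[1:])

# The displacement is relative to the address of the next instruction.
next_instruction = address + len(instruction)
target = (next_instruction + displacement) & 0xFFFFFFFF

print(hex(displacement & 0xFFFFFFFF))  # 0xffffffd1
print(displacement)                    # -47, i.e. -0x2F
print(hex(next_instruction))           # 0x1d001259
print(hex(target))                     # 0x1d00122a
```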
As you can see, it's all done with binary codes. The English
statements are purely for the convenience of the programmer. If you
look at the Intel definitions for assembler instructions, they list both
the opcodes and the mnemonics.
I would agree with what you said earlier, that there is a similarity
between machine code and assembler. You can actually write in machine
code, but it is often entered in hexadecimal, requiring a hex to
binary interpreter. In that case, the similarity to compiled assembler
is quite close.
Machine language is binary codes, yes.
If I knew what the intervening numbers meant I could.
By definition, you can read every file in a hex editor ...
Not at all. Again: It's Python byte code. Try experimenting with
pdb.
I will eventually...thanks for reply.