Translating Python to Assembler


Chris Mellon

Really? How do you do that?

I thought it might be compile(), but apparently not.

There are tools for it in the undocumented compiler.pyassem module.
You have to pretty much know what you're doing already to use it - I
spent a fun (but unproductive) week figuring out how to use it and
generated customized bytecode for certain list comps. Malformed
hand-generated bytecode stuffed into code objects is one of the few
ways I know of to crash the interpreter without resorting to calling C
code, too.
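
For readers who want to look at bytecode without touching compiler.pyassem, here is a minimal sketch using the documented built-in compile() and the dis module. It assumes a current CPython 3; the one-line source string and the "<demo>" filename are made up for illustration, and the exact opcodes printed vary between versions.

```python
import dis

# Hypothetical one-line program, used only for illustration.
source = "squares = [n * n for n in range(5)]"

# compile() turns source text into a code object; it does not emit
# machine code or assembler.
code = compile(source, "<demo>", "exec")

# co_code holds the raw bytecode for CPython's virtual machine.
raw_bytecode = code.co_code

# dis.dis() prints a readable listing of that bytecode.
dis.dis(code)
```

The listing names virtual-machine instructions, not processor instructions, which is the distinction the rest of this thread turns on.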
 

Christian Heimes

Paul said:
Well, it is important to make distinctions when people are wondering,
"If Python is 'so slow' and yet everyone tells me that the way it is
executed is 'just like Java', where does the difference in performance
come from?" Your responses seemed to focus more on waving that issue
away and leaving the whole topic in the realm of mystery. The result:
"Python is just like Java apparently, but it's slower and I don't know
why."

Short answer: Python doesn't have a Just In Time (JIT) compiler. While
Java's JIT optimizes the code at run time Python executes the byte code
without additional optimizations.

Christian
 

Grant Edwards

Once a python py file is compiled into a pyc file, I can disassemble
it into assembler.

No you can't. It's not native machine code. It's byte code
for a virtual machine.
Assembler is nothing but codes, which are combinations of 1's
and 0's. You can't read a pyc file in a hex editor, but you
can read it in a disassembler.

NO YOU CAN'T.
It doesn't make a lot of sense to me right now,

That's because IT'S NOT MACHINE CODE.
but if I was trying to trace through it with a debugger,

That wouldn't work.
the debugger would disassemble it into assembler,
not python.

You can "disassemble" random bitstreams into assembler. That
doesn't make it a useful thing to do.

[Honestly, I think you're just trolling.]
 

Bjoern Schliessmann

Intel processors can only process machine language[...] There's no
way for a processor to understand any higher level language, even
assembler, since it is written with hexadecimal codes and basic
instructions like MOV, JMP, etc. The assembler compiler can
convert an assembler file to a binary executable, which the
processor can understand.

This may be true, but I think it's not bad to assume that machine
language and assembler are "almost the same" in this context, since
the translation between them is non-ambiguous (It's
just "recoding"; this is not the case with HLLs).
Both Linux and Windows compile down to binary files, which are
essentially 1's and 0's arranged in codes that are meaningful to
the processor.

(Not really -- object code files are composed of header data and
different segments, data and code, and only the code segments are
really meaningful to the processor.)
Once a python py file is compiled into a pyc file, I can
disassemble it into assembler.

But you _do_ know that pyc files are Python byte code, and that you
could only disassemble them directly to Python byte code?
Assembler is nothing but codes, which are combinations of 1's and
0's.

No, assembly language source is readable text like this (gcc):

.LCFI4:
movl $0, %eax
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret

Machine language is binary codes, yes.
You can't read a pyc file in a hex editor,

By definition, you can read every file in a hex editor ...
but you can read it in a disassembler. It doesn't make a lot of
sense to me right now, but if I was trying to trace through it
with a debugger, the debugger would disassemble it into
assembler, not python.

Not at all. Again: It's Python byte code. Try experimenting with
pdb.
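
A minimal sketch of what disassembling a pyc file to Python byte code can look like. It assumes CPython 3.7 or later, where the file starts with a 16-byte header; older versions use a shorter header. The module name demo.py and its contents are hypothetical.

```python
import dis
import marshal
import os
import py_compile
import tempfile

# Assumption: CPython 3.7 or later, where a .pyc file starts with a
# 16-byte header (magic number, flags, and source mtime/size or hash).
HEADER_SIZE = 16

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "demo.py")  # hypothetical module
    with open(source_path, "w") as handle:
        handle.write("answer = 6 * 7\n")

    # Compile the source file to a .pyc next to it.
    pyc_path = py_compile.compile(source_path, cfile=source_path + "c")

    with open(pyc_path, "rb") as handle:
        header = handle.read(HEADER_SIZE)
        # Everything after the header is a marshalled code object.
        code = marshal.load(handle)

# The listing shows Python bytecode instructions, not x86 assembler.
dis.dis(code)
```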

Regards,


Björn
 

Jeroen Ruigrok van der Werven

-On [20080125 14:07] said:
I'm surprised you've not been flamed to death by now - last time I
happened to write a pretty similar thing, I got a couple of nut cases
accusing me of being a liar trying to spread FUD about Java vs Python
respective VMs inner working, and even some usually sensible regulars
jumping in to label my saying as "misleading"...

I think your attitude in responding did not help much, Bruno, if you want an
honest answer. And now you are using 'nut case'. What's with you using ad
hominems so readily?

Just an observation from the peanut gallery. :)
 

over

Please, tell me you're kidding...

hehe...which part am I kidding about? The explanation was for someone
who thought python scripts were translated directly by the processor.
I had no idea how much he knew, so I kept it basic (no pun intended).

Or...do you disagree with what I'm saying? You didn't say much. I have
already disassembled a pyc file as a binary file. Maybe I was using
the term assembler too broadly. A binary compiled from an assembler
source would look similar in parts to what I disassembled.

That's not the point, however. I'm trying to say that a processor
cannot read a Python script, and since the Python interpreter as
stored on disk is essentially an assembler file, any Python script
must sooner or later be converted to assembler form in order to be
read by its own interpreter. Whatever is typed in a Python script must
be converted to binary code.
 

over

On Jan 25, 11:10 pm, (e-mail address removed) wrote:
[...]

Gaah, is this what's going on?

ajaksu@Belkar:~$ cat error.txt
This is not assembler...

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

:/

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.
 

thebjorn

On Jan 25, 11:10 pm, (e-mail address removed) wrote: [...]

Gaah, is this what's going on?
ajaksu@Belkar:~$ cat error.txt
This is not assembler...
ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.

go troll somewhere else (you obviously don't know anything about
assembler and don't want to learn anything about Python).

-- bjorn
 

Steven D'Aprano

On Jan 25, 11:10 pm, (e-mail address removed) wrote:
[...]

Gaah, is this what's going on?

ajaksu@Belkar:~$ cat error.txt
This is not assembler...

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

:/

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is the
stack pointer, on which it is operating.


Deary deary me...

Have a close look again at the actual contents of the file:

$ cat error.txt
This is not assembler...


If you run the text "This is not assembler..." through a disassembler, it
will obediently disassemble the bytes "This is not assembler..." into a
bunch of assembler opcodes. Unfortunately, although the individual
opcodes are "assembly", the whole set of them together is nonsense.
You'll see that it is nonsense the moment you try to execute the supposed
assembly code.
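
A quick way to see this is to print the bytes of the text yourself. The sketch below assumes the file holds exactly the line shown plus a trailing newline; the hex values it prints are the same ones in the second column of the ndisasm listing.

```python
# The contents of error.txt as shown in the thread, including the
# trailing newline.
data = b"This is not assembler...\n"

# Hex values of the first bytes: 'T' is 0x54, "his" is 0x68 0x69 0x73.
first_byte = data[:1].hex().upper()
next_three = data[1:4].hex().upper()

print(first_byte)   # the byte ndisasm labelled "push sp"
print(next_three)   # the bytes ndisasm labelled "push word 0x7369"
print(len(data))    # the listing's last offset is one less than this
```

The "opcodes" are just the ASCII codes of the letters; the disassembler has no way to know the file was never meant to be executed.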

It would be a fascinating exercise to try to generate a set of bytes
which could be interpreted as both valid assembly code *and* valid
English text simultaneously. For interest, here you will find one quine
(program which prints its own source code) which is simultaneously valid
in C and TCL, and another which is valid in C and Lisp:

http://www.uwm.edu/~chruska/recursive/selfish.html
 

Bjoern Schliessmann

hehe...which part am I kidding about? The explanation was for
someone who thought python scripts were translated directly by the
processor.

Who might this have been? Surely not Tim.
I have already disassembled a pyc file as a binary file.

Have you? How's it look?
Maybe I was using the term assembler too broadly. A binary
compiled from an assembler source would look similar in parts to
what I disassembled.

What is this supposed to mean?
That's not the point, however. I'm trying to say that a processor
cannot read a Python script, and since the Python interpreter as
stored on disk is essentially an assembler file,

It isn't; it's an executable.
any Python script must sooner or later be converted to
assembler form in order to be read by its own interpreter.

This "assembler form" is commonly referred to as "Python byte code".
Whatever is typed in a Python script must be converted to binary
code.

That, however, is true, though blurred.

Regards,


Björn
 

over

Intel processors can only process machine language[...] There's no
way for a processor to understand any higher level language, even
assembler, since it is written with hexadecimal codes and basic
instructions like MOV, JMP, etc. The assembler compiler can
convert an assembler file to a binary executable, which the
processor can understand.

This may be true, but I think it's not bad to assume that machine
language and assembler are "almost the same" in this context, since
the translation between them is non-ambiguous (It's
just "recoding"; this is not the case with HLLs).

I have no problem with your explanation. It's nearly impossible to
program in machine code, which is all 1's and 0's. Assembler makes it
infinitely easier by converting the machine 1's and 0's to their
hexadecimal equivalent and assigning an opcode name to them, like
PUSH, MOV, CALL, etc.

Still, the older machine-programmable processors used switches to set
the 1's and 0's. Or, the machine code was fed in on perforated cards
or tapes that were read. The computer read the switches, cards or
tapes, and set voltages according to what it scanned.

the difference is that machine code can be read directly, whereas
assembler has to be compiled in order to convert the opcodes to binary
data.
(Not really -- object code files are composed of header data and
different segments, data and code, and only the code segments are
really meaningful to the processor.)

I agree that the code segments, and the data, are all that's
meaningful to the processor. There are a few others, like interrupts
that affect the processor directly.

I understand what you're saying but I'm referring to an executable file
ready to be loaded into memory. It's stored on disk in a series of 1's
and 0's. As you say, there are also control codes on disk to separate
each byte along with CRC codes, timing codes, etc. However, that is
all stripped off by the hard drive electronics.

The actual file on disk is in a certain format that only the operating
system understands. But once the code is read in, it goes into memory
locations which hold individual arrays of bits. Each memory location
holds a precise number of bits corresponding to the particular code it
represents. For example, the ret instruction you mention below is
represented by hex C3 (0xC3), which represents the bits 11000011.

That's a machine code, since starting at 00000000 to 11111111, you
have 256 different codes available. When those 1's and 0's are
converted to voltages, the computer can analyze them and set circuits
in action which will bring about the desired operation. Since Linux is
written in C, it must convert down to machine code, just as Windows
must.
But you _do_ know that pyc files are Python byte code, and that you
could only disassemble them directly to Python byte code?

that's the part I did not understand, so thanks for pointing that out.
What I disassembled did not make sense. I was looking for assembler
code, but I do understand a little bit about how the interpreter reads
them.

For example, from os.py, here's part of the script:

# Note: more names are added to __all__ later.
__all__ = ["altsep", "curdir", "pardir", "sep", "pathsep", "linesep",
"defpath", "name", "path", "devnull"]

here's the disassembly from os.pyc:

00000C04 06 00 00 00 dd 6
00000C08 61 6C 74 73 65 70 74 db 'altsept'
00000C0F 06 00 00 00 dd 6
00000C13 63 75 72 64 69 72 74 db 'curdirt'
00000C1A 06 00 00 00 dd 6
00000C1E 70 61 72 64 69 72 74 db 'pardirt'
00000C25 03 00 00 00 dd 3
00000C29 73 65 70 db 'sep'
00000C2C 74 07 00 00 dd 774h
00000C30 00 db 0
00000C31 70 61 74 68 73 65 70 db 'pathsep'
00000C38 74 07 00 00 dd 774h
00000C3C 00 db 0
00000C3D 6C 69 6E 65 73 65 70 db 'linesep'
00000C44 74 07 00 00 dd 774h
00000C48 00 db 0
00000C49 64 65 66 70 61 74 68 db 'defpath'
00000C50 74 04 00 00 dd offset unk_474
00000C54 00 db 0
00000C55 6E 61 6D 65 db 'name'
00000C59 74 04 00 00 dd offset unk_474
00000C5D 00 db 0
00000C5E 70 61 74 68 db 'path'
00000C62 74 07 00 00 dd 774h
00000C66 00 db 0
00000C67 64 65 76 6E 75 6C 6C db 'devnull'

you can see all the ASCII names in the disassembly like altsep,
curdir, etc. I'm not clear as to why they are all terminated with 0x74
= t, or if that's my poor interpretation. Some ASCII strings don't use
a 0 terminator. The point is that all the ASCII strings have numbers
between them which mean something to the interpreter. Also, they are
at a particular address. The interpreter has to know where to find
them.
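
One possible reading of the 0x74 bytes, offered as an assumption rather than something established in the thread: in the marshal format used by Python 2 pyc files, each interned string is stored as a type byte 't' (0x74), a 4-byte little-endian length, and then the characters. Under that reading the 't' belongs to the start of the next string, not the end of the previous one. The sketch below applies that layout to bytes copied from the dump; read_strings is a hypothetical helper, not part of any library.

```python
import struct

# Bytes copied from the dump above, starting at the 0x74 at offset 0xC0E.
dump = bytes.fromhex(
    "74 06000000 637572646972"    # 't', length 6, "curdir"
    "74 06000000 706172646972"    # 't', length 6, "pardir"
    "74 03000000 736570"          # 't', length 3, "sep"
    "74 07000000 70617468736570"  # 't', length 7, "pathsep"
)

def read_strings(blob):
    """Split blob into records of: 1 type byte, 4-byte length, payload."""
    records = []
    offset = 0
    while offset < len(blob):
        type_code = blob[offset:offset + 1].decode("ascii")
        # "<I" is an unsigned 32-bit little-endian integer.
        (length,) = struct.unpack_from("<I", blob, offset + 1)
        start = offset + 5
        text = blob[start:start + length].decode("ascii")
        records.append((type_code, text))
        offset = start + length
    return records

records = read_strings(dump)
print(records)
```

Read this way, the numbers between the names are lengths, and a generic x86 disassembler simply grouped the bytes at the wrong boundaries.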

The script is essentially gone. I'd like to know how to read the pyc
files, but that's getting away from my point that there is a link
between python scripts and assembler. At this point, I admit the code
above is NOT assembler, but sooner or later it will be converted to
machine code by the interpreter and the OS and that can be
disassembled as assembler.

I realize this is a complicated process and I can understand people
thinking I'm full of beans. Python needs an OS like Windows or Linux
to interface it to the processor. And all a processor can understand
is machine code.
No, assembly language source is readable text like this (gcc):

.LCFI4:
movl $0, %eax
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret

Yes, the source is readable like that, but the compiled binary is not.
A disassembly shows both the source and the opcodes. The ret
statement above is a mnemonic for hex C3 in assembler. You have left
out the opcodes. Here's another example of assembler which is
disassembled from python.exe:

1D001250 FF 74 24 04 push [esp+arg_0]
1D001254 E8 D1 FF FF FF call 1D00122A
1D001259 F7 D8 neg eax
1D00125B 1B C0 sbb eax, eax
1D00125D F7 D8 neg eax
1D00125F 59 pop ecx
1D001260 48 dec eax
1D001261 C3 retn

The first column is obviously the address in memory. The second column
holds the opcodes, and the third column holds the mnemonics, English words
attached to the codes to give them meaning. The second and third
columns mean the same thing.

Single-opcode instructions like 59 = pop ecx and 48 = dec eax are
self-explanatory. 59 is hexadecimal for binary 01011001, which is a
binary code. When a processor receives that binary as voltages, it is
wired to pop the top of the stack into the ecx register.
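
The hex-to-binary correspondence is easy to check. A small sketch, using only the opcode bytes and mnemonics already quoted in this thread:

```python
# Opcode bytes quoted in the thread, with the mnemonics given there.
opcodes = {0x59: "pop ecx", 0x48: "dec eax", 0xC3: "ret"}

# format(value, "08b") renders a byte as eight binary digits.
bit_patterns = {
    mnemonic: format(byte, "08b") for byte, mnemonic in opcodes.items()
}

for mnemonic, bits in bit_patterns.items():
    print(f"{mnemonic}: {bits}")
```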

The second instruction, call 1D00122A, is not as straightforward. It
is made up of two parts: E8 = the opcode for CALL, and the rest, 'D1 FF
FF FF', is the operand, or the data which the call is
referencing. In this case it encodes the address in memory of the
next instruction being called. It is written backward, however, which
is the convention on x86 (little-endian byte order). D1 FF FF FF
actually means FF FF FF D1.

The leading F's mark the number as negative (two's complement), telling
the processor to jump back. The signed number FFFFFFD1 = -2F. A call
counts from the address of the next instruction, which is 1D001259, and
1D001259 - 2F = 1D00122A, the address being called.
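
The arithmetic can be checked mechanically. The sketch below assumes the standard x86 rule for the E8 form of CALL: the four bytes after the opcode are a signed 32-bit little-endian displacement, added to the address of the instruction that follows the call. The address and bytes are taken from the listing above; the variable names are illustrative.

```python
import struct

call_address = 0x1D001254                 # address of the call in the listing
instruction = bytes.fromhex("E8D1FFFFFF")  # opcode E8 plus four operand bytes

opcode = instruction[0]

# "<i" decodes a signed 32-bit little-endian integer.
(displacement,) = struct.unpack("<i", instruction[1:])

# The displacement is relative to the instruction after the call.
next_instruction = call_address + len(instruction)
target = next_instruction + displacement

print(hex(displacement))
print(hex(next_instruction))
print(hex(target))
```

Counting from 1D001259 with a displacement of -0x2F lands on 1D00122A, the target shown in the disassembly.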

As you can see, it's all done with binary codes. The English
statements are purely for the convenience of the programmer. If you
look at the Intel definitions for assembler instructions, they list both
the opcodes and the mnemonics.

I would agree with what you said earlier, that there is a similarity
between machine code and assembler. You can actually write in machine
code, but it is often entered in hexadecimal, requiring a hex to
binary interpreter. In that case, the similarity to compiled assembler
is quite close.


Machine language is binary codes, yes.
if I knew what the intervening numbers meant I could. :)


By definition, you can read every file in a hex editor ...


Not at all. Again: It's Python byte code. Try experimenting with
pdb.

I will eventually...thanks for reply.
 

over

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.

go troll somewhere else (you obviously don't know anything about
assembler and don't want to learn anything about Python).

-- bjorn


Before you start mouthing off, maybe you should learn assembler. If
you're really serious, go to the Intel site and get it from the horse's
mouth. The Intel manual on assembler lists the mnemonics as well as
the opcodes for each instruction. It's not called the Intel Machine
Code and Assembler Language Manual. It's the bible on assembly
language, written by Intel.

If you're not so serious, here's a URL explaining it, along with an
excerpt from the article:

http://en.wikipedia.org/wiki/X86_assembly_language

Each x86 assembly instruction is represented by a mnemonic, which in
turn directly translates to a series of bytes which represent that
instruction, called an opcode. For example, the NOP instruction
translates to 0x90 and the HLT instruction translates to 0xF4. Some
opcodes have no mnemonics named after them and are undocumented.
However processors in the x86-family may interpret undocumented
opcodes differently and hence might render a program useless. In some
cases, invalid opcodes also generate processor exceptions.

As far as this line from your code above:

00000001 686973 push word 0x7369

68 of 686973 is the opcode for PUSH. Go on, look it up. The 6973 is
obviously the word address, 0x7369. Or, do you think that's
coincidence?

Don't fucking tell me about assembler, you asshole. I can read
disassembled code in my sleep.
 

over

It isn't; it's an executable.

I appreciated the intelligent response I received from you earlier,
now we're splitting hairs. :) Assembler, like any other higher
level language is written as a source file and is compiled to a
binary. An executable is one form of a binary, as is a dll. When you
view the disassembly of a binary, there is a distinct difference
between C, C++, Delphi, Visual Basic, DOS, or even between the
different file types like PE, NE, MZ, etc. But they all decompile to
assembler.

While they are in the binary format, they are exactly that...binary.
Who would want to interpret a long string of 1's and 0's? Binaries are
not stored in hexadecimal on disk nor are they in hexadecimal in
memory. But, all the 1's and 0's are in codes when they are
instructions or ASCII strings. No other high level language has the
one to one relationship that assembler has to machine code, the actual
language of the computer.

Disassemblers can easily convert a binary to assembler due to the one
to one relationship between them. That can't be said for any other
higher level language. Converting back to C or Python would be a
nightmare, although it's becoming a reality. Converting a compiled
binary back to hexadecimal is basically a matter of converting the
binary to hexadecimal, as in a hex editor. There are exceptions to
that, of course, especially with compound assembler statements that
use extensions to differentiate between registers.

This "assembler form" is commonly referred to as "Python byte code".
Thanks for pointing that out. It led me to this page:

http://docs.python.org/lib/module-dis.html

where it is explained that the opcodes are in Include/opcode.h. I'll
take a look at that.
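
The same opcode table can also be inspected from Python itself. A minimal sketch, assuming CPython: dis.opmap and dis.opname are documented attributes of the dis module, and the numbers they report are specific to the interpreter version that runs the code.

```python
import dis

# dis.opmap maps opcode names to numbers; dis.opname is the reverse table.
# The numbers are specific to the running CPython version.
name, number = next(iter(dis.opmap.items()))

print(name, number)
print(len(dis.opmap), "opcodes defined in this interpreter")
```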

The light goes on. From opcode.h:

#define PRINT_NEWLINE_TO 74

All the ASCII strings end with 0x74 in the disassembly. I have noted
that Python uses a newline as a line feed/carriage return. Now I'm
getting it. It could all be disassembled with a hex editor, but a
disassembler is better for getting things in order.

OK. So the pyc files use those defs...that's cool.
 

Marc 'BlackJack' Rintsch

Don't fucking tell me about assembler, you asshole. I can read
disassembled code in my sleep.

Yes you can read it, but obviously you don't understand it.

Ciao,
Marc 'BlackJack' Rintsch
 

Marc 'BlackJack' Rintsch

On Sat, 26 Jan 2008 14:47:50 +0100, Bjoern Schliessmann

The script is essentially gone. I'd like to know how to read the pyc
files, but that's getting away from my point that there is a link
between python scripts and assembler. At this point, I admit the code
above is NOT assembler, but sooner or later it will be converted to
machine code by the interpreter and the OS and that can be
disassembled as assembler.

No it will not be converted to assembler. The byte code is *interpreted*
by Python, not compiled to assembler. If you want to know how this
happens get the C source code of the interpreter and don't waste your time
with disassembling `python.exe`. C is much easier to read and there are
useful comments too.
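
What "interpreted" means here can be sketched with a toy stack machine. This is not CPython's evaluation loop and the four opcodes are invented for the example; it only illustrates that an interpreter reads each instruction and carries it out itself, without ever producing assembler for the program.

```python
# Toy instruction set, invented for this sketch.
PUSH, ADD, MUL, HALT = range(4)

def run(program):
    """Interpret a list of (opcode, argument) pairs on a small stack machine."""
    stack = []
    for opcode, argument in program:
        if opcode == PUSH:
            stack.append(argument)
        elif opcode == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return stack[-1]

# Byte code for (2 + 3) * 4 on the toy machine.
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (HALT, None)]
result = run(program)
print(result)
```

The only machine code involved is that of the interpreter, which was compiled once, ahead of time; the program stays data the whole way through.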

Ciao,
Marc 'BlackJack' Rintsch
 

Steven D'Aprano

I can understand people thinking I'm full of beans.


Oh no, not full of beans. Full of something, but not beans.

Everything you have written about assembly, machine code, compilers,
Linux, Python and so forth has been a confused mish-mash of half-truths,
distortions, vaguely correct factoids and complete nonsense.

I'm starting to wonder if it is possible for somebody to be
simultaneously so self-assured and so ignorant, or if we're being trolled.
 

Marc 'BlackJack' Rintsch

Oh no, not full of beans. Full of something, but not beans.

Everything you have written about assembly, machine code, compilers,
Linux, Python and so forth has been a confused mish-mash of half-truths,
distortions, vaguely correct factoids and complete nonsense.

I'm starting to wonder if it is possible for somebody to be
simultaneously so self-assured and so ignorant, or if we're being trolled.

I recently learned that this is called the Dunning-Kruger effect:

The Dunning-Kruger effect is the phenomenon wherein people who have
little knowledge think that they know more than others who have much
more knowledge.

[…]

The phenomenon was demonstrated in a series of experiments performed by
Justin Kruger and David Dunning, then both of Cornell University. Their
results were published in the Journal of Personality and Social
Psychology in December 1999.

http://en.wikipedia.org/wiki/Dunning-Kruger_effect

See, there's almost always a rational explanation. ;-)

Ciao,
Marc 'BlackJack' Rintsch
 
