Translating Python to Assembler

Grant Edwards

I'm starting to wonder if it is possible for somebody to be
simultaneously so self-assured and so ignorant, or if we're
being trolled.

I recently learned that this is called the Dunning-Kruger effect:

The Dunning-Kruger effect is the phenomenon wherein people who have
little knowledge think that they know more than others who have much
more knowledge.

[?]

The phenomenon was demonstrated in a series of experiments performed by
Justin Kruger and David Dunning, then both of Cornell University. Their
results were published in the Journal of Personality and Social
Psychology in December 1999.

I remember reading that paper about a year ago and it sure
seemed to explain the behavior of a number of people I've known.
Not only is it possible to be simultaneously self-assured and
ignorant, that appears to be the normal way that the human mind
works.

.... must resist ... urge... to mention... Bush...

Damn.
 
Wildemar Wildenburger

Grant said:
The Dunning-Kruger effect is the phenomenon wherein people who have
little knowledge think that they know more than others who have much
more knowledge.
[snip]
[snip as well]
... must resist ... urge... to mention... Bush...
Well, I think that G.W. Bush knows perfectly well that he is not really
up to the task. I still suspect that it never really was his decision to
become president, if you follow me.

/W
(What do I care, he's not my president after all ... although, in a way
.... YYAAAARRRGGGGHHHH!)
 
John Machin

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A
:/
not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.
go troll somewhere else (you obviously don't know anything about
assembler and don't want to learn anything about Python).

before you start mouthing off, maybe you should learn assembler. If
you're really serious, go to the Intel site and get it from the horses
mouth. The Intel manual on assembler lists the mneumonics as well as
the opcodes for each instruction. It's not called the Intel Machine
Code and Assembler Language Manual. It's the bible on assembly
language, written by Intel.

If you're not so serious, here's a URL explaining it, along with an
excerpt from the article:

http://en.wikipedia.org/wiki/X86_assembly_language

Each x86 assembly instruction is represented by a mnemonic, which in
turn directly translates to a series of bytes which represent that
instruction, called an opcode. For example, the NOP instruction
translates to 0x90 and the HLT instruction translates to 0xF4. Some
opcodes have no mnemonics named after them and are undocumented.
However processors in the x86-family may interpret undocumented
opcodes differently and hence might render a program useless. In some
cases, invalid opcodes also generate processor exceptions.

As far as this line from your code above:

00000001 686973 push word 0x7369

68 of 686973 is the opcode for PUSH. Go on, look it up. The 6973 is
obviously the word address, 0x7369. Or, do you think that's
coincidence?

Don't fucking tell me about assembler, you asshole. I can read
disassembled code in my sleep.

What was originally posted was:
"""
ajaksu@Belkar:~$ cat error.txt
This is not assembler...

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
[snip]
"""

Read it again -- he's "disassembled" the text "This is not
assembler..."

54 -> "T"
686973 -> "his"
206973 -> " is"

but you say "68 of 686973 is the opcode for PUSH. Go on, look it up.
The 6973 is obviously the word address, 0x7369. Or, do you think
that's coincidence?"

You are a genius of a kind encountered only very rarely. Care to share
with us your decryption of the Voynich manuscript?
 
ajaksu

This message got huge :/

Sorry for being so cryptic and unhelpful. I now believe that you're
laboring under a (quite deep) misunderstanding and wish to make things
clear for both of us :)

On Jan 25, 11:10 pm, (e-mail address removed) wrote: [...]

Gaah, is this what's going on?
ajaksu@Belkar:~$ cat error.txt
This is not assembler...
ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.

What I did above was:
1- create a file called "error.txt" that contains the string "This is
not assembler..."
2- show the contents of the file ("cat" being the relevant command)
3- run the NetWideDisassembler (ndisasm) on error.txt
4- watch as it "disassembled" the text file (in fact, "assembling" the
code above reconstructs part of the string!)
5- conclude that you were misled by this behavior of
disassemblers, for AFAIK .pyc files contain Python
"opcodes" (bytecode), that in no way I can think of could be parsed by
a generic disassembler
6- form a belief that you were trying to understand meaningless
"assembler" like the above (that would have no bearing on what Python
does!)
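
The round trip mentioned in step 4 can be checked without an assembler. A minimal sketch in Python, assuming only the hex column of the ndisasm listing above (the variable names are illustrative, not from the original post): joining those byte groups and decoding them as ASCII gives back the text of error.txt.

```python
# Hex column of the ndisasm listing, one entry per "instruction".
hex_column = [
    "54", "686973", "206973", "206E6F", "7420", "61",
    "7373", "656D", "626C65", "722E", "2E", "2E", "0A",
]

# Join the byte groups and decode them as ASCII text.
raw = bytes.fromhex("".join(hex_column))
text = raw.decode("ascii")

print(repr(text))
```

The disassembler never changed the bytes; it only attached mnemonics to whatever byte values it was given.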

Now, it seems that we're in flaming mode and that is unfortunate,
because I do believe in your expertise. In part, because my father was
a systems analyst for IBM mainframes and knows (a huge) lot about
informatics. However, I've seen him, due to simple misunderstandings
like this, building a complex scenario to explain his troubles with
MSWord. I believe this is what's happening here, so I suggest that we
take a step back and stop calling names.

Given that you're in the uncomfortable place of the "troll assigned by
votes" outsider in this issue, let me expose some relevant data. The
people you're pissed off with (and vice-versa) are very competent and
knowledgeable Python (and other languages) programmers, very kind to
newcomers and notably helpful (as you might find out lurking in this
newsgroup or reading the archives). They spend time and energy helping
people to solve problems and understand the language. Seriously, they
know about assembler (a lot more than I do) and how Python works. And
they know and respect each other.

Now, your attitude and faith in your own assumptions (in particular,
"the .pyc contains assembler") were both rude and upsetting.
This doesn't mean that you're not an assembler expert (I believe you
are). But it seemed like you were trying to teach us how Python works,
and that was considered offensive, especially given your words.

OTOH, my responses were cryptic and unhelpful, and they smell of "mob
thinking", while Steven D'Aprano and others showed a lot more
patience and willingness to help. So please forgive me and please PAY
ATTENTION to those trying to HELP and make things clearer to you.

As a simple example of my own Dunning-Kruger effect, I was sure I'd
get errors on trying to "assemble" the output of the disassembling,
but it does roundtrip part of the string and I was baffled. I'd guess
you know why; I have no idea. The 0x74 finding was also curious: you
are indeed getting part of the binary format of bytecode, but (AFAICT)
you won't find real assembler there.

In summary, you can show us what you know and put your knowledge
(instead of what you got wrong and how you upset people) in focus. Try
to set things right. Believe me, this here community packs an uncommon
amount of greatness and openness.

HTH,
Daniel
 
Bjoern Schliessmann

I have no problem with your explanation. It's nearly impossible to
program in machine code, which is all 1's and 0's.

Not really; it's "voltage" or "no voltage" at different signal lines
in the processor. The dual system is just one representation you
could choose. More common (and practical) are hexadecimal or octal.
the difference is that machine code can be read directly, whereas
assembler has to be compiled in order to convert the opcodes to
binary data.

As I said before, IMHO this "compilation" is trivial compared to HLL
compilation, since it's just a translation from opcodes to numbers
and labels to addresses, respectively.
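
To illustrate how mechanical that translation is, here is a toy sketch in Python. It assumes only the four single-byte IA-32 encodings that appear elsewhere in this thread; the names `OPCODES` and `assemble` are made up for the example, and a real assembler also has to handle operands, addressing modes and labels.

```python
# Single-byte IA-32 encodings taken from this thread: NOP and HLT from the
# quoted Wikipedia excerpt, POP ECX and DEC EAX from the quoted disassembly.
OPCODES = {
    "nop": 0x90,
    "hlt": 0xF4,
    "pop ecx": 0x59,
    "dec eax": 0x48,
}

def assemble(lines):
    """Translate each mnemonic line to its one-byte encoding (toy example)."""
    return bytes(OPCODES[line.strip().lower()] for line in lines)

machine_code = assemble(["pop ecx", "dec eax", "nop", "hlt"])
print(machine_code.hex())
```

Each line of source maps to a fixed byte value through a table lookup, with no decisions about control flow or optimisation involved.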

HLL compilers do much more; they translate high-level control
structures to low-level implementation (which is ambiguous). Often,
optimisation is employed, which may, e. g., cause a loop to be
unrolled (so that it vanishes in the assembly).
I agree that the code segments, and the data, are all that's
meaningful to the processor. There are a few others, like
interrupts that affect the processor directly.

Interrupts and segments are orthogonal, don't you think?
I understand what you're saying but I'm referring to an executable
file ready to be loaded into memory.

Obviously not, since I was referring to such a file, too. Try
reading about "real" executable formats like ELF.
It's stored on disk in a series of 1's and 0's.

No, it's stored using a complex chain of magnetic fields. You _can_
interpret it as dual numbers, yes. But it's impractical and the
choice is up to the viewer.
The actual file on disk is in a certain format that only the
operating system understands. But once the code is read in, it
goes into memory locations which hold individual arrays of bits.

I agree. (Before, you wrote differently:
Both Linux and Windows compile down to binary files, which are
essentially 1's and 0's arranged in codes that are meaningful to
the processor.

E. g. the ELF header and data segments mean nothing of sense to the
processor itself.)
That's a machine code, since starting at 00000000 to 11111111, you
have 256 different codes available.

I'm afraid it's not that simple. IA-32 opcodes, for example, are
complex bit sequences and don't always have the same length.
Primary opcodes consist of up to three bytes in this architecture.

With some RISC CPUs, there is a machine instruction length
limitation of e. g. one word. But the IA-32 doesn't have this
limitation.
But you _do_ know that pyc files are Python byte code, and you
could only directly disassemble them to Python byte code
directly?

that's the part I did not understand, so thanks for pointing that
out. What I disassembled did not make sense. I was looking for
assembler code, but I do understand a little bit about how the
interpreter reads them.

For example, from os.py, here's part of the script:

# Note: more names are added to __all__ later.
__all__ = ["altsep", "curdir", "pardir", "sep", "pathsep",
"linesep",
"defpath", "name", "path", "devnull"]

here's the disassembly from os.pyc:

.... which is completely pointless because this is no IA-32 code
segment which the processor could execute, but a custom data file
format. I'd rather try this, for example:
...     i += 1
...     return argument
...
  2           0 LOAD_FAST                0 (i)
              3 LOAD_CONST               1 (1)
              6 INPLACE_ADD
              7 STORE_FAST               0 (i)

  3          10 LOAD_GLOBAL              0 (argument)
             13 RETURN_VALUE

The Python VM, though, is stack-based, not register-based as most
CPUs. That's why the opcodes are quite different.
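
A listing like the one above can be produced with the standard `dis` module. A minimal sketch, assuming a function body matching the two visible source lines (the function name `bump` is hypothetical, since the original `def` line did not survive in this thread). Exact opcode names and offsets differ between CPython versions, so the output will not match the listing above byte for byte.

```python
import dis

def bump(i):
    # 'argument' is deliberately a global name; the function is only
    # disassembled here, never called.
    i += 1
    return argument

# Print the bytecode listing for the function.
dis.dis(bump)

# The same information is available programmatically.
opnames = [ins.opname for ins in dis.get_instructions(bump)]
```

The instructions push and pop values on the VM's stack rather than naming registers, which is the difference from CPU opcodes described above.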
The script is essentially gone. I'd like to know how to read the
pyc files, but that's getting away from my point that there is a
link between python scripts and assembler. At this point, I admit
the code above is NOT assembler, but sooner or later it will be
converted to machine code by the interpreter and the OS and that
can be disassembled as assembler.

Yes. But the interpreter doesn't convert the entire file to machine
language. It reads one instruction after another and, amongst other
things, outputs corresponding machine code which "does" what's
intended by the byte code instruction.
Python needs an OS like Windows or Linux to interface it to the
processor.

Not really. The CPython executable contains machine code directly
executable by the host processor. The OS just

* provides routines for accessing peripherals and allocating memory,
* makes it possible for multiple programs to run side by side,
* and loads the executable and sets it up in memory for execution.
Yes, the source is readable like that, but the compiled binary is
not.

For a machine, it is. The translation is 1:1, trivial.
A disassembly shows both the source and the opcodes.

The output I posted was directly from the GNU C compiler (compiled
from an empty "main" function). I got it by using a parameter that
tells the compiler to leave out the last step of generating machine
code from assembly, and save the source.

A "disassembly" is the other way round. The hexadecimal
representation of the source in the leftmost columns is completely
redundant and practically irrelevant for a human being.
The second column are opcodes,

Not only. It's machine code instructions, i. e. opcodes and
operands.
and the third column are mneumonics, English words attached to the
codes to give them meaning.

They're mn_e_monics, and they're not really english (what kind of
english words would RET, JLE or CMP be?).
The second and third column mean the same thing.

Not at all! They're the operands and can be memory addresses,
registers or fixed values.
A single opcode instruction like 59 = pop ecx and 48 = dec eax,
are self-explanatory.

It's a machine instruction which consists of the opcode POP and the
operand ECX.
The second instruction, call 1D00122A is not as straight forward.
it is made up of two parts: E8 = the opcode for CALL and the rest
'D1 FF FF FF' is the opcode operator

I'm afraid not -- it's the operand.
I would agree with what you said earlier, that there is a
similarity between machine code and assembler.

Is there, actually? :)
You can actually write in machine code, but it is often entered in
hexadecimal, requiring a hex to binary interpreter.

IMHO, this makes no sense. For example, the memory contents
represented by binary 10000 and 0x10 are exactly the same. Thus, it
doesn't matter at all how you enter or view it, and it's completely
up to the user. The CPU understands both *exactly* the same way,
since they are the same: voltage levels at signal lines.
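
That notation does not change the stored value can be checked directly. A small Python sketch (the variable names are illustrative): the literals `0b10000` and `0x10` both denote sixteen, and a byte built from either is the same byte.

```python
# The same integer written in three notations.
as_binary = 0b10000
as_hex = 0x10
as_decimal = 16

print(as_binary == as_hex == as_decimal)   # → True

# One byte with that value is identical however it was typed in.
print(bytes([as_binary]) == bytes([as_hex]))
```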
if I knew what the intervening numbers meant I could. :)

(*You* wrote the above. Please don't drop quoting headers if you
quote this deep.)

Regards,


Björn
 
Bjoern Schliessmann

I appreciated the intelligent response I received from you
earlier, now we're splitting hairs. :)

Not at all. Assembly source is ASCII text. An executable commonly
consists of a binary header (which contains various information
=> man elf) as well as code and data segments. Normally, you're only
guaranteed to find machine language inside the code segments.
Assembler, like any other higher level language

Assembler is _no_ high level language, though there are some
assembly languages striving for resembling HLLs.

http://webster.cs.ucr.edu/AsmTools/HLA/index.html
is written as a source file and is compiled to a binary.

BMPs are binaries, too. Assembly code is compiled to object code
files.
An executable is one form of a binary, as is a dll. When you view
the disassembly of a binary, there is a distinct difference
between C, C++, Delphi, Visual Basic, DOS,

I don't think so. How a HLL source is translated to machine code
depends on the compiler, and there are cross compilers.
or even between the different file types like PE, NE, MZ, etc.
Yes.

But they all decompile to assembler.

No. They all _contain_ code segments (which contain machine code),
but also different data.
While they are in the binary format, they are exactly
that...binary.
http://en.wikipedia.org/wiki/Binary_data

Who would want to interpret a long string of 1's and 0's. Binaries
are not stored in hexadecimal on disk nor are they in hexadecimal
in memory. But, all the 1's and 0's are in codes when they are
instructions or ASCII strings.

No -- they're voltages or magnetic fields. (I never saw "0"s or "1"s
in a memory chip or on a hard disk.) The representation of this
data is up to the viewing human being to choose.
No other high level language has the one to one relationship that
assembler has to machine code, the actual language of the
computer.

Yes. That's why Assembly language is not "high level", but "low
level".
All the ASCII strings end with 0x74 in the disassembly.
*sigh*

I have noted that Python uses a newline as a line feed/carriage
return.

(The means of line separation is not chosen just like this by Python
users. It's convention depending on the OS and the application.)
Now I'm getting it. It could all be disassembled with a hex
editor, but a disassembler is better for getting things in order.

Argl. A hex editor just displays a binary file as hexadecimal
numbers, it does _not_ disassemble.

"Disassembly" refers to _interpreting_ a file as machine
instructions of one particular architecture. This, of course, only
makes sense if this binary file actually contains machine
instructions that make sense, not if it's really a picture or a
sound file.
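
To make the distinction concrete, here is roughly all that a hex view does. A minimal Python sketch, assuming a made-up helper named `hexdump` (not a standard library function): it prints offsets and byte values and attaches no meaning to them.

```python
def hexdump(data, width=8):
    """Return a plain hex listing of data, 'width' bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        rows.append(f"{offset:08X}  {hex_part}")
    return "\n".join(rows)

sample = b"This is not assembler...\n"
print(hexdump(sample))
```

A disassembler goes one step further and maps those byte values onto the instruction set of one particular architecture, whether or not the bytes were ever meant as instructions.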

Regards,


Björn
 
Bruno Desthuilliers

Paul Boddie a écrit :
Well, it is important to make distinctions when people are wondering,
"If Python is 'so slow' and yet everyone tells me that the way it is
executed is 'just like Java', where does the difference in performance
come from?" Your responses seemed to focus more on waving that issue
away and leaving the whole topic in the realm of mystery. The result:
"Python is just like Java apparently, but it's slower and I don't know
why."

I'm afraid you didn't read the whole post :

"""
So while CPython may possibly be too slow for your application (it can
indeed be somewhat slow for some tasks), the reasons are elsewhere
(hint: how can a compiler safely optimize anything in a language so
dynamic that even the class of an object can be changed at runtime?)."""

I may agree this might not have been stated explicitly enough, but this
was about JIT optimizing compilers. Also, a couple of posts later (FWIW,
to answer the OP's "how come it's slower if it's similar to Java"
question):

"""
Java's declarative static typing allows aggressive just-in-time
optimizations - which is not the case in Python due to its highly
dynamic nature.
"""
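
The "class of an object can be changed at runtime" remark is easy to demonstrate. A minimal Python sketch with two made-up classes (`Duck` and `Robot` are illustrative names only): after the reassignment, the same object resolves `speak` differently, which is the kind of thing a compiler cannot rule out ahead of time.

```python
class Duck:
    def speak(self):
        return "quack"

class Robot:
    def speak(self):
        return "beep"

thing = Duck()
before = thing.speak()

# Rebind the instance's class at runtime; method lookup follows the new class.
thing.__class__ = Robot
after = thing.speak()

print(before, after)
```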
 
Bruno Desthuilliers

Jeroen Ruigrok van der Werven a écrit :
-On [20080125 14:07] said:
I'm surprised you've not been flamed to death by now - last time I
happened to write a pretty similar thing, I got a couple of nut cases
accusing me of being a liar trying to spread FUD about the inner workings
of the Java vs Python VMs, and even some usually sensible regulars
jumping in to label my statement as "misleading"...

I think your attitude in responding did not help much, Bruno, if you want an
honest answer.

Possibly, yes. Note that being personally insulted for stating
something both technically correct *and* (as is the case here) commonly
stated here doesn't help either.
 
Grant Edwards

The script is essentially gone. I'd like to know how to read
No it won't. In any of the "normal" implementations, bytecodes
are not converted to machine code by the interpreter. Rather,
the interpreter simulates a machine that runs the byte codes.

No it can't. The result of feeding bytecodes to the VM isn't
output of machine code. It's changes in state of _data_
structures that are independent of the processor's instruction
set.
Yes. But the interpreter doesn't convert the entire file to machine
language. It reads one instruction after another and, amongst other
things, outputs corresponding machine code which "does" what's
intended by the byte code instruction.

No, it doesn't output corresponding machine code (that's what
some Java JIT implementations do, but I'm not aware of any
Python implementations that do that). The virtual machine
interpreter just does the action specified by the bytecode.
 
Dennis Lee Bieber

On Jan 25, 11:10 pm, (e-mail address removed) wrote:
[...]

Gaah, is this what's going on?

ajaksu@Belkar:~$ cat error.txt
This is not assembler...

ajaksu@Belkar:~$ ndisasm error.txt
00000000 54 push sp
00000001 686973 push word 0x7369
00000004 206973 and [bx+di+0x73],ch
00000007 206E6F and [bp+0x6f],ch
0000000A 7420 jz 0x2c
0000000C 61 popa
0000000D 7373 jnc 0x82
0000000F 656D gs insw
00000011 626C65 bound bp,[si+0x65]
00000014 722E jc 0x44
00000016 2E db 0x2E
00000017 2E db 0x2E
00000018 0A db 0x0A

:/

not sure what you're saying. Sure looks like assembler to me. Take the
'54 push sp'. The 54 is an assembler opcode for push and the sp is
the stack pointer, on which it is operating.

Emphasis is "LOOKS LIKE"... Look at the above command sequences --
That's just a TEXT string that was fed to a disassembler. It is NOT
executable machine instructions.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
Bjoern Schliessmann

Grant said:
No, it doesn't output corresponding machine code (that's what
some Java JIT implementations do, but I'm not aware of any
Python implementations that do that). The virtual machine
interpreter just does the action specified by the bytecode.

By "outputs corresponding machine code" I meant "feeds corresponding
machine code to the CPU" to make the analogy clearer. Which can
mean a function call.

Regards,


Björn
 
Grant Edwards

By "outputs corresponding machine code" I meant "feeds corresponding
machine code to the CPU" to make the analogy clearer. Which can
mean a function call.

OK, but I think you're reaching a little. :) It's pretty hard
for me to think of a program as something that's "feeding
machine code to the CPU".

In my mind, the VM is a program that's reading data from one
source (the bytecode files) and performing operations on a
second set of data (in-memory structures representing Python
objects) based on what is found in that first set of data.
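
That view of the VM can be sketched as a toy interpreter loop in Python. This is an illustration only, not CPython's actual evaluation loop (which is written in C); the opcode names are borrowed from the `dis` output quoted earlier in the thread, and `run` and `program` are hypothetical names.

```python
def run(program, local_vars):
    """Interpret a list of (opname, argument) pairs on a value stack."""
    stack = []
    for opname, arg in program:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif opname == "LOAD_CONST":
            stack.append(arg)
        elif opname == "INPLACE_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "STORE_FAST":
            local_vars[arg] = stack.pop()
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opname}")
    return None

# First data set: the "bytecode". Second data set: the local variables.
program = [
    ("LOAD_FAST", "i"),
    ("LOAD_CONST", 1),
    ("INPLACE_ADD", None),
    ("STORE_FAST", "i"),
    ("LOAD_FAST", "i"),
    ("RETURN_VALUE", None),
]
result = run(program, {"i": 41})
print(result)
```

Nothing in the loop produces machine code; each opcode only selects which operation the already-compiled interpreter performs on its own data structures.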
 
Albert van der Horst

Once a python py file is compiled into a pyc file, I can disassemble
it into assembler. Assembler is nothing but codes, which are
combinations of 1's and 0's. You can't read a pyc file in a hex
editor, but you can read it in a disassembler. It doesn't make a lot
of sense to me right now, but if I was trying to trace through it with
a debugger, the debugger would disassemble it into assembler, not
python.

You know that python byte code is portable across architectures.

So you are disassembling using an Intel disassembler?
How can that make sense if you are on a SUN work station with a
non-Intel processor?
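
For context, the payload of a .pyc file is a serialised Python code object that follows a short header. A minimal sketch, assuming the standard `marshal` module and building the payload in memory rather than reading a real .pyc file (the source string and variable names are made up for the example):

```python
import marshal

source = "answer = 6 * 7\n"

# Compile to a code object, then serialise it the way a .pyc payload is stored.
code = compile(source, "<example>", "exec")
payload = marshal.dumps(code)

# Loading the payload back yields a code object the interpreter can run.
restored = marshal.loads(payload)
namespace = {}
exec(restored, namespace)

print(type(payload).__name__, namespace["answer"])
```

The payload is data for the Python VM, tied to the Python version rather than to a CPU, so running an Intel disassembler over it says nothing about what it does.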

Groetjes Albert
 
thebjorn

before you start mouthing off, maybe you should learn assembler.

I suppose I shouldn't feed the trolls... but what the heck ;-P I
could of course try to be helpful, but I don't think I have the skillz
needed.

I might know a thing or two about assembly though, I started out on
the Commodore 64, then I wrote TSR programs (both .com and .exe ;-)
for my IBM AT, and I wrote a compiler for a scheme-like functional
language (with SML-like syntax) that targeted the Motorola 68040
(which was inside my NeXTstation...).

[snip]
Don't fucking tell me about assembler, you asshole. I can read
disassembled code in my sleep.

Watch the language, fucktard. Perhaps you should try _writing_
something in assembly for a change? How about linking up a "hello
world" executable? You seem too clueless to be for real though, so my
original advice stands.

-- bjorn
 
