Thursday, 5 May 2016

compiler construction - What is the difference between compilation and interpretation?



I just had a conversation with a colleague and where were talking about the V8 JavaScript engine. According to Wikipedia,





V8 compiles JavaScript to native machine code [...] before executing it, instead of more traditional techniques such as interpreting bytecode or compiling the whole program to machine code and executing it from a filesystem.




where (correct me if I'm wrong) "interpreting bytecode" is the way Java works, and "compiling the whole program" would apply for languages like C or C++. Now we were wondering, debating and posing false assertions and presumptions about differences, similarities. To end this, I recommended asking the experts on SO.



So, who is able to





  1. name, explain and/or reference all major methods (e.g. precompiling vs. runtime interpretation)

  2. to visualize or to provide a scheme about the relations between source, compilation and interpretation

  3. give examples (name programming languages) for the major methods of #1.






Notes:





  • I am not looking for a long prosaic essay about the different paradigms, but an visually supported, quick overview.

  • I know that Stackoverflow is not intended to be a encyclopedia for programmers (but rather a Q&A platform for more specific questions). But since I can find a lot of popular questions, that kind of provide an encyclopedic view to certain topics (e.g. [1], [2], [3], [4], [5]), I started this question.

  • If this question would rather fit into any other StackExchange site (e.g. cstheory), please let me know or flag this question for moderation.


Answer



It's near-impossible to answer your question for one simple reason: There aren't a few approaches, they are rather a continuum. The actual code involved across this continuum is also fairly identical, the only difference being when things happen, and whether intermediate steps are saved in some way or not. Various points across this continuum (which is not a single line, a progression, but rather more of a rectangle with different corners to which you can be close) are:




  1. Reading source code

  2. Understanding the code


  3. Executing what you understood

  4. Caching various intermediate data along the road, or even persistently saving them to disk.



For example, a purely interpreted programming language Pretty much doesn't do #4 and #2 kinda happens implicitly between 1 and 3 so you'd barely notice it. It just reads sections of the code, and immediately reacts to them. This means there is low overhead to actually starting execution, but e.g. in a loop the same lines of text get read and re-read again.



Diagram of the balance of an Interpreter (not much caching going on)



In another corner of the rectangle, there are traditionally compiled languages, where usually, item #4 consists of permanently saving actual machine code to a file, which can then be run at a later time. This means you wait a comparatively long while at the beginning until the entire program is translated (even if you're only calling a single function in it), but OTOH loops are faster because the source doesn't need to be read again.




Diagram of the balance of a Compiler (mostly caching)



And then there are things in between, e.g. a virtual machine: For portability, many programming languages don't compile to actual machine code, but to a byte code. There is then a compiler that generates the byte code, and an interpreter that takes this bytecode and actually runs it (effectively "turning it into machine code"). While this is generally slower than compiling and going directly to machine code, it is easier to port such a language to another platform, as you only have to port the bytecode interpreter, which is often written in a high-level language, meaning you can use an existing compiler to do this "effective translation to machine code", and don't have to make and maintain a backend for each platform you want to run on. Also, this can be faster if you can perform the compilation to bytecode once, and then only distribute the compiled bytecode, so that other people do not have to spend CPU cycles on e.g. running the optimizer over your code, and only pay for the bytecode-to-native translation, which may be negligible in your use case. Also, you're not handing out your source code.



Another thing in between would be a Just-in-Time compiler (JIT), which is effectively a interpreter that keeps around code it has run once, in compiled form. This 'keeping around' makes it slower than a pure interpreter (e.g. added overhead and RAM use leading to swapping and disk access), but makes it faster when repeatedly executing a stretch of code. It can also be faster than a pure compiler for code where e.g. only a single function is repeatedly called, because it doesn't waste time compiling the rest of the program if it isn't used.



And finally, you can find other spots on this rectangle e.g. by not saving compiled code permanently, but purging compiled code from the cache again. This way you can e.g. save disk space or RAM on embedded systems, at the cost of maybe having to compile a seldom-used piece of code a second time. Many JIT compilers do this.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...