Technical Perspectives | Reflections on Reference-Based Dynamic Language Object Models in Virtual Machines

Ontology's NeoVM virtual machine has added several new instructions, such as DCALL, HAS_KEY, KEYS, and VALUES. These make it theoretically feasible to design a NeoVM-based object model for reference-based dynamic languages, which would bring language support closer to each language's native semantics.
The necessity of object model design
Ontology NeoVM exposes four object types to users: bytearray, array, struct, and map. The current Python, Go, and C# compilers reuse these four types directly, which creates several problems:
  • First, high-level languages usually have more basic object types than these four, so several language-level types end up mapped to a single NeoVM type. Because different objects behave differently under the same operation, the semantics of at least one of them must be sacrificed.
  • Second, the semantics of a high-level language object and of the underlying object it is mapped to are not necessarily equivalent.

In summary, we need a more general object model framework that can adapt to the semantic objects of different languages and so support multi-language smart contracts.
Take Python as an example. Python is a reference-based, dynamically typed language, so little type information is available at compile time. Ontology's Neptune compiler currently implements Python's basic arithmetic and control logic. Statically typed languages such as Go and C# can handle type checking and distinguish object semantics at compile time. For a dynamically typed language such as Python, however, expressiveness is limited without a more complete object memory model, and the semantics of different objects cannot be distinguished accurately.
Based on Ontology NeoVM, this article proposes a design for a reference-based dynamic language memory model. It is intended as a theoretical analysis to guide upgrading and refactoring Neptune and the Go compiler, and implementing compilers for other languages more accurately.
Object model
In theory, the semantic model of the underlying instructions needs to be simple and abstract enough to satisfy the semantics of different kinds of languages. In practice, it is difficult for a single instruction set to meet the operational requirements of every language's semantics. Most high-level languages therefore define their own semantic model and build it on top of a specific virtual machine, while lower-level languages such as Rust, C, and C++ are compiled directly to machine code and run on the CPU.
The best way to model in-memory objects is not to use the built-in semantic objects of Ontology NeoVM directly, but to design an object model based on the characteristics of the language; more precise semantics are also friendlier to developers. The cost of redefining object semantics is that the same logic compiles to several times as much bytecode as the current implementation produces, and the compiler itself becomes more complicated.
Following Python's design principle that everything is an object, all objects are implemented using a map or an array. For simplicity, the rest of this article assumes a map. The first key of the map is the built-in __type__ field, or an encoded equivalent, and the compiler checks that no user-defined attribute key collides with a system-reserved field. In Python, the basic object types are Number, String, List, Tuple, Set, and Dict; the basic operators include [] (subscript), +, -, *, %, //, and so on. These operators are member properties of the corresponding object. At run time, the type field determines which semantics an operator has. Functions are objects as well. Each object can therefore be represented by one uniform structure: a map that holds the type tag followed by the object's attributes and operator members.
In a concrete implementation, string keys take up considerable bytecode space and can hurt performance, so keys that belong to global structures can be statically mapped to integers.
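The map-based layout described above can be sketched in plain Python, with a dict standing in for a NeoVM map. This is only an illustration under stated assumptions: the integer type codes, the `__value__` payload key, and the `make_object` helper are hypothetical names, not part of Neptune or NeoVM.

```python
# Hypothetical integer codes for built-in types; the text only says that
# string names can be statically mapped to an integer representation.
TYPE_INT, TYPE_STR, TYPE_LIST = 0, 1, 2

TYPE_KEY = "__type__"    # reserved key named in the text
VALUE_KEY = "__value__"  # assumed reserved key for the raw payload
RESERVED_KEYS = {TYPE_KEY, VALUE_KEY}


def make_object(type_code, value, attrs=None):
    """Build a map-backed object whose first key is the type tag."""
    attrs = attrs or {}
    for key in attrs:
        # Check from the text: user keys must not shadow system fields.
        if key in RESERVED_KEYS:
            raise KeyError(f"attribute {key!r} is a reserved system field")
    obj = {TYPE_KEY: type_code, VALUE_KEY: value}
    obj.update(attrs)
    return obj


number = make_object(TYPE_INT, 42)
text = make_object(TYPE_STR, "hello", {"origin": "literal"})
```

Keeping the type tag as a small integer rather than a string is one way to apply the static mapping mentioned above; a real compiler would perform the reserved-key check at compile time rather than at run time.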
Symbol table and relocation
To implement dynamic typing, the symbol table must be kept in the runtime environment, that is, in the global runtime object environment. For operations such as addition, subtraction, multiplication, and division, the symbol name is formed by mangling the object type together with the operator name; for other member functions, the object name is mangled together with the member function name. Relocation happens once compilation is complete and all function offsets have been determined. After the system has built the global object, it jumps immediately to the relocation function, which processes the symbols that need relocating. From then on, accessing an object, for example to make a function call, yields the correct offset.
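As a rough sketch of the mangling and relocation steps just described, the following Python models the symbol table as a dict. The `Type.member` mangling scheme, the offset values, and the function names `mangle`, `relocate`, and `resolve_call` are all assumptions made for illustration; the original pseudocode is not available.

```python
def mangle(type_name, member):
    """Combine an object type name with an operator or member function name."""
    return f"{type_name}.{member}"


# Function offsets, known only once compilation has finished (made-up values).
FUNCTION_OFFSETS = {"Int.__add__": 0x40, "Str.__add__": 0x7C}


def relocate(symbol_table, offsets):
    """Fill every unresolved entry (None) with its final function offset."""
    for symbol, target in symbol_table.items():
        if target is None:
            symbol_table[symbol] = offsets[symbol]
    return symbol_table


def resolve_call(symbol_table, type_name, member):
    """Look up the offset to jump to for a call on the given object type."""
    return symbol_table[mangle(type_name, member)]


# The global environment is built first with unresolved symbols...
symbols = {mangle("Int", "__add__"): None, mangle("Str", "__add__"): None}
# ...and relocation runs immediately afterwards.
relocate(symbols, FUNCTION_OFFSETS)
```

After relocation, `resolve_call(symbols, "Int", "__add__")` returns the offset recorded for integer addition, which is the value a dynamic call instruction such as DCALL would need.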
Static mapping of global objects
Using symbol names directly as keys increases bytecode size, and NeoVM processes array operations faster than map operations. The compiler should therefore push symbols onto the stack as rarely as possible and use a static symbol table to map global and local variables to integer indexes, which reduces the bytecode generated and improves performance. A static table also lets the compiler catch more errors at compile time, such as undefined names and duplicate definitions. Global objects can then be stored in an array.
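A minimal sketch of such a static symbol table follows, assuming a compiler written in Python. The `StaticSymbolTable` class and the variable names `owner` and `total` are hypothetical; the point is only that names become array indexes at compile time and that undefined or duplicate names are rejected early.

```python
class StaticSymbolTable:
    """Compile-time map from variable names to slots in a global array."""

    def __init__(self):
        self._index = {}

    def define(self, name):
        # Duplicate definitions are caught at compile time.
        if name in self._index:
            raise NameError(f"duplicate definition of {name!r}")
        self._index[name] = len(self._index)
        return self._index[name]

    def lookup(self, name):
        # Undefined names are caught at compile time as well.
        if name not in self._index:
            raise NameError(f"{name!r} is not defined")
        return self._index[name]

    def __len__(self):
        return len(self._index)


table = StaticSymbolTable()
OWNER = table.define("owner")  # slot 0
TOTAL = table.define("total")  # slot 1

# At run time the globals live in a plain array, indexed by slot number.
global_slots = [None] * len(table)
global_slots[TOTAL] = 1000
```

With this mapping, generated code only needs to emit a small integer and an array access instead of pushing the full variable name and performing a map lookup.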


Member object access and object inheritance processing
As mentioned above, global objects are stored in the global runtime environment, while local objects are stored in the function's local runtime environment. An object has already been fetched from its runtime environment before its members are accessed, so a member variable can be obtained by using its name as the key. Because the language is dynamically typed, member names cannot be mapped to integer indexes from compile-time information; the variable name must be used directly.
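One possible way to combine name-based member access with inheritance is to walk a chain of parent objects. The sketch below assumes a reserved `__base__` key that links an object to its parent; that key and the `get_member` helper are illustrative choices, not something the text specifies. The `name in current` test plays the role that an instruction like HAS_KEY could play on a NeoVM map.

```python
BASE_KEY = "__base__"  # hypothetical reserved key linking to the parent object


def get_member(obj, name):
    """Look up a member by name, falling back to parent objects."""
    current = obj
    while current is not None:
        if name in current:  # comparable to a HAS_KEY check on a map
            return current[name]
        current = current.get(BASE_KEY)
    raise AttributeError(f"object has no member {name!r}")


animal = {"__type__": "Animal", "legs": 4}
dog = {"__type__": "Dog", BASE_KEY: animal, "name": "rex"}
```

Here `get_member(dog, "name")` is found on the object itself, while `get_member(dog, "legs")` is resolved through the parent, and the object's own `__type__` shadows the parent's.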


Operator implementation and overloading
Because of the new object model, operators can no longer map directly to NeoVM instructions; each must be implemented in terms of the object it applies to. The semantics of each operator are bound to a specific object type. The compiler obtains the operator from the AST and generates a separate operator function for each object type; at run time, execution jumps to the handler that matches the object's type. For example, addition of string objects and addition of int objects are two different function implementations.
With this approach, any object can override the add function to define its own addition semantics, and other operators work the same way. For built-in types such as int, string, list, and map, the compiler must generate the built-in operator handlers itself.
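The dispatch scheme can be illustrated with a small Python sketch. It assumes objects are dicts carrying a `__type__` tag and an assumed `__value__` payload; the handler table, the `Money` type, and the function names are hypothetical. Built-in types go through a handler table, and a user-defined object overrides addition by carrying its own `__add__` member.

```python
TYPE_KEY, VALUE_KEY = "__type__", "__value__"  # __value__ is an assumed key


def int_add(a, b):
    return {TYPE_KEY: "Int", VALUE_KEY: a[VALUE_KEY] + b[VALUE_KEY]}


def str_add(a, b):
    return {TYPE_KEY: "Str", VALUE_KEY: a[VALUE_KEY] + b[VALUE_KEY]}


# Built-in handlers the compiler would generate, keyed by type tag.
ADD_HANDLERS = {"Int": int_add, "Str": str_add}


def add(a, b):
    """Dispatch '+' on the left operand: user override first, then built-ins."""
    override = a.get("__add__")
    if override is not None:
        return override(a, b)
    handler = ADD_HANDLERS.get(a[TYPE_KEY])
    if handler is None or a[TYPE_KEY] != b[TYPE_KEY]:
        raise TypeError("unsupported operand types for +")
    return handler(a, b)


def money_add(a, b):
    """User-defined addition for a hypothetical Money object."""
    return {TYPE_KEY: "Money", "__add__": money_add,
            "cents": a["cents"] + b["cents"]}


price = {TYPE_KEY: "Money", "__add__": money_add, "cents": 250}
tip = {TYPE_KEY: "Money", "__add__": money_add, "cents": 100}
```

Calling `add(price, tip)` runs the overriding `money_add`, whereas two `Int` objects or two `Str` objects reach their respective built-in handlers.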
Control logic
Control logic has little to do with object semantics, but before a control instruction executes, the object must be converted to a native value of the Ontology NeoVM, such as a Boolean.
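A conversion of this kind might look like the sketch below, which follows Python's usual truthiness rules (zero and empty containers are false). The type names, the `__value__` key, and the helpers `to_vm_bool` and `branch` are assumptions for illustration only.

```python
TYPE_KEY, VALUE_KEY = "__type__", "__value__"  # __value__ is an assumed key


def to_vm_bool(obj):
    """Reduce a model object to the plain Boolean a jump instruction needs."""
    type_name, value = obj[TYPE_KEY], obj[VALUE_KEY]
    if type_name == "Bool":
        return value
    if type_name == "Int":
        return value != 0
    if type_name in ("Str", "List", "Dict"):
        return len(value) > 0
    raise TypeError(f"cannot convert {type_name} to Boolean")


def branch(condition_obj, if_true, if_false):
    """Model of a conditional jump: convert first, then choose a target."""
    return if_true if to_vm_bool(condition_obj) else if_false
```

The compiler would emit the conversion just before each conditional jump, so the rest of the control-flow code generation can stay unchanged.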
NeoVM Service processing
Data returned by a NeoVM service uses native Ontology NeoVM semantics, so it must be wrapped in an object of the current design according to its return type. When translating a Syscall, emitting Syscall + servicename alone is not enough; the code that constructs the corresponding object type must follow it. Likewise, arguments passed to a syscall must first be converted back to the underlying Ontology NeoVM semantics.
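The unwrap, call, and rewrap sequence can be sketched as follows. The service here is a stand-in Python function, not a real Ontology API, and `call_service`, `unwrap`, and `fake_get_balance` are hypothetical names; the return type is passed in explicitly because the text says the result is constructed according to the return type.

```python
TYPE_KEY, VALUE_KEY = "__type__", "__value__"  # __value__ is an assumed key


def unwrap(obj):
    """Restore a model object to the raw value the VM service expects."""
    return obj[VALUE_KEY]


def call_service(service, return_type, *args):
    """Unwrap arguments, invoke the service, and wrap the raw result."""
    raw_args = [unwrap(arg) for arg in args]
    raw_result = service(*raw_args)  # stands in for Syscall + servicename
    return {TYPE_KEY: return_type, VALUE_KEY: raw_result}


def fake_get_balance(account):
    """Stand-in for a native service; takes and returns raw values only."""
    return 100 if account == b"alice" else 0


account = {TYPE_KEY: "Str", VALUE_KEY: b"alice"}
balance = call_service(fake_get_balance, "Int", account)
```

The wrapped `balance` can then take part in the operator dispatch of the object model like any other `Int` object.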
Conclusion
Because language semantics are so diverse, directly reusing the native semantics of Ontology NeoVM cannot faithfully support the native semantics of each language. A dedicated object model makes the semantics available to smart contracts more precise and more powerful. As the model is refined toward the language's native semantics, the built-in objects int, string, list, and map gain richer and more precise behavior, which is friendlier to developers. The trade-off is several times as much bytecode as the current compiler generates, and a more complicated compiler implementation.