MULTITHREADING LPMUD GAME DRIVER

Petri Virkkula <[email protected]>
Balanced Alternative Techniques Ry

Abstract

Making a multithreaded LPMud game driver would be a very challenging project. Finishing a such could be rewarding even if the multithreading itself did not have any other advantages.

I am giving out my opinion about such project in this document meaning that I am trying to answer to questions "why" and "how" to make a multithreaded driver. Additionally I give a small comparison with other driver projects.

Table Of Contents
List Figures
1. Introduction
2. Advantages Of Multithreading
3. Disadvantages Of Multithreading
- 3.1 Synchronization Overhead
- 3.2 Code Complexity
4. Using Multithreading
5. Related Work
- 5.1 DOME
- 5.2 Amoeba And Plan9
6. Summary And Future Directions
- 6.1 Summary
- 6.2 Future Directions
Glossary
References
Document History

1. Introduction

1.1 Overview

This document contains my opinion why and how to make a multithreaded LPMud game driver. I think that technical point of view it is very desirable to have a multithreading game driver because the advantages of such design outweights the disadvantages.

Much of my thoughts are based on knowledge about currently available hardware for such game driver if one were ever programmed.

1.2 Changed Design Objectives

I think the original design objectives of lpmud game drivers were to save as much as possible system resources (such as memory, CPU time, etc.) because at that time muds were running on computers shared with other activities. Muds were actually supposed to use only the excess resources and not disturb the other uses of computers, eg. speed-space tradeoff problems were solved in favor of saving space instead of going after speed.

Currently there are many muds running on dedicated computers and using still drivers designed with principles that clearly do not apply in such dedicated environments. Additionally prices of things like memory and a second CPU are dramatically lower than they used to be. In presence of such dedicated computers I think it is time to change the basic questions "how to save CPU, RAM or disk space" to be "how to make the mud to utilize all the available resources in the best possible way".

However, clearly not all muds have a dedicated computers or too much excess hardware available. This document can still be usefull even for that kind of muds though only a limited way.

1.3 Document Organization

Chapter 2 "Advantages Of Multithreading" tries to answer to the question "why to use multiple threads". While there are many reasons for it, only few of them are discussed there.

Chapter 3 "Disadvantages Of Multithreading" tries to list some of the problems raising from usage of multithreading.

Chapter 4 "Using Multithreading contains some answers to the question "how to use multithreading in the driver". These answers are no way complete, they are only some of solutions I have been thinking.

Chapter 5 "Related Work" contains information about related project with a little comparison between my ideas and the ideas used by those other projects.

Chapter 6 "Summary And Future Directions" contains summary of this document. The chapter gives also some future directions.

2. Advantages Of Multithreading

2.1 Modularity

Modularity can be more easily achieved by utilizing threads by dividing a driver into smaller "tasks" and using a thread per a task. Threads are not however requirement for modularity, but threads can help the implementation dramatically.

A good example of a separate task is management of calls scheduled for later execution, so called call_outs. There could be a thread just waiting for additions/deletions of call_outs to the point when a next call_out must be executed. The result would be a module that has two entry points: one for a call_out addition and another for deletion. Other parts of the game driver would not need to worry about the implementation and especially scheduling of the execution of call_outs.

A modular design has many advantages over monolithic design. The first and most obvious is reduction of bindings between different parts of the driver. In the above example, there would clearly be only three depencies (as addition of standard utility function for memory allocation and so on): two exported functions and one function that is called by the call_out management thread when a call_out need to be executed.

The second advantage follows directly from the first one: a module with a few (well-defined) bindings can be tested more easily, especially by using automatic testing methodlogies. This a great advantage that should result a more stable driver if testing is done seriously.

2.2 Scalability

Prices for multiprocessor machines have gone lower and lower. It is now longer exceptional to have a multiprocessor machines even at home. Without multithreading only one CPU can be utilized by a driver, which might matter much in case of dedicated computer. By making the driver multithreaded from the beginning, it scales automatically from uniprocessing computers to multiprocessor computers.

If an application is first divided into smaller tasks utilizing threads and communication between the tasks is done through well-defined interfaces, it is simpler to scale the application to to utilize multiple computers. Simple function calls can be modified to be utilize messages or RPC calls. Such messages can easily be exchanged through shared memory and through network between multiple computers.

Separating all network I/O into a separate thread(s) is a good example. There could be one (or more) thread(s) doing only single thing: wait for a input data, and buffer the data until a complete command has been received. The command is then passed into separate execution thread (which is doing all the time the most important thing from user's point of view: executing user commands). This kind of design can be kept intact even when the network I/O processing is moved into a separate computer.

2.3 Performance

Multithreading introduces performance penalties because of data contention. I think however that systems like MUDs should be thought in whole, for example only counting commands per seconds instead of testing performance with artificial benchmark code (though results of more "traditional" benchmarks should not dismissed alltogether). Another way to express performance could be the average response time to users but in the most cases this meter is more influenced by network delays than a driver design.

There is a strong relationship between commands/second and bytecode instructions/second in LPMuds. The commands/second rate can be made bigger by using several methods: make the interpreter faster, add more optimizations to bytecode generation and allocate more CPU-time for the interpretation of the bytecode.

There are some ways to modify the interpreter to be faster in expense of memory usage. Most currently used algorithms has been selected based on their memory consumption rather than their speed. However I think most speed improvements can be achieved by modifying the other parts of the driver that are utilized by the driver. Most direct optimization would be saving of has value of shared strings and utilization of that has value to other parts of the driver (for example to hashing of objects). There are also other parts of the driver that could be made faster.

There is always space for optimizations in LPC compilers. Additionally by adding better typechecking one can detect errors earlier, and in some cases make the code faster. The compiler can also be made more concurrent instead of the current iterative way. Support for saving bytecode programs onto disk should be added.

I think the best place for speed improvements is to give more CPU-time for the interpreter. Here threads comes in the picture. The interpreter should have a separate thread dedicated running bytecode. This means implicitly also that other parts of the driver have their own threads. Using a separate thread for the interpreter makes it possible to evaluate bytecode and to do I/O processing at the same time even in computers having a single CPU. There should be even bigger advantages in SMP machines and in configurations utilizing multiple machines.

3. Disadvantages Of Multithreading

3.1 Synchronization Overhead

Multithreaded code requires synchronization of access to shared resources. The synchronization degrades performance somewhat and there is not much one can do to avoid it.

3.2 Code Complexity

Multithreaded code can be more complex when compared with code intented run in a single thread. I think however that much of that complexity can be avoided by using a good and consistent design.

One way to avoid the code complexity problem is to use messages in communication. This decision however raises another problem: as the count of the messages goes higher managing them becomes more complex too.

4. Using Multithreading

4.1 Architecture

The architecture I have been thinking about would contain multiple threads and processes in multiple machines, see Figure 4.1: Multithreaded Architecture for the division of processes between two computers.

Figure 4.1: Multithreaded Architecture

In the picture load of the driver is distributed between two machines.

4.2 Tasks

4.2.1 Compiler

In the current LPMud implementations LPC compiler is an integral part of the game driver. I think this is not a good idea. It would better if the bytecode would be so well defined that compiler could be separated from the driver. This would also make possible to use a command line compiler outside a MUD.

Some modifications are needed into LPC language. The biggest change is visibility of variables defined in a object. They would no longer be available in derived objects, access to them must be done through a method call. This modification allows the compilation process to be changed from its current iterative form to support more concurrent compilation.

As addition of utilizing multiple processes and multiple machines in compilation the compiler would be multithreading. The compiler would implement following steps (after testing that they really make a compilation faster):

Preprocessing, lexing and parsing in two threads:
- Preprocessor and Lexer (1 thread)
- Parser (1 thread)
Typechecking
Optimization (N threads)
Code Generation (N threads)

Ie. there would two separate threads working during the first phase (the first produces tokens and the second one consumes them). The second phase would use only one thread to avoid need for synchronization and possible deadlocks. Optimization and code generation both could utilize multiple threads in a SMP machine because they do not need to do any syncronization if the steps are done per function basis.

The compiler would have no knowledge about efuns. Actually there wouldn't be any of them. There would instead be a new construct in LPC language: require "path"; (for example require "lpmud";). The require construct imports functions defined in a shared library into the object where they are accessed just like other functions. The require construct would ofcourse limited to trusted objects. Additionally there would be concept of "auto-object" as DGD has.

4.2.2 Interpreter

The interpreter is the most important part of the driver. Its purpose is evaluate bytecode instructions as requested by other parts of the driver. That's why it is a good place for optimizations.

The biggest modification to the interpreter is removal of efun opcodes. There is only two function call opcodes, one for calls inside a given object and another for calls to a method in an different object.

The internal object structure holds only minimal information required by the interpreter. Additionally a dynamically loaded shared library can attach data to an object. A library needs only to register an initialization and cleanup handlers (called at the creation and destruction times of the object, respectively).

4.2.3 Disk I/O

Synchronous disk I/O is a problem in performance point of view. Let's consider restore_object efun as an example. It does the following steps:

Check validity of arguments
Ask the master object to validate the request
Read a save file and process data in it
Set variable values in an object

Because of the synchronous nature of disk I/O, a CPU is partially idle during the file reading phase. That is not desirable if the mud is running on a dedicated computer. Instead better results can be achieved with asynchronous disk I/O processing, for example in the following way:

Check validity of arguments
Concurrently do following things:
- Read a save file and process it
- Ask the master object to validate the request
Wait the asynchronous request to finish
Either discard the read data or set variable values based on it

Above scenarion should be faster in both uniprocessor and multiprocessor machines, but it is not yet known how much. If using a SMP machine the speed difference should be even bigger if the disk I/O processing utilized caching of file contents. In case of an uniprocessor machine disk caching could however lower the performance (there would be two CPU bound threads competing on a single CPU).

4.2.4 Network I/O

All network connection management is moved into a separate process, possibly even in a separate machine. It is possible even have multiple processes to do user input processing. This architectural decision gives lots of future possibilities such as: adding SSL/SSH support (they would not eat CPU time of the actual driver), adding support for RMI based Java clients, moving user alias expansion to a separate process (again giving more CPU time to the actual driver), removing need to support all terminal types used by players from the mudlib (the network daemon can then make conversion to the actual type), etc.

4.2.5 Timed Calls

Timed call management is a good place to utilize threads. The currently available LPMud drivers use signals for timed calls: a signal handler sets a global flag that is then checked everywhere in the driver. A better solution is to use threads and thus removing need to check such global flag alltogether.

4.3 Communication

The basic communication is done with messages between the separate tasks. Everything can be thus done asynchronously, but also synchronous operations are possible by using blocking response wait function.

Figure 4.1: Communication Model

All service requests to a task must go through an API. Purpose of this requirement is to give possibility to distribute the services into multiple processes and perhaps even multiple machines. Figure 4.2: Communication Model gives an example how communication should be done. Figure 4.3: Future Communication Model shows to this communication model can be expanded in the future.

Figure 4.1: Future Communication Model

The IPC mechanism could be built on top of MPI which supports unicast, multicast and broadcast communications.

4.4 Utilities

4.4.1 Memory Management

Memory management can easily become a bottleneck, especially in programs where most memory is allocated dynamically as it is done in LPMuds. Also many internal errors of the game driver can be attributed to invalid pointers and other memory allocation related errors.

Both these problems can be solved or their impact can be significantly reduced with use of multiple heaps. This would make it possible for multiple threads to access memory allocation and deallocation routines. If that were enough even more concurrency support can be added inside the heaps.

For example, current struct object can be divided into two areas: fixed sized header and variable sized data (object's variables). The game driver can then have a heap dedicated for those object headers, for example at range 64MB-128MB in the virtual memory space of the driver. The net result would be that every pointer referencing an object must have value that is 0x04000000 + N * sizeof(struct object). Checking for validity of pointers would thus be easy and more importantly fast. Additionally allocating and freeing memory of objects would be fast too (just remove or add an object from list of free objects).

Not everything can be forced to be fixed size, for example there are big differences in lengths of strings. But in even variable sized memory blocks can be tracked by using same memory allocation algorithm as GNU-Malloc uses for small blocks.

4.4.2 String Management

A very important of a game driver is the string management, both in terms of speed and memory usage. Currently LPMuds share strings by employing a hash table and reference count. Unfortunetely the shared strings do save their hash value or length and they need to be recalculated every time these attributes are needed.

The first modification is to incorporate the hash values and lengths into shared string data as addition to the reference count. The second change is to modify all places where strings are used to utilitize those precalculated values. One such place is object hash table.

Another modification is to make the strings binary types. The strings would no longer be terminated by zero. Instead the length attribute would tell the end of the strings.

5. Related Work

5.1 DOME

The main idea of the DOME project is to provide execution environment that automatically handles resource and contention management on behalf of a programmer using the system. The system consists of multiple machines each having separate game driver communicating with each others. A syncronization support is included all drivers. [Weidt94]

There is big difference how DOME works and how I think a multithreaded/distributed LPMud should work. There are multiple computers and each of them run a separate game driver process in the DOME. The DOME system then invisibly to users and mudlib programmers handles access to objects distributed among the servers.

I am favoring a system where a game driver is divided tasks that can be distributed over multiple machines, instead of making multiple copies of the driver processes and adding object sharing between them.

5.2 Amoeba And Plan9

There are lots of common between my way of thinking and the Amoeba operating system. Both have division of different parts of the system into tasks, just like I'd like to divide LPMud driver into separate tasks. [Tanenbaum92]

A good example of task (called server in Amoeba's terminology) is the Bullet server in Amoeba. Its only job is to store and retrieve files from physical disks. It does no directory tree maintaining, only the file contents management. Because of that it is supposed to be very fast and its implementation is simpler.

6. Summary And Future Directions

6.1 Summary

There are real advantages of using threads in the game driver design.

6.2 Future Directions

The ideas expressed in this document are mainly focused of separating tasks into separate threads. The next logical step is to multiply these threads, especially add support for multiple interpreter threads. This is not a simple task as many problems raises if the concurrent execution is supposed to be done invisibly to programmers.

Glossary

API: Application Programming Interface.
IPC: Inter Process Communication.
MPI: Message-Passing Interface.
MUD: Multi User Dimension (also Multi User Dungeon).
MVC: Model-View-Control.
RMI: Remote Method Invocation.
RPC: Remote Procedure Call.
SMP: Symmetric Multi Processing.

References

[Weidt94]	Alexander Weidt. DOME - A distributed object oriented execution environment. Technical University Berlin, 1994 <URL:http://www.cs.tu-berlin.de/~demos/dome.html>
[Tanenbaum92]	Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall International, Inc., 1992.

Document History

Version	Revision	Date	Author	Comments
0.1	1.28	1998-10-15	Petri Virkkula	Initial unfinished version
0.2	1.32	1998-10-15	Petri Virkkula	Lots of text as well as two more figures added
0.3	1.36	1998-10-23	Petri Virkkula	Added more text

[ $Revision: 1.36 $ | $Date: 1998/10/23 20:18:32 $ ]