Malrio

Chang Gao (cgao7); Liyan Chen (liyanc); Bruce Tan (junyangt)

Video

Project Summary

Our project is a Super Mario Maker™ gameplay simulation in Minecraft, including two pieces of mechanism: a Mario player and a world map generator. We also provide an original physics engine to simulate actions of Mario.

The player is a partially observable agent. For each time step, it perceives a visible frame from the world map and makes a decision for the next action. We provides both deep-learning based AI and A-star searching AI for the agent. Manual control through keyboard is also implemented.

The generator will randomly generate a Minecraft world similar to maps in Super Mario Bros. We provides both simple obstacles generator and Prim map generator. Each map is associated with a goal - red mushroom - for Mario to reach.

The ultimate goal of this project is to train the Malrio agent with reinforcement learning initialized by supervised learning so that, given a random world, it can play the Mario to the goal by making decisions based on perceptions.

Approaches

Part I: Environment setup and Physics Simulation

World representations: In the physics simulator, we attempt to simulate a Mario world inside Minecraft:
- Brick: It simulates ground or unbreakable brick in Super Mario Bros. All bricks are colored brown in Malrio, and are unbreakable. When hitting a brick with feet, Mario will land on it. When hitting a brick with head, Mario will fall back. When hitting a brick with each side of the body, Mario will stop.
- Lava: It simulates items that will kill Mario, such as pits or lava in Super Mario Bros.
- Mushroom: It simulates the goal flag in Super Mario Bros.
Control and collision: Since Minecraft physics engine has its own rules, we create a separate physics engine in python including Newtonian mechanical dynamics simulation and rigid body collision resolution. Simulation results are sent to Malmo for each frame. One of the core simulation mechanism is a time variant linear system, where a 3 by 3 matrix represents the Newtonian mechanical dynamics of Mario: $\begin{bmatrix} X & v_x & a_x \\ Y & v_y & a_y \\ Z & v_z & a_z \\ \end{bmatrix}$ , where $X, v, a$ denotes displacement, velocity and acceleration respectively.
- Status Update: For each time step $\Delta t$ , a matrix multiplication would give the next state by preserving the following equations: $v_{t+\Delta t} = v_t + a_t \Delta t$ , $X_{t+\Delta t} = X_t + v_t \Delta t + \frac{1}{2} a_t \Delta t^2$ . Parameters of the original SuperMarioBros physics are replicated to reproduce the authentic controlling style of it.
- Actor control: We support a group of actions and action combinations. Like the “LFAB” buttons of original Super Mario Bros game, our control design can make 6 actions: jump, left move, right move, jump with left move, jump with right move, remain. Implementation of actions are achieved by manipulating specific velocity and accelerations of the system.
- Collision: If Mario collide with the ground, y-velocity will be cancelled; if collide up to a brick, y-axis velocity will be inverted; if collide to bricks in sides, x-axis velocity will be cancelled.

Part II: Datasets collection for supervised training

The dataset has pairs of frame-action correspondences. For each pair, a visible frame is cropped and an action is sampled.

Map Generation: Before a set of frame-action correspondences is collected, an unique world map is generated for producing the set. Maps are generated from simple obstacles generator or Prim map generator. The simple obstacle generator generates a world with randomly generated blocks and pits patterned like Level 1-1 in Super Mario Bros. The Prim map generator generates a maze-like world with Prim’s algorithm, which is harder to solve due to the requirements of spatial reasoning.
Visible Frame Crop: A square region around Mario is cropped as the perception to agents at a particular time.
Action Labels: An A-star search with global perception gives a plausible route from the beginning to the end, which consists of a series of actions at each time step. With the one-to-one mapping relationship to time step, action series can be combined with corresponding visible frames to build frame-action pairs.
Action Downsampling: With the observation that human players usually cannot change actions at every 60 frames in second, we introduce a prior of action downsampling. During a period of time, agents (including a-star search) can only change its action every 5 frames, and gaps between different actions are filled with “remain” actions. This technic is used throughout the whole system. In current design, frames in the gap are discarded and only frames with action labels are used to build the dataset. However, we are planning on building 5frames-action pairs in later development to give temporal information to advanced agents.

Part III: Supervised Training

The basic end-to-end neural network modal is a stack of cnn layers trained with supervised training.

Input: A visible frame at a particular time is defined as a 15 blocks x 15 blocks region around Mario. Therefore, the frame can be represented as a square grid with one-hot encoding, where each cell has 4 features with values of [0, 1], indicating the presence of obstacles, Mario, coins, and enemies. Since the position of Mario is not necessarily aligned with block boundaries, a sampling scheme is used to provide finer details. In particular, each block is represented by 4 x 4 pixels, and the presence of a bounding box would result the feature of covered pixels to be turned on. And ground truth label from the dataset is encoded as 6 dimensional vector using one-hot encoding, where each dimension represents an action. When an action is labeled, the corresponding feature will be turned on to 1, and other features will be turned off to 0. Since the simple model only considers 1frame-1action mapping, inputs and labels can be represented as $X \in R^{b,60,60,4}, Y \in R^{b,6}, \text{where b is the batch size}$ .
CNN Blocks: There are two kinds of basic Convolutional Neural Network units used in the model. The first one is a nonlinear stack of convolution, batch normalization, activation function(relu). The second one is a affine linear stack of convolution, batch normalization. Since the batch normalization will shift the covariant statistics within a batch, additional bias would be redundant. The reason for using affine linear cnn layers is that we adopted some ResNet-like structure. Therefore, introduction of nonlinearity into the identity channel would corrupt the residual learning. In addition, we adopted inception-like structures as well. Like inception-resnet-v2, each cnn block is built on several branches of cnn layers with various kernel sizes. For design details, please refer to src/Supervised/cnn_units.py file.
Prediction and Loss: After stacking CNN blocks, a global average pooling is used to produce a vector. The vector is passed through several small fully-connected layers to generate logits. And final prediction is given by softmaxing the logits. Since the labels and prediction are designed to be a probability distribution over actions, the loss function is defined as the Kullback–Leibler divergence(cross-entropy) between two distributions.
Training: The model is trained with adam gradient descent optimization to minimize the loss function.

Evaluation

A-star: Given enough time, the a star search is able to give a feasible solution. However, it usually takes too long. Thus, we only use a star search to generate examples on small maps to give examples to the cnn model.
CNN: When trained with the dataset, the prediction accuracy is usually around 80% - 90%. Since we have more data points than the parameters in CNN, it largely prevents overfitting and the test accuracy is very close to the training accuracy. In randomly generated maps, the success rate of CNN model is around 37%. The most frequent failure is getting stuck in corners.

Remaining Goals and Challenges

Currently, our prototype is limited to small maps only, and our deep learning models are still primary. For remaining weeks, we plan to

Revise our deep neural networks. We will improve quality and accuracy of our current convolutional neural networks, and will also examine on other models. The baseline is able to solve 50 percent of maps solvable by A-star search in our datasets.
Introduce more complex structures in the model, including Region Proposal Network(RPN), ConvLSTM, and DNC.
Generate more complex map datasets. The baseline is Level 1-1 in Super Mario Bros.
If time permits, change to q-learning and double q-learning.

By the time our final report is due, we should be able to achieve an intelligent AI that could solve around 50 to 90 percent of maps of our dataset, where maps are at least as complex as Level 1-1 in Super Mario Bros. The major difficulty lies in the model training and supervised label generation (since it directly determines the training result). Also, there could be still bugs or foibles in our physics engine.

To deal with them, we will divide our work into three major parts: model training, dataset generating and system revision, and assign them to different team members. Since our code is written in greatest modularity possible, it is easy to cooperate by providing functions and interfaces as negotiated.