The Process of Converting a Couch Co-Op Game to LAN Multiplayer

This post describes the architectures I went through while converting Aigilas from a local co-op game into one that supports LAN play.

Preparation

Before facing this beast head-on, I read through a number of articles and blog posts. Below is a sampling of what I read to get a handle on the problem, with the most helpful material listed first.

http://buildnewgames.com/real-time-multiplayer/ - Inspired by the networking development docs released by Valve. Covers different ways to run a single authoritative server while keeping clients (mostly) in sync.

http://www.gabrielgambetta.com/?p=11 - Four-part series that gives a graphical overview of the topics covered in the buildnewgames article.

http://www.gamasutra.com/view/feature/3094/1500_archers_on_a_288_network_.php/ - The first article I read on building a multiplayer game. The story is an enjoyable read and is full of nuggets that can only be gleaned from the experience of building a AAA multiplayer game.

http://www.jenkinssoftware.com/raknet/manual/multiplayergamecomponents.html - This manual page from RakNet gave me better Google search terms to use throughout the conversion process.

http://en.wikipedia.org/wiki/Dead_reckoning - One of the “de facto” algorithms in multiplayer networking.

http://www.mmorpg.com/discussion2.cfm/post/2228522#2228522 - By far the closest post I could find to a layman’s explanation of dead reckoning.

http://forums.create.msdn.com/forums/t/13354.aspx - My first encounter with a library called Lidgren.

http://msdn.microsoft.com/en-us/library/bb975645%28v=xnagamestudio.31%29.aspx - Official sample that details the basics of setting up a multiplayer game.

First Attempt

The game itself has serialization issues, so sending the entire game state didn’t seem like a feasible line of attack. After reading some of the articles listed above, I settled on the following approach:

The first game launched would host a server in its own thread
Each game launched after that would become a client
The server would seed the RNG for each client with the same seed
The server would communicate the state of keyboards/gamepads to each client throughout the game’s duration
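The plan depends on determinism. If every client seeds its random number generator the same way and applies the same relayed inputs in the same order, every client should compute the same world. The following minimal Python sketch illustrates that assumption. It is not the game's code, which sat on top of Lidgren, a C# library. `ClientSim`, `simulate`, and the randomly wandering "enemies" are hypothetical stand-ins.

```python
import random


class ClientSim:
    """Hypothetical stand-in for one client's copy of the game world."""

    def __init__(self, seed):
        # Every client seeds its RNG with the value handed out by the server.
        self.rng = random.Random(seed)
        self.player_x = 0
        self.enemies = [0, 0, 0]

    def step(self, inputs):
        # `inputs` is the keyboard/gamepad state the server relayed for this frame.
        if inputs.get("right"):
            self.player_x += 1
        if inputs.get("left"):
            self.player_x -= 1
        # Stand-in enemy AI: each enemy takes a random step drawn from the shared RNG.
        self.enemies = [e + self.rng.choice((-1, 0, 1)) for e in self.enemies]

    def state(self):
        return (self.player_x, tuple(self.enemies))


def simulate(seed, input_stream):
    """Run one client from the shared seed over the relayed input stream."""
    client = ClientSim(seed)
    for inputs in input_stream:
        client.step(inputs)
    return client.state()


if __name__ == "__main__":
    inputs = [{"right": True}, {}, {"left": True}, {"right": True}]
    # Two clients that share a seed and see identical inputs end up identical.
    print(simulate(42, inputs) == simulate(42, inputs))  # True
```

The guarantee holds only as long as every client sees identical inputs on identical frames and consumes the RNG identically. Any drift in either one breaks it.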

During this phase, I was trying to wrap my head around the best practices for using Lidgren. How to properly start a server and a client in separate threads wasn’t clear to me, and, due to my own ignorance, a number of socket-related exceptions kept me from moving forward.

To keep communication lightweight, I used raw byte arrays and cast the needed values into them.
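In Python, packing input into raw bytes can be done with the standard `struct` module. The four-byte layout, the button list, and the function names below are assumptions made for illustration. The post does not describe Aigilas's actual message format, and the original code was C# on top of Lidgren.

```python
import struct

# Hypothetical button layout: the post does not describe the real message format.
BUTTONS = ("up", "down", "left", "right", "fire")

# Assumed wire layout, 4 bytes total, little-endian:
#   B  player id (0-255)
#   B  button bitmask, one bit per entry in BUTTONS
#   b  analog stick x, scaled to -127..127
#   b  analog stick y, scaled to -127..127
INPUT_MESSAGE = struct.Struct("<BBbb")


def encode_input(player_id, pressed, stick_x=0, stick_y=0):
    """Pack one player's input state into a compact byte string."""
    mask = 0
    for bit, name in enumerate(BUTTONS):
        if name in pressed:
            mask |= 1 << bit
    return INPUT_MESSAGE.pack(player_id, mask, stick_x, stick_y)


def decode_input(payload):
    """Inverse of encode_input: returns (player_id, pressed_set, stick_x, stick_y)."""
    player_id, mask, stick_x, stick_y = INPUT_MESSAGE.unpack(payload)
    pressed = {name for bit, name in enumerate(BUTTONS) if mask & (1 << bit)}
    return player_id, pressed, stick_x, stick_y


message = encode_input(2, {"left", "fire"}, stick_x=-127)
print(len(message))                       # 4
print(sorted(decode_input(message)[1]))   # ['fire', 'left']
```

A fixed layout like this keeps every input message the same size. The trade-off is that both ends must agree on the layout exactly.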

Second Attempt

Those socket exceptions were agitating me to no end, so I abandoned the effort to give each component its own thread and focused on getting a better grasp of Lidgren. Instead, I ran the client and server from an Update() method called by the game’s application loop.
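This single-threaded arrangement treats networking as one more thing the game loop updates each frame, with no blocking calls. The Python sketch below illustrates that shape, with in-memory queues standing in for the network. `LoopbackServer`, `LoopbackClient`, and `game_loop_frame` are hypothetical names, and none of this is Lidgren's API.

```python
from collections import deque


class LoopbackServer:
    """Hypothetical server pumped from the game loop instead of its own thread."""

    def __init__(self):
        self.inbox = deque()   # messages sent by clients, not yet relayed
        self.clients = []

    def update(self):
        # Relay everything that arrived since the last frame, then return immediately.
        while self.inbox:
            message = self.inbox.popleft()
            for client in self.clients:
                client.inbox.append(message)


class LoopbackClient:
    """Hypothetical client; in-memory deques stand in for the network."""

    def __init__(self, server):
        self.server = server
        self.inbox = deque()   # messages relayed by the server
        self.received = []     # what this client has processed so far
        server.clients.append(self)

    def send(self, message):
        self.server.inbox.append(message)

    def update(self):
        # Non-blocking: handle only what is already waiting.
        while self.inbox:
            self.received.append(self.inbox.popleft())


def game_loop_frame(server, clients):
    """One pass of the application loop: networking is just another Update() call."""
    server.update()
    for client in clients:
        client.update()
    # ...game logic and drawing would follow here...
```

Each `update()` only drains what is already queued, so a slow network never stalls a frame. The cost is that message handling is tied to the frame rate.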

Overall, this approach was very successful. Clients appeared to update synchronously and there was no visible lag. Victory seemed close, but playtesting revealed that the game fell out of sync after the second level of the dungeon.

Third Attempt

The clients were already using synced random number generators, so I tried to overcome the syncing issues by pre-loading the entire dungeon and then starting the game simultaneously on each client. Playtesting showed this to be an effective way of keeping clients in sync, until we saw that in some cases a player would end up in two different locations on different clients. Enemies base much of their AI on player positions, so with one player in two different spots, two clients connected to the same game diverged widely.
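One plausible reason pre-loading helped is that it separates level layout from gameplay. This is my reading; the post does not confirm it. If levels are generated on demand from the same RNG that gameplay also draws from, then any difference in how many numbers a client consumed on level 1 changes level 2. A hypothetical Python sketch contrasting the two approaches:

```python
import random

LEVEL_COUNT = 5       # hypothetical dungeon depth
ROOMS_PER_LEVEL = 4   # hypothetical rooms per level


def generate_level(rng):
    # Stand-in level generator: a level is just a list of room sizes.
    return [rng.randint(3, 12) for _ in range(ROOMS_PER_LEVEL)]


def pregenerate_dungeon(seed):
    """Build every level before play starts; gameplay never touches this RNG."""
    rng = random.Random(seed)
    return [generate_level(rng) for _ in range(LEVEL_COUNT)]


def lazy_level(seed, gameplay_draws_per_level, depth):
    """Anti-pattern for comparison: levels are generated on demand from the same
    RNG that gameplay draws from, so layout depends on how much randomness
    each client consumed while playing earlier levels."""
    rng = random.Random(seed)
    level = generate_level(rng)
    for _ in range(depth - 1):
        for _ in range(gameplay_draws_per_level):
            rng.random()  # randomness consumed by combat, AI, particles, ...
        level = generate_level(rng)
    return level
```

With pre-generation, `pregenerate_dungeon(seed)` is identical on every client no matter what happens during play. With `lazy_level`, two clients that made even one different number of gameplay draws can see different layouts from the second level onward.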

Fourth Attempt

How was this happening? I was starting to pull my hair out at this point and wondered if I had bitten off more than I could chew. Converting a local co-op game to LAN is a textbook case of “don’t even try” from everything I’d read and heard.

Rather than throw in the towel, I rolled up my sleeves and went back to the whiteboard. Debugging revealed that two separate clients could perceive the state of a player’s inputs differently. Each client stored a cache of player input states and updated that cache whenever either client made a change. Although that should work in theory, it was very easy for two caches to fall slightly out of sync at just the wrong moment. Suddenly, the same character is in two different positions.
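This failure mode is easy to reproduce in miniature. In the hypothetical Python sketch below, a remote player holds "right" and then releases it. Each client's cached copy of that input changes only when it processes the release message. If one client processes that message a single frame later than the other, the same character ends up in two places.

```python
def position_after(frames, release_seen_at_frame):
    """Hypothetical client: moves a remote player right while its cached
    input says "right" is held. The cache flips when the release message
    is processed on `release_seen_at_frame`."""
    cached_right_held = True  # cached copy of the remote player's input
    x = 0
    for frame in range(frames):
        if frame == release_seen_at_frame:
            cached_right_held = False  # release message updates the cache
        if cached_right_held:
            x += 1
    return x


client_a = position_after(frames=10, release_seen_at_frame=3)
client_b = position_after(frames=10, release_seen_at_frame=4)
print(client_a, client_b)  # 3 4
```

Neither cache is wrong on its own. The divergence comes entirely from when each copy was updated relative to the simulation.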

Removing the cache appeared to be the only viable option. Instead, input would be detected through a synchronous query against the server. To keep things synchronous, the server needed to run in its own thread, independent of the game’s application loop.
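A synchronous query against a server on its own thread can be sketched in Python with `threading` and `queue`. The server thread owns the authoritative input state, and each query blocks until that thread replies. This is a hypothetical in-process illustration: `InputServer`, `set_input`, `is_pressed`, and `shutdown` are invented names, and the real version went over the network through Lidgren.

```python
import queue
import threading


class InputServer(threading.Thread):
    """Hypothetical authoritative input server running on its own thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.requests = queue.Queue()
        self.inputs = {}  # player id -> frozenset of pressed buttons; touched only by this thread

    def run(self):
        while True:
            kind, payload, reply = self.requests.get()
            if kind == "set":
                player_id, pressed = payload
                self.inputs[player_id] = frozenset(pressed)
                reply.put(None)
            elif kind == "get":
                reply.put(self.inputs.get(payload, frozenset()))
            elif kind == "shutdown":
                reply.put(None)
                return

    def _round_trip(self, kind, payload=None):
        # The caller blocks until the server thread answers: a synchronous query.
        reply = queue.Queue(maxsize=1)
        self.requests.put((kind, payload, reply))
        return reply.get()

    def set_input(self, player_id, pressed):
        self._round_trip("set", (player_id, pressed))

    def is_pressed(self, player_id, button):
        return button in self._round_trip("get", player_id)

    def shutdown(self):
        self._round_trip("shutdown")


server = InputServer()
server.start()
server.set_input(1, {"left"})
print(server.is_pressed(1, "left"))   # True
print(server.is_pressed(1, "right"))  # False
server.shutdown()
server.join()
```

Every `is_pressed` call costs a full round trip to the server thread. Over a real network, paying that on every input check adds up quickly.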

Fifth Attempt

Synchronous messages between the clients and the server now kept all players in sync. A great victory was on the horizon. Playtesting began, and then came another roadblock: the players were perfectly in sync, but the enemies could still make different decisions on each client. Perhaps one client was updating more frequently than the other?

Sixth Attempt

Why weren’t the monsters in sync? It turned out that keeping players in sync was only cosmetic. Checking the logs revealed that one client could sometimes receive a movement message that another client did not. I refactored the server and client to use custom serialization logic for message contents that didn’t generate garbage. Even so, polling the server for every input check meant far too many messages per second. The solution worked, but only with a throttle so large that the game was effectively unplayable.
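Garbage-free serialization means allocating a message buffer once and writing each message into it in place, instead of building new arrays for every message. Python is garbage collected throughout, so the sketch below only shows the shape of that technique, using `struct.Struct.pack_into`. The record layout, `MessageWriter`, and `read_records` are hypothetical; the post does not describe its actual format.

```python
import struct

# Assumed record layout: turn number (uint16), player id (uint8), button bitmask (uint8).
RECORD = struct.Struct("<HBB")
MAX_PLAYERS = 4  # hypothetical upper bound used to size the buffer


class MessageWriter:
    """Reuses one preallocated buffer for every outgoing input message."""

    def __init__(self):
        # One count byte followed by up to MAX_PLAYERS fixed-size records.
        self.buffer = bytearray(1 + RECORD.size * MAX_PLAYERS)

    def write(self, turn, inputs):
        """Serialize [(player_id, mask), ...] in place; return a view of the used bytes."""
        if len(inputs) > MAX_PLAYERS:
            raise ValueError("more players than the preallocated buffer can hold")
        self.buffer[0] = len(inputs)
        offset = 1
        for player_id, mask in inputs:
            RECORD.pack_into(self.buffer, offset, turn, player_id, mask)
            offset += RECORD.size
        # A memoryview slice exposes the bytes without copying the buffer.
        return memoryview(self.buffer)[:offset]


def read_records(payload):
    """Decode a message produced by MessageWriter.write into (turn, player_id, mask) tuples."""
    count = payload[0]
    return [RECORD.unpack_from(payload, 1 + i * RECORD.size) for i in range(count)]
```

Reading back uses `unpack_from` at computed offsets, so the receiver can also decode straight out of a reused receive buffer.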

To reduce the amount of communication per second, I added the cache back to the client. I also changed the server to sleep when it wasn’t in use to avoid CPU thrashing. Finally, I changed each client to update the game logic only after the server sent the current state of player input. The player input state was cached on the client, but only for a single turn of the game; by the next turn, any changes in input would again be sent to every client. Lag-free play was achievable for any number of clients as long as the game’s frame rate was reduced to 24 FPS, which was an acceptable trade-off for me.
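The per-turn cache can be sketched as a client that will not advance until the server has delivered input for the exact turn it is about to simulate. Once that turn is done, the client throws the cached input away. `TurnClient` and its methods are hypothetical; this is a Python illustration of the rule described above, not the game's implementation.

```python
class TurnClient:
    """Hypothetical client that simulates a turn only with server-supplied input."""

    def __init__(self):
        self.turn = 0            # next turn this client will simulate
        self.pending = None      # (turn, inputs) received from the server, valid for one turn
        self.player_x = {}       # player id -> x position

    def on_server_inputs(self, turn, inputs):
        # inputs: {player_id: set of pressed buttons} for exactly one turn.
        self.pending = (turn, inputs)

    def update(self):
        """Called every frame. Returns True if a turn was simulated."""
        if self.pending is None or self.pending[0] != self.turn:
            return False  # no input for this turn yet: wait instead of guessing
        _, inputs = self.pending
        for player_id, pressed in inputs.items():
            dx = ("right" in pressed) - ("left" in pressed)
            self.player_x[player_id] = self.player_x.get(player_id, 0) + dx
        self.turn += 1
        self.pending = None  # the cache is single-use; next turn needs fresh input
        return True
```

At the 24 FPS cap, each frame has about 41.7 ms (1000 / 24) to receive the turn's input and run the simulation. A client whose input hasn't arrived yet simply waits a frame.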

At this point I was certain that players never went out of sync. However, a bit more playtesting revealed that monsters fell out of sync fairly early on and that particle animations almost never matched up. I would not have cared about the particle animations on their own, but they were a sign that things were not yet working properly.

Seventh Attempt

The next architecture had the server grant clients permission to simulate a turn only after every client had “checked in” after processing the previous turn. Whenever permission was granted, the server also distributed each player’s current input state. Clients could send input updates to the server at any point, so button presses still had near-instant feedback while game updates were throttled by the server. This approach appeared successful at first.
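The check-in rule is essentially a turn barrier. The Python sketch below models only the server's bookkeeping, with no networking. `LockstepServer`, `update_input`, and `check_in` are hypothetical names; the real system ran over Lidgren.

```python
class LockstepServer:
    """Hypothetical turn gate: a turn is granted only after every client checks in."""

    def __init__(self, client_ids):
        self.client_ids = set(client_ids)
        self.checked_in = set()
        self.current_inputs = {cid: frozenset() for cid in self.client_ids}
        self.turn = 0  # the turn clients are currently processing

    def update_input(self, client_id, pressed):
        # Accepted at any time, so button presses reach the server immediately.
        self.current_inputs[client_id] = frozenset(pressed)

    def check_in(self, client_id, finished_turn):
        """Record that `client_id` finished `finished_turn`.

        Returns (next_turn, inputs_snapshot) once every client has checked in
        for the current turn; otherwise returns None.
        """
        if finished_turn != self.turn:
            raise ValueError(
                f"client {client_id} checked in for turn {finished_turn}, expected {self.turn}"
            )
        self.checked_in.add(client_id)
        if self.checked_in != self.client_ids:
            return None  # still waiting on at least one client
        self.checked_in.clear()
        self.turn += 1
        # Every client receives the same snapshot for the turn it may now run.
        return self.turn, dict(self.current_inputs)
```

Each grant carries a single frozen snapshot of everyone's input, so every client simulates the turn from identical data. Meanwhile, `update_input` can still be called at any moment.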

Eighth Attempt

Playtesting the seventh architecture on two separate computers revealed that the clients easily fell out of sync. Through lots of trial and error, I found that the clients were not actually running in lockstep: any time one of the threads involved in the game paused, a client could “get ahead” of the others. To amend this, I added the rule that each client re-seeds its random number generator during every input sync phase. After more work to ensure that clients could run only one turn at a time without getting ahead of one another, this architecture proved successful. There are undoubtedly a number of optimizations needed to make the game run fluidly on all clients, but the difficult task of keeping clients in sync appears to be finished.
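Re-seeding at each sync phase can be sketched as deriving each turn's seed from values every client already agrees on. The post does not say which seed the clients used. Combining a session seed with the turn number, as below, is an assumption, as are the `TurnRng` name and the bit layout. Having the server send a fresh seed with each grant would be an equally plausible variant.

```python
import random


class TurnRng:
    """Hypothetical per-turn RNG, re-seeded at every input sync phase."""

    def __init__(self, session_seed):
        self.session_seed = session_seed
        self.rng = random.Random()

    def begin_turn(self, turn):
        # Seed only from values all clients share, so an extra or missing draw
        # in an earlier turn cannot carry over into this one.
        self.rng.seed((self.session_seed << 32) | (turn & 0xFFFFFFFF))
        return self.rng


client_a = TurnRng(session_seed=99)
client_b = TurnRng(session_seed=99)

# Turn 5: client A draws three numbers (e.g. it ran ahead), client B draws one.
rng_a = client_a.begin_turn(5)
for _ in range(3):
    rng_a.random()
client_b.begin_turn(5).random()

# Turn 6: both clients are back in agreement despite the earlier mismatch.
print(client_a.begin_turn(6).random() == client_b.begin_turn(6).random())  # True
```

Re-seeding limits how far a mismatch in random draws can spread, but it does not by itself stop a client from running ahead. As noted above, separate work was needed to keep clients to one turn at a time.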

Wrap Up

All in all, this has been an amazing month for Aigilas and a great boost to my morale. I cannot give enough credit to the Lidgren library. Its API is very clean and takes minimal understanding to get the ball rolling. That simplicity belies the power it affords me as a developer: it is extremely flexible. All of the architectures I tried were built on Lidgren, and it handled each one like a champ.

My biggest takeaway from the past month came after completing the seventh architecture. I believed the work was complete at that point and enjoyed the entire day on a high note. Little did I know that I would come crashing back down later that night during the first live LAN test. The crash was bad enough that I stopped working on Aigilas and thought about dropping the project altogether. Lying in bed, I considered the ramifications and fell asleep still unsure about the game’s future.

It took a day away from the project and an obsessive love of this game to see an exhausting task through to the end. With this finally finished, I can continue with a handful of technical enhancements and then move on to implementing more gameplay features.

I have never been more excited about this game than I am right now.