Epic Games has released a lengthy “postmortem” detailing the trials and tribulations involved in the release of the Playground Limited Time Mode.
Originally released on June 27, Playground Mode was easily the most highly anticipated LTM to come to Fortnite Battle Royale so far, but it was disabled soon after its release.
The Matchmaking Service (MMS) could not handle the extra strain, as Epic explains:
“Since Playground mode makes matches for every 1-4 people instead of 100, it requires between 25 and 100 times as many matches as normal depending on party size. While we could pack virtual servers a bit tighter per physical CPU for Playground mode, we still had to use 15 times as many servers as we had been running for the other modes. We were able to secure the total server capacity, but it meant the list that each node had to manage was suddenly 15 times as long as well,"
“When we released Playground, the overwhelming demand quickly exhausted the local lists for MMS nodes far faster than the system could refresh them. Each node was running to every other node to request extra servers that just weren’t there yet, or at the very least took a long time to pick out of the non-local lists. The long compute times caused the CPU to end up with a backlog of pending requests, resulting in a feedback loop that eventually caused the system to grind to a halt.”
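The “25 to 100 times” figure in Epic's explanation follows from simple division, and a minimal sketch can check it. The player count below is a made-up number for illustration only, and the function names are hypothetical; the only figures taken from the postmortem are the 100-player standard match and the 1-4 player Playground party.

```python
import math

# From the postmortem: a standard match holds 100 players.
PLAYERS_PER_STANDARD_MATCH = 100


def matches_needed(players: int, party_size: int) -> int:
    """Matches required to seat `players` when each match holds `party_size`."""
    return math.ceil(players / party_size)


def playground_multiplier(party_size: int, players: int = 100_000) -> float:
    """How many times more matches Playground needs than a standard mode.

    `players` is an assumed, illustrative concurrent player count.
    """
    standard = matches_needed(players, PLAYERS_PER_STANDARD_MATCH)
    playground = matches_needed(players, party_size)
    return playground / standard


if __name__ == "__main__":
    for size in (1, 2, 3, 4):
        print(f"party of {size}: {playground_multiplier(size):.1f}x as many matches")
```

Full squads of four sit at the low end of the range (25 times) and solo players at the high end (100 times), which matches the range Epic quotes.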
It took Epic longer than they expected to fix the issues, but eventually the mode returned on July 2.
Epic states that the solution involved giving Playground Mode its own service cluster and then giving that cluster the ability to rebalance sessions from other nodes:
“Once we identified the root of the problem as the exhaustion of sessions from local lists, the solution was to give the cluster the ability to bulk rebalance sessions from other nodes to ensure repeated lookups were not necessary. With the system constantly shifting regional capacity from nodes with an excess to nodes that might be running low, the odds of a node running dry for a particular region and having to search outside its local list have been drastically reduced,
“We pushed the load-testing process to the limits during our MMS restructuring, because the scale of what we were trying to simulate was so far beyond normal usage or testing patterns. We needed to spin up many millions of theoretical users and hurl them at our Playground MMS system in a big, crashing wave in an attempt to strain our new session rebalancer. While the tweak - test - evaluate cycle took several hours per loop, it allowed us to develop and refine the rebalance behavior to a point where we felt it could stand up to the traffic, as well as to identify and fix edge-case bugs that could have torpedoed the effort to bring Playground back online.”
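The bulk rebalancing idea Epic describes can be illustrated with a toy model. This is only a sketch under stated assumptions, not Epic's implementation: the node names, the `low_water` threshold, and the "even share" target are all hypothetical, and each node's local list is reduced to a simple count of free sessions for a single region.

```python
from typing import Dict, List, Tuple


def rebalance(free_sessions: Dict[str, int], low_water: int) -> List[Tuple[str, str, int]]:
    """Shift free sessions in bulk from nodes with a surplus to nodes running low.

    `free_sessions` maps a (hypothetical) node name to the number of free
    sessions in its local list. The dict is updated in place, and the list of
    (donor, receiver, amount) transfers is returned.
    """
    transfers: List[Tuple[str, str, int]] = []
    if not free_sessions:
        return transfers

    # Assumed policy: top needy nodes up to an even share of the total.
    target = sum(free_sessions.values()) // len(free_sessions)
    needy = [node for node, count in free_sessions.items() if count < low_water]

    for receiver in needy:
        want = target - free_sessions[receiver]
        # Ask the best-stocked nodes first so each transfer is as large as possible.
        donors = sorted(free_sessions, key=free_sessions.get, reverse=True)
        for donor in donors:
            if want <= 0:
                break
            surplus = free_sessions[donor] - target
            if donor == receiver or surplus <= 0:
                continue
            amount = min(surplus, want)
            free_sessions[donor] -= amount
            free_sessions[receiver] += amount
            transfers.append((donor, receiver, amount))
            want -= amount
    return transfers


if __name__ == "__main__":
    nodes = {"node-a": 90, "node-b": 10, "node-c": 50}
    print(rebalance(nodes, low_water=20))
    print(nodes)
```

In this model a node that is running low receives one large transfer ahead of time, rather than querying every other node each time a matchmaking request finds its local list empty, which is the repeated-lookup pattern the postmortem identifies as the cause of the backlog.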
To conclude, the developers state that they learned a great deal about their own matchmaking system throughout the Playground Mode saga and say that they did not properly anticipate the rush of players who would want to play it.
“In short, we learned a lot about our own matchmaking system and its failure points as well. We planned and prepared for what we thought to be the maximum sustained matchmaking throughput and capacity based on the size of our player base (plus a healthy buffer), but didn’t properly anticipate the edge-case of the initial “land rush” of players exhausting local lists,
“The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways. We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.”
The full postmortem can be found here.