SCHNAIL Ranked Play: How to evaluate AI strength

The current problem I’m working on for SCHNAIL is one of the most frequently asked questions regarding the bots on the platform: how strong are these bots? We have pondered this over multiple episodes of the Undermind. The main problem with measuring this is that elsewhere, bots play each other, not humans, while on the SCHNAIL platform they only play humans, so their relative strengths don’t carry over. And even though most of them were taken from SSCAIT (which is exclusively bot vs. bot) at some point, they are not synchronized, and in the long run I expect serious divergences. This has to do with incentives (winrate vs. ELO) and scale: SSCAIT and BASIL run games 24/7 – BASIL does it in headless mode, so a match takes only 2-3 minutes or so. This is not really comparable to SCHNAIL – if I just took the bot ratings from them as seeds, they would be disproportionate one way or the other. I decided to start fresh and try to measure bot strength on SCHNAIL itself, with the express purpose of gauging human vs. bot skill levels.

[Image: the Terminator. Caption: 3/10, would not get exterminated again]

First, I realized this will be a bumpy road. No matter what I implement, there will be quirks, exploits, bugs, murder, mayhem, and wailing of women. So let’s do a test run first – let’s call it Season 0. I want a fair competition from both sides.

Key design principles:

The scoring system: It will just be a form of ELO at the start. We can make adjustments later, but ELO is built with 1v1 play in mind, so it’s sufficient. The base rating will be 1500, so it’s comparable to most modern rankings. At any rate, it will be an approximation, but it’s a start.
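To make this concrete, here is a minimal sketch of a standard Elo update with the 1500 base – the K-factor of 32 is an illustrative assumption, not a finalized SCHNAIL parameter:

```python
# Minimal sketch of a standard Elo update. K = 32 is an illustrative
# choice, not a finalized SCHNAIL parameter.
BASE_RATING = 1500
K = 32

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one game (no draws)."""
    score = 1.0 if a_won else 0.0
    delta = K * (score - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```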

Matchmaking: Just like in ladder play, the opponent is not known beforehand. The bot will be anonymized to prevent opponent-specific exploits. The opponent will be assigned semi-randomly, with your rating (and the opponent’s) kept in mind. The exact weights here are not finalized yet.
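As a rough illustration of what “semi-random, with ratings kept in mind” could look like, here is a sketch – the Gaussian-style weighting and the 200-point spread are placeholders of mine, since the actual weights are not finalized:

```python
import random

def pick_opponent(player_rating: float, bots: dict[str, float],
                  spread: float = 200.0) -> str:
    """Pick a ranked bot semi-randomly: every bot can be drawn, but bots
    rated close to the player are weighted more heavily. The weighting
    function and the spread are illustrative placeholders."""
    names = list(bots)
    weights = [2.0 ** (-((bots[name] - player_rating) / spread) ** 2)
               for name in names]
    return random.choices(names, weights=weights, k=1)[0]
```

The chosen bot would then be shown to the player under an anonymized name.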

Bot versions and updates: Bot updates are allowed during the season – after all, humans change and adapt between matches, and updates can go either way; we’ve seen bot scores plummet on multiple occasions because of a buggy update. Bot authors are allowed to mark their bots as “practice only”, so they do not participate in ranked play. Important: If you change this, the bot’s ELO will be reset. It would be unfair to mark a bot practice-only, push a bunch of updates, then re-enable a now-stronger bot.
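In terms of the sketch above, the rule is simply this (a hypothetical helper, reusing BASE_RATING from the Elo sketch):

```python
def set_practice_only(bot, practice: bool) -> None:
    """Toggling ranked participation in either direction resets the
    rating, so practice mode can't be used to sneak in a stronger bot.
    Hypothetical helper; BASE_RATING = 1500 as in the Elo sketch."""
    if bot.practice_only != practice:
        bot.practice_only = practice
        bot.rating = BASE_RATING
```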

That raises the question: Where do I test my bots? After all, the SCHNAIL environment is a bit different from your local dev setup. The answer is a “dev mode” feature currently in progress. This will allow you to add your bot directly in the client (only locally!) and test it in an environment close to the real thing.

Also, bots are currently only updated on client startup. A feature will be added where, when you press Play, the bot version is checked and a new one is downloaded before the match starts, so you never play against an outdated bot.
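The flow could look something like this – the endpoint and file layout here are made up purely for illustration, not the actual SCHNAIL API:

```python
import json
import pathlib
import urllib.request

# Hypothetical endpoint -- purely illustrative, not the real SCHNAIL API.
API = "https://example.invalid/api/bots"

def ensure_latest(bot_id: str, bot_dir: pathlib.Path) -> None:
    """On Play: compare the locally cached bot version against the server
    and re-download the bot if the local copy is stale."""
    with urllib.request.urlopen(f"{API}/{bot_id}/version") as resp:
        remote = json.load(resp)["version"]
    version_file = bot_dir / "version.txt"
    local = version_file.read_text().strip() if version_file.exists() else ""
    if local != remote:
        with urllib.request.urlopen(f"{API}/{bot_id}/download") as resp:
            (bot_dir / "bot.zip").write_bytes(resp.read())
        version_file.write_text(remote)
```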

Map pool: I have to make a compromise here, unfortunately. The map pool will be the SSCAIT map pool, because most bots are stable and working on those. Many bots will simply not work on new maps – note that this is temporary. Since the bots were grabbed from SSCAIT, and not all authors are actively working on them anymore, this is only fair. This will surely change in later seasons, but the current bot pool has some inertia in this regard.

Accountability and anti-cheating: It is my belief that as soon as you put any arbitrary rating number in front of people, they will try to cheat and exploit the system to make that number go higher. The actual impact of that number is not really important.

All ranked games will be recorded – not necessarily available to everyone, but searchable, and if cheating is suspected, they will be reviewed. There will be a number of additional anti-cheat checks (some of them already implemented), which I will not detail here for obvious reasons. Don’t get me wrong, this is not security by obscurity – I will discuss these measures with my fellow developers – I just don’t want to give script kiddies a head start.

Prizes and rewards: Just bragging rights for now (unless someone decides to sponsor it), as this is a test run. From the next season on, I would like to add some. We also have some collaborations in progress, so nothing is decided yet.

Duration of Season 0: ¯\_(ツ)_/¯

Crashes and draws: A crash counts as a defeat for the side that crashed, and so does exiting the game. A game is only counted if a replay is submitted, so if the game breaks before the actual match starts, it is discarded. When you press Play, the intent to start a game is recorded, so you can’t just disconnect and re-roll the matchup until you get what you want. After starting a game, a grace period is given to submit the replay – something like 12 hours, so a long game is not a valid excuse. If the replay is not submitted, the match is counted as a loss for the human player.

Draws are not possible.
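Put together, the result rules could look roughly like this – the helper and its names are illustrative, not the actual backend code:

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=12)  # "something like 12 hours"

def resolve_match(started_at: datetime, replay_submitted: bool,
                  crashed_side: str | None, winner: str | None) -> str:
    """Sketch of the result rules above. Returns 'human', 'bot', or
    'pending'. Illustrative helper, not actual SCHNAIL backend code."""
    now = datetime.now(timezone.utc)
    if not replay_submitted:
        # The intent to play was recorded on Play; past the grace period,
        # a missing replay counts as a loss for the human player.
        return "bot" if now - started_at > GRACE else "pending"
    if crashed_side is not None:
        # A crash (or exiting the game) is a defeat for that side.
        return "human" if crashed_side == "bot" else "bot"
    assert winner in ("human", "bot")  # draws are not possible
    return winner
```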

This is all I have for now – I wanted to give some insight into the design process. I’d like to start this soon, but again, there is no deadline as of yet. Stay tuned, and until then:

Closing thoughts

As you might have guessed, a lot of groundwork is needed before ranked play can happen. I’m pretty sure we will discover some loopholes and exploits in the system as well. Much of the work needed to make this a reality is virtually invisible – various backend calls to ensure stability and accountability. But we are getting there!

Thanks for reading! Please consider supporting Making Computer Do Things and SCHNAIL on Patreon – even $1 buys us some server time! You can also follow on Facebook, Twitter, YouTube, or Twitch!

Comments

Jay Scott: Instead of resetting a bot’s rating when it is updated, an alternative might be to use a Glicko-style rating system, which keeps track of the uncertainty in the rating: leave the rating unchanged, but set the uncertainty to a large value on update.
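A minimal sketch of that idea – keep the rating, inflate the rating deviation (RD) on update; 350 is Glicko’s conventional starting deviation, and the class and names are hypothetical:

```python
MAX_RD = 350.0  # Glicko's conventional deviation for an unrated player

class BotRating:
    """Hypothetical Glicko-style rating record for a bot."""
    def __init__(self, rating: float = 1500.0, rd: float = MAX_RD):
        self.rating = rating
        self.rd = rd  # rating deviation: how uncertain the rating is

    def on_bot_update(self) -> None:
        """Bot binary changed: keep the rating, reset the uncertainty
        so new results move the rating quickly again."""
        self.rd = MAX_RD
```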
