Thursday, February 1, 2018

A Data-Driven Strategy Guide for Through the Ages, Part 3

Index:

1. Introduction (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Current Article)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4) 
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

2. Data Analysis

2.2 Classification based on Key Cards played

Let us first recall this Figure from the previous section.
This shows how well we can predict the final outcome based on "infrastructure development" up to a certain round.  It grows monotonically, as it should.  However, it grows at clearly different rates.  Given the error bars, we can be confident that the slope between rounds 4 and 8 is almost half of the slopes before and after.  This implies that the status changes during those few rounds are less relevant to the final result.

Based on my experience with this game, there is a likely reason.  In order to develop any aspect of infrastructure, one must play certain cards.  Increasingly better cards become available at different stages of the game.  Thus, the biggest differences occur when players set up those cards.  That happens roughly at round 4 for Stage I cards, and round 8 for Stage II cards, which is why those rounds have a higher impact on the final result.  Here, we will therefore try to learn strategic lessons from the usage of those cards.

There are 88 such cards in the game.  We again go through the scraped data to get an 88-dimensional vector per player per game.  Although we also recorded when and how these cards were played, we will not use that information yet.  All we care about for now is whether a card was played, so the value of each component of this 88-dimensional "Key Cards" vector is either 0 (not played) or 1 (played).
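To make the encoding concrete, here is a minimal sketch in Python.  The card names, record layout, and field names below are made up for illustration; the real scraped data covers all 88 key cards.

```python
import numpy as np

# Illustrative only: three card names stand in for the full list of 88 key cards,
# and the record layout is a made-up stand-in for the scraped data.
KEY_CARDS = ["Julius Caesar", "Pyramids", "Hanging Gardens"]
CARD_INDEX = {name: i for i, name in enumerate(KEY_CARDS)}

def key_card_vector(cards_played):
    """Encode one player's game as a 0/1 vector: 1 if that key card was played."""
    v = np.zeros(len(KEY_CARDS), dtype=np.int8)
    for name in cards_played:
        if name in CARD_INDEX:
            v[CARD_INDEX[name]] = 1
    return v

records = [
    {"cards": ["Julius Caesar", "Pyramids"], "good_result": 1},
    {"cards": ["Hanging Gardens"],           "good_result": 0},
]
X = np.vstack([key_card_vector(r["cards"]) for r in records])  # one row per player per game
y = np.array([r["good_result"] for r in records])
```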

We repeat exactly the same procedure as in the previous section. An SVM with a linear kernel can classify "good" or "bad" performance based on the "Key Cards" vector.  We get a performance of 70% and a weight for each card.
Please do not squint at the above Figure.  It is not very useful, for the same reason that we did not want to use the last few rounds in the previous section.  In the above Figure, many of the cards with heavy weights are "big-late" cards.  They are played at the end of the game and cost a lot of resources and actions.  By that time, if a player has those resources and actions to spare, it is usually already a "good" performance anyway, so they are just cashing in their lead.  These are good "indicators" that a player is doing well, but they are not the strategic reason why such a player is good.
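For reference, the training step can be sketched roughly as follows.  The library (scikit-learn), the specific solver, and the hyperparameters are illustrative assumptions, and random stand-in data replaces the real 10k-game matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data: random 0/1 "Key Cards" vectors in place of the real 10k-game matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10000, 88))
y = (X[:, 0] + rng.random(10000) > 0.9).astype(int)   # placeholder labels

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)

print("validation accuracy:", clf.score(X_va, y_va))
weights = clf.coef_[0]   # one weight per card, as shown in the Figure
```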

A more careful analysis is required to coax causation out of the observed correlations.  For example, we should focus on cards which are played earlier in the game.  Ideally, we should also compare cards with comparable opportunity costs.  Again, Go is an atypical example in which every single move has the same opportunity cost: another move.  In TtA, playing a card usually involves a combination of many types of resources, so no card has exactly the same opportunity cost as another.  Thus, we need to be more clever about "asking the right questions" here, and may need to cross-reference other statistics.

This does not mean that, once we have selected an appropriate subset of cards to compare, we can simply read off that subset from the above Figure.  A classifier is trained to classify.  In some sense, it uses the best clue first, and then, conditioned on that, it considers less significant clues.  (In a decision tree, this would be exactly true.  For a linear SVM, we can draw a 2-D example to convince ourselves that a similar effect is still present.)  In this particular example, our classifier is effectively asking whether a player has played those "big-late" cards, and using that as the main guidance for classification.  It then considers earlier cards with smaller weights to fine-tune the prediction.  This is similar to a conditional probability in reverse-time order (conditioning on the result to compare the causes, also known as a post-selection effect).  It is the opposite of the usual direction of strategic thinking, and will often lead to strange results at face value.

Let me give a concrete example here.  All those "big-late" cards cost a lot of resources.  However, all cards that help to produce more resources have small, or even negative, weights in the Figure.  This sounds weird, but it makes perfect sense for the classifier.  It first picks up the correlation between winning and big-late cards.  Then, if a player won without playing those big-late cards, she had better not have invested too much in resource production.  This criterion helps the classifier recognize those rarer winning situations.  We cannot conclude from the negative weights here that those resource-producing cards are bad in general.
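Here is the 2-D example alluded to above, as a toy sketch with made-up data.  Feature A stands in for a strong "big-late" signal and feature B for resource production.  On its own, B is positively correlated with winning, yet the joint linear SVM assigns it a negative weight once it has conditioned on A.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 2-D data: A plays the role of a strong "big-late" indicator,
# B the role of resource production.  B is positively correlated with A.
rng = np.random.default_rng(1)
n = 20000
A = rng.normal(size=n)
B = 0.8 * A + 0.6 * rng.normal(size=n)
y = (2 * A - B + 0.3 * rng.normal(size=n) > 0).astype(int)

print("corr(B, win):", np.corrcoef(B, y)[0, 1])                    # positive on its own
clf = LinearSVC(C=1.0, max_iter=10000).fit(np.column_stack([A, B]), y)
print("joint weights [A, B]:", clf.coef_[0])                       # B gets a negative weight
```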

In order to get a meaningful result for a subset of cards, we should train our classifier only on that subset.  Here is an example for all cards available early in the game.
First of all, the number on the top is the classifier performance.  At 59%, it is comparable to using the data up to round 4 in the previous section.  Since "playing cards" is closer to individual moves than "improving infrastructure", that is quite good news.  It also teaches a clear lesson.  One particular type of card stands out in the above Figure: all the Leader cards have significantly higher weights than other types.  This is consistent with advice from experienced players, and also consistent with the fact that leaders have a strictly smaller opportunity cost than other cards, yet they often provide more benefits.
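The subset training itself is just a matter of selecting columns before retraining.  A sketch, again with stand-in data and a made-up column list in place of the actual early-game cards:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in data; the column list is a hypothetical set of early-game cards.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(10000, 88))
y = rng.integers(0, 2, size=10000)

early_cards = [0, 3, 7, 12, 21]            # indices of the chosen subset (illustrative)
X_early = X[:, early_cards]

scores = cross_val_score(LinearSVC(max_iter=10000), X_early, y, cv=5)
print("subset validation accuracy:", scores.mean())
```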

We will repeat this process for various other choices of subsets.  If the performance is higher than 54%, we will analyze the results in Section 3.2.  We will need to select those subsets carefully, and often supplement the results with other statistics to get meaningful lessons.

Sometimes, the true power of a card does not manifest alone.  Seasoned gamers usually expect combos: two or more cards that combine to have a dramatically better effect.  There is a particularly simple way to detect the existence of combos.  We can train an SVM with a polynomial kernel of degree X.  If its performance turns out to be better than the linear kernel's, that implies the existence of combos involving X or fewer cards.
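A sketch of this combo test, again with random stand-in data and scikit-learn's SVC shown purely as an illustration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data; compare a linear kernel against a degree-2 polynomial kernel.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(2000, 88)).astype(float)
y = rng.integers(0, 2, size=2000)

linear = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5).mean()
poly2 = cross_val_score(SVC(kernel="poly", degree=2, C=1.0), X, y, cv=5).mean()
print(f"linear: {linear:.3f}   degree-2 polynomial: {poly2:.3f}")
# A clear gain from the polynomial kernel would hint at pairwise card combos.
```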

Unfortunately, and a bit surprisingly, such a situation has not come up yet in our analysis.  Actually, the validation performance of nonlinear kernels is usually worse than that of the linear kernel.  This is true even if we train on a subset of cards which seasoned players consider to have good combos.  This is probably because the chance for a player to get both cards of a combo is very small.  A card has a 10%-30% chance of being played by someone, so a particular 2-card combo shows up only about 4% of the time (roughly the square of a single-card probability).  Within 10k samples, there might be too much "noise" among those few cases.

Note that this is not about over-fitting, though.  In fact, we have only quoted validation performances so far, but in all examples they are actually close to the in-sample performance.  Even with only 10k games, the VC dimensions of our models have always been small enough to avoid over-fitting.  The "noise" here actually stops the classifier from recognizing any pattern associated with combos even in-sample.  It is not clear whether increasing the size of the data set would improve that.
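The over-fitting check itself is just a comparison of in-sample and validation accuracy.  A sketch (on the real card data these two numbers stay close; the random stand-in data here only shows the structure of the check):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data; the check is simply whether in-sample and validation accuracy agree.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(2000, 88)).astype(float)
y = rng.integers(0, 2, size=2000)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="poly", degree=2, C=1.0).fit(X_tr, y_tr)
print("in-sample accuracy: ", clf.score(X_tr, y_tr))
print("validation accuracy:", clf.score(X_va, y_va))
# The gap between these two numbers, if any, is the over-fitting.
```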
