Project Description

Data mined and processed:

A simple Python script using the requests and BeautifulSoup libraries crawled the Steam Community network for a couple of weeks and collected about 2M gamers together with the games they played. The data was stored in JSON format and converted to CSV for import into R.
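The JSON-to-CSV step could look roughly like the sketch below. The record layout (one JSON object per line with `usr_id` and a `games` list of `appid`/`name` entries), the column order, and the function name `json_lines_to_csv` are assumptions for illustration, not the original script.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, out_file):
    """Flatten one-JSON-object-per-line user records into one CSV row per (user, game)."""
    writer = csv.writer(out_file)
    writer.writerow(["user_id", "appid", "game"])
    rows = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: {"usr_id": ..., "games": [{"appid": ..., "name": ...}, ...]}
        for game in record.get("games", []):
            writer.writerow([record["usr_id"], game["appid"], game["name"]])
            rows += 1
    return rows


# Two made-up user records standing in for the crawler output
sample = [
    '{"usr_id": "123", "games": [{"appid": 440, "name": "Team Fortress 2"}, {"appid": 570, "name": "Dota 2"}]}',
    '{"usr_id": "456", "games": [{"appid": 570, "name": "Dota 2"}]}',
]
buffer = io.StringIO()
n_rows = json_lines_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice `out_file` would be a file opened with `newline=""`, and the resulting CSV can be loaded with `read.csv` as shown in the R code that follows.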

# read user_game data
games = read.csv("~/temp/mining/user_game_info_total.csv", header = TRUE)
games["appid"] = factor(games$appid)
summary(games[c("user_id", "game")])
cat("number of games:", length(unique(games$appid)))
cat("number of users in games:", length(unique(games$user_id)))
##           user_id                                       game         
##  nagahensem   :    2450   Team Fortress 2                 :  606617  
##  nlssosuck    :    2370   Dota 2                          :  524672  
##  ohmwreck3r   :    2332   Counter-Strike: Global Offensive:  457751  
##  GIPress2     :    2320   Left 4 Dead 2                   :  348505  
##  charlesonyett:    2316   Garry's Mod                     :  348279  
##  (Other)      :25853993   Counter-Strike: Source          :  343942  
##  NA's         :       4   (Other)                         :23236019

Methods:

Compute common users for each pair of games.

The dataset contains about 5000 game titles and 2M users, so computing the number of common users for roughly 25M (5000 x 5000) pairs of games would take forever on my old laptop. And because the data I currently have is only a small part of the entire Steam community, a scalable solution is needed as well. Fortunately, we have Hadoop MapReduce:

{"usr_id": "123", "games": ["A", "B", "C"]}
           | 
           |Mapper
           --------> A, B, 123   |
                     B, C, 123   | Sorting 2 primary key fields
                     A, C, 123   V
                       |
                       |Reducer
                       --------> A, B, len(unique(userids))
                                 A, C, len(unique(userids))

Running the entire dataset on Amazon Elastic MapReduce took approximately 40 normalized instance hours. The complexity is O(N(users) * avg(games)^2), since each user contributes one record for every pair of games they played.
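The mapper/reducer flow in the diagram above can be simulated locally. This is a minimal sketch, not the job that ran on EMR: the function names `mapper` and `reducer` and the in-memory `sorted` call (standing in for Hadoop's shuffle/sort on the two key fields) are assumptions. A user with k games emits k(k-1)/2 triples, which is where the quadratic term in the complexity comes from.

```python
from itertools import combinations, groupby


def mapper(record):
    """Emit one (game1, game2, user_id) triple per unordered pair of the user's games."""
    games = sorted(set(record["games"]))  # sort so (A, B) and (B, A) share one key
    for game1, game2 in combinations(games, 2):
        yield (game1, game2, record["usr_id"])


def reducer(sorted_triples):
    """Count distinct users per (game1, game2) key; input must be sorted by that key."""
    for pair, group in groupby(sorted_triples, key=lambda t: (t[0], t[1])):
        users = {user_id for _, _, user_id in group}
        yield (pair[0], pair[1], len(users))


records = [
    {"usr_id": "123", "games": ["A", "B", "C"]},
    {"usr_id": "456", "games": ["B", "A"]},
]

# sorted() stands in for the shuffle/sort phase between map and reduce
triples = sorted(t for r in records for t in mapper(r))
common_users = list(reducer(triples))
print(common_users)
```

Sorting each user's game list before pairing keeps the key order canonical, so the reducer never sees the same pair under two different keys.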

# read graph
data = read.table("~/temp/mining/hadoop/comm_results")
names(data) = c("game1", "game2", "total.common.user")
head(data)
##   game1  game2 total.common.user
## 1    10  10080               184
## 2    10  10150              2778
## 3    10  10220               395
## 4    10 102600             10162
## 5    10 102810                 1
## 6    10 104000               536

Define similarity scores:

The number of common users between two games is a good indicator of how similar they are. However, this absolute count doesn't treat all games equally. For a less popular game with fewer than a thousand players, the overlap with other games will be smaller than for a more popular game. To reduce this bias, a normalization step is needed.

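Similarity formula, matching the weight computation in the code below: sim(game1, game2) = common users / (total users of game1 + total users of game2)

[Figure: Venn diagram for 3 games]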

# similarity scores will be assigned to weight
E(graph)$weight = E(graph)$weight/(E(graph)$total.x + E(graph)$total.y)
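To illustrate what the weight normalization above does, here is a small sketch of the same ratio outside of igraph. The function name `similarity` and all the player counts are made up for illustration.

```python
def similarity(common_users, total_a, total_b):
    """Shared players divided by the sum of the two games' player counts."""
    denominator = total_a + total_b
    if denominator == 0:
        return 0.0  # no players at all: treat the pair as unrelated
    return common_users / denominator


# Hypothetical counts: a niche pair with few players and a popular pair with many
niche = similarity(common_users=400, total_a=800, total_b=900)
popular = similarity(common_users=10000, total_a=600000, total_b=500000)
identical = similarity(common_users=100, total_a=100, total_b=100)

print(niche, popular, identical)
```

Although the popular pair shares far more players in absolute terms, the niche pair gets the higher score. With this particular ratio the score tops out at 0.5, which is reached when both games have exactly the same players.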

Community Detection – 1st question

Fast greedy modularity optimization and short random walks (walktrap) are two algorithms appropriate for this kind of large graph with weighted edges (a flow-based algorithm like Infomap doesn’t work for such a dense graph):

  Graph community structure calculated with the fast greedy algorithm
  Number of communities (best split): 4 
  Modularity (best split): 0.1464766 

  Graph community structure calculated with the walktrap algorithm
  Number of communities (best split): 7 
  Modularity (best split): 0.1143453 
# community detection
fastgreedy.comm = fastgreedy.community(filtered100.graph, weights = E(filtered100.graph)$weight)
walktrap.community.comm = walktrap.community(filtered100.graph, weights = E(filtered100.graph)$weight)
summary(filtered.mapper[c("game.title", "fastgreed.cluster", "walktrap.cluster")])
##                   game.title   fastgreed.cluster walktrap.cluster
## Aion                   :   2   1:   3            1:754           
## Aion Collectors Edition:   2   2:2340            2:306           
## All Points Bulletin    :   2   3:1105            3:775           
## Arma 2                 :   2   4:  14            4:557           
## Crysis 2               :   2                     5:164           
## Darksiders             :   2                     6:505           
## (Other)                :3450                     7:401  
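The modularity values reported above measure how much more edge weight falls inside communities than would be expected by chance. The sketch below computes weighted modularity for a given partition. It is a plain-Python illustration rather than igraph's implementation, it assumes an undirected graph without self-loops, and the function name `modularity` and the toy graph are made up.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Weighted Newman modularity for an undirected graph given as (u, v, weight) edges."""
    total_weight = sum(w for _, _, w in edges)
    if total_weight == 0:
        return 0.0
    internal = defaultdict(float)  # weight of edges with both ends in the community
    strength = defaultdict(float)  # summed weighted degree of the community's nodes
    for u, v, w in edges:
        strength[membership[u]] += w
        strength[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    return sum(
        internal[c] / total_weight - (strength[c] / (2 * total_weight)) ** 2
        for c in strength
    )


# Toy graph: two triangles joined by a single edge, all weights 1
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
split = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
single = {node: 1 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Putting every node in one community always scores 0, so the relatively low scores above (about 0.15 and 0.11) are consistent with a dense graph in which communities are only weakly separated.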

Visualization – 2nd question

See the D3.js visualizations at the links below:

Fast Greedy clustering

Walktrap clustering

Discussion:

There is a noticeable difference between the results of the two clustering algorithms: the walktrap algorithm seems to divide the population more evenly than the fast greedy algorithm. The small sizes of groups 1 and 4 in the fast greedy results are an interesting topic for further research into the role of these nodes/groups in the game population.
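One way to put a number on "more evenly" is the normalized entropy of the cluster sizes, where 1.0 means all clusters have the same size. The sketch below applies it to the cluster sizes from the summary table above. The choice of this measure and the function name `normalized_entropy` are assumptions, not part of the original analysis.

```python
import math


def normalized_entropy(sizes):
    """Shannon entropy of the size distribution divided by log(number of clusters)."""
    total = sum(sizes)
    probs = [s / total for s in sizes if s > 0]
    if len(probs) < 2:
        return 0.0  # a single cluster has no spread to measure
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


# Cluster sizes taken from the summary of fastgreed.cluster and walktrap.cluster
fastgreedy_sizes = [3, 2340, 1105, 14]
walktrap_sizes = [754, 306, 775, 557, 164, 505, 401]

fastgreedy_evenness = normalized_entropy(fastgreedy_sizes)
walktrap_evenness = normalized_entropy(walktrap_sizes)
largest_share_fastgreedy = max(fastgreedy_sizes) / sum(fastgreedy_sizes)
largest_share_walktrap = max(walktrap_sizes) / sum(walktrap_sizes)

print(fastgreedy_evenness, walktrap_evenness)
print(largest_share_fastgreedy, largest_share_walktrap)
```

The largest fast greedy cluster holds about two thirds of the 3462 games, while the largest walktrap cluster holds under a quarter.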

games[fastgreedy.comm$membership == 1, ]
##       appid nodeid                      game.title count fastgreed.cluster 
## 37      563     37   Left 4 Dead 2 Authoring Tools   837                 1 
## 40      629     40 Portal 2 Authoring Tools - Beta   614                 1 
## 1652 107110   1652                  Bastion - Demo   200                 1 

games[fastgreedy.comm$membership == 4, ]
###                                           game.title count fastgreed.cluster
###                 Making History: The Calm & The Storm  1561                 4
###                                              RACE 07  5462                 4
###                               Clive Barker's Jericho  3007                 4
###                            Prison Tycoon 3: Lockdown   618                 4
###                                   Insecticide Part 1  1444                 4
###            King Arthur II - The Role-playing Wargame  1888                 4
###                                       Puzzler World    964                 4
###                            Diner Dash: Hometown Hero   288                 4
###                   Dream Chronicles: The Chosen Child   167                 4
###                                             Zenerchi   130                 4
###                    RACE 07 - Formula RaceRoom Add-On  4299                 4
###                                Dark Fall: Lost Souls  1401                 4
###  The Bureau: XCOM Declassified - Light Plasma Pistol   443                 4
###                                               Impire   783                 4

Conclusion:

With this type of visualization, gamers can identify games they are interested in without browsing the genre categories of the Steam store. Even games without many players can be found by others through the similarity links in the chord diagram.