Difference between revisions of "Networks Community Detection"
From OPOSSEM
Derek Ruths (talk | contribs) |
Derek Ruths (talk | contribs) (→Edge Betweenness) |
||
Line 34: | Line 34: | ||
==Edge Betweenness== | ==Edge Betweenness== | ||
− | One approach . | + | Let's take a moment to evaluate the implications of the idea that communities should be well-connected internally. This suggests that nodes that are not in the same community will have much fewer edges passing between them. On a global network perspective, this means that there will be relatively few edges passing between communities. If we could identify these edges passing between the communities, we could cut them and the nodes that remained connected to one another would form strong, well-connected communities. But how do we identify these edges? |
+ | |||
+ | One approach to identifying these edges involves observing that when there are few edges connecting two groups of nodes, all the shortest paths between nodes of one group and the other group must pass along these few edges. This sounds a lot like the idea of betweenness centrality [[Networks_Centrality|betweenness centrality]] in which we found a way of counting the fraction of shortest paths in the network that pass through a given node. Here, of course, we're interested in counting the number of shortest paths that pass through an edge rather than a node, but the idea is still the same. In fact, with only a small modification to the algorithm (which we don't cover here), we can compute the betweenness of every edge in the network. The edge with the highest betweenness has the most shortest paths in the network passing through it - which is precisely the edge we want to cut, since it's probably connecting two large communities of nodes. | ||
+ | |||
+ | By iteratively cutting the edges with the highest betweenness, we will eventually disconnect different well-connected groups of nodes in the network. We can take these different disconnected groups to be different communities and compute the modularity of this community assignment in the original network. | ||
+ | |||
+ | Of course, we can continue removing the highest betweenness edges until we've removed all the edges in the network. Along the way we'll find other sets of well-connected nodes. In this way, the edge betweenness community detection both identifies a number of promising community assignments and has a very dependable stopping condition (when all the edges have been removed). | ||
+ | |||
+ | While betweenness community detection works well on a number of problems, it has several drawbacks. First, because it must compute and recomputed edge betweenness, it is very computationally expensive on networks larger than several thousand nodes. However, it also suffers from the fact that betweenness is defined in terms of shortest paths rather than all paths. The problem with considering shortest paths is that this measure can make some edges look much more important simply because they are on a slightly shorter path between two parts of the network. For this reason, other methods have been investigated - several of which outperform betweenness community detection both in terms of speed and accuracy. | ||
==Label Propagation== | ==Label Propagation== |
Revision as of 07:57, 9 July 2011
Contents
Objectives
- Understand the difference between graph partitioning and community detection
- Survey the different methods available for community detection
Introduction
Algorithms for finding communities have many uses in network science, particularly in the social sciences. Since collective action is such a fundamental aspect of social and political behavior, identifying the groups to which individuals belong is an important task. Here we discuss a variety of methods for approaching this problem.
In network science, community detection amounts to assigning each node a community 'label'. For example, supposing that the network contains three communities, each node would be assigned a label of 1, 2, or 3 - indicating whether it belongs to community 1, 2, or 3. Community 1, then, consists of all the nodes that have been assigned the label '1', the same holds for communities 2 and 3.
Note that, given this setup, it's very easy to invent an arbitrary assignment of nodes to communities: for each node in the network, randomly select a community label (either 1, 2, or 3), and assign that label to the node. Of course, it's unlikely that any arbitrary assignment will be a good assignment, meaning that all the nodes labeled 1 actually correspond to a community in the real system. This means that we'll have to try a lot of different community assignments in order to find a good one - which is the crux of the community detection problem: generating different assignments of nodes to communities, evaluating how good the assignments are, and stopping when we find one we're happy with. As you can imagine, each of these steps is actually a non-trivial problem. Here we're going to look at ways of doing each.
Before delving into the different existing methods, let's first discuss why each of these problems is hard.
- Assessing the quality of a community assignment: despite much research in sociology, psychology, and computer science, we still don't have a definitive, quantitative answer to the question: what is a community. Nonetheless, we need some way of evaluating how good a community assignment is at grouping nodes into real communities. In network science, the focus has been on using the connectivity of the network to evaluate the quality of an assignment, operating on the assumption that in a real community nodes will be more connected than would be expected at random. This is what the modularity measure tries to capture - discussed below.
- Generating different assignments of nodes: as mentioned earlier, it's easy to generate a random assignment. However, we're looking for good assignments. The naive approach to generating different assignments is to simply look at every possible community assignment - if we do this, then we're guaranteed to find the right one. Unfortunately, this is computationally intractable. Consider a network of 100 nodes which we want to separate into 2 communities. There are exactly <math>2^{100}</math> different possible assignments - even if we can test 1000 assignments a second, it will still take more than <math>40,196,936</math> billion years to try all assignments. Clearly, we'll need to come up with a better way of identifying promising community assignments.
- Knowing when to stop looking: since evaluating all possible community assignments is clearly not an option, we not only need to know how to pick promising assignments, but we also need to know when to give up looking for a better assignment. Without a stopping condition, we could look for a long, long time, which would defeat the purpose of being selective about picking community assignments in the first place. Typically the way of generating community assignments and identifying a stopping point go hand-in-hand. They are typically aspects of the community assignment search problem, which is discussed in more detail below.
Modularity
Consider the patterns of connectivity within a community of people. What kinds of structural features might we expect to observe among these individuals? On plausible feature is that there is a higher degree of connectivity (more edges) among these individuals than we would observe for any random collection of people. This is the intuition that modularity captures.
Given an assignment of nodes to communities, modularity will return a value between -1 and 1 indicating whether there is less (< 0) or more (> 0) connectivity among the individuals belonging to the communities that would be expected at random. In community detection, we're seeking an assignment of nodes to communities that maximizes the modularity of the network - since an increase in modularity indicates that we have found labeling that increases the connectivity within the communities.
Searching Community Assignments
As mentioned above, the number of possible community assignments for a network is too huge to permit us to test each one. This means that we have to only test promising community assignments. In order to enable the computer to do this work for us, we have to design algorithms that can quickly identify community assignments. Also mentioned above, these algorithms also need to know when to stop such that it's pretty sure that it's already seen and evaluated the best community assignment. In this section, we'll discuss a number of different methods for doing this bounded search.
Graph Partitioning
The most basic way to do community detection is to (1) tell the computer how many communities it should look for and then (2) for the computer to go find a way of assigning nodes to that number of communities that maximizes the modularity of the network. A number of methods exist that do this including spectral partitioning and the Kernighan-Lin algorithm.
However, this approach has a serious conceptual problem: you must know the number of communities you are looking for. In general, this will not be the case when working with a social network. Furthermore, even if one knows that there are, for example, two large groups present, it still may not be advisable to restrict the algorithm to searching for these two communities only, since more interesting phenomena and social structure might exist on another, finer level.
As a result, while partitioning algorithms exist, other algorithms that can discover arbitrarily many communities are typically more useful. We turn our attention to these in the next sections.
Edge Betweenness
Let's take a moment to evaluate the implications of the idea that communities should be well-connected internally. This suggests that nodes that are not in the same community will have much fewer edges passing between them. On a global network perspective, this means that there will be relatively few edges passing between communities. If we could identify these edges passing between the communities, we could cut them and the nodes that remained connected to one another would form strong, well-connected communities. But how do we identify these edges?
One approach to identifying these edges involves observing that when there are few edges connecting two groups of nodes, all the shortest paths between nodes of one group and the other group must pass along these few edges. This sounds a lot like the idea of betweenness centrality betweenness centrality in which we found a way of counting the fraction of shortest paths in the network that pass through a given node. Here, of course, we're interested in counting the number of shortest paths that pass through an edge rather than a node, but the idea is still the same. In fact, with only a small modification to the algorithm (which we don't cover here), we can compute the betweenness of every edge in the network. The edge with the highest betweenness has the most shortest paths in the network passing through it - which is precisely the edge we want to cut, since it's probably connecting two large communities of nodes.
By iteratively cutting the edges with the highest betweenness, we will eventually disconnect different well-connected groups of nodes in the network. We can take these different disconnected groups to be different communities and compute the modularity of this community assignment in the original network.
Of course, we can continue removing the highest betweenness edges until we've removed all the edges in the network. Along the way we'll find other sets of well-connected nodes. In this way, the edge betweenness community detection both identifies a number of promising community assignments and has a very dependable stopping condition (when all the edges have been removed).
While betweenness community detection works well on a number of problems, it has several drawbacks. First, because it must compute and recomputed edge betweenness, it is very computationally expensive on networks larger than several thousand nodes. However, it also suffers from the fact that betweenness is defined in terms of shortest paths rather than all paths. The problem with considering shortest paths is that this measure can make some edges look much more important simply because they are on a slightly shorter path between two parts of the network. For this reason, other methods have been investigated - several of which outperform betweenness community detection both in terms of speed and accuracy.
Label Propagation
Conclusion
References
Discussion questions
Problems
Glossary
- [[Def: ]]
- [[Def: ]]
- [[Def: ]]