Actions

Difference between revisions of "Networks Community Detection"

From OPOSSEM

(Edge Betweenness)
Line 9: Line 9:
 
Algorithms for finding communities have many uses in network science, particularly in the social sciences.  Since collective action is such a fundamental aspect of social and political behavior, identifying the groups to which individuals belong is an important task.  Here we discuss a variety of methods for approaching this problem.
 
Algorithms for finding communities have many uses in network science, particularly in the social sciences.  Since collective action is such a fundamental aspect of social and political behavior, identifying the groups to which individuals belong is an important task.  Here we discuss a variety of methods for approaching this problem.
  
From the outset, we are faced with the problem of articulating a good, quantitative definition of what a community is.  Since the network primarily encodes the structure of the community, this is where we focus our attention.  ''Modularity'', discussed in greater detail next, is the operational definition of what a good community is and, therefore, serves as the basis for nearly all community detection algorithms.
+
In network science, community detection amounts to assigning each node a community 'label'.  For example, supposing that the network contains three communities, each node would be assigned a label of 1, 2, or 3 - indicating whether it belongs to community 1, 2, or 3.  Community 1, then, consists of all the nodes that have been assigned the label '1', the same holds for communities 2 and 3.
 +
 
 +
Note that, given this setup, it's very easy to invent an arbitrary assignment of nodes to communities: for each node in the network, randomly select a community label (either 1, 2, or 3), and assign that label to the node.  Of course, it's unlikely that any arbitrary assignment will be a ''good'' assignment, meaning that all the nodes labeled 1 actually correspond to a community in the real system.  This means that we'll have to try a lot of different community assignments in order to find a good one - which is the crux of the community detection problem: generating different assignments of nodes to communities, evaluating how good the assignments are, and stopping when we find one we're happy with.  As you can imagine, each of these steps is actually a non-trivial problem.  Here we're going to look at ways of doing each.
 +
 
 +
Before delving into the different existing methods, let's first discuss why each of these problems is hard.
 +
* ''Assessing the quality of a community assignment'':
 +
* ''Generating different assignments of nodes'': as mentioned earlier, it's easy to generate a random assignment.  However, we're looking for ''good'' assignments.  The naive approach to generating different assignments is to simply look at every possible community assignment - if we do this, then we're guaranteed to find the right one.  Unfortunately, this is computationally intractable.  Consider a network of 100 nodes which we want to separate into 2 communities.  There are exactly <math>2^{100}</math> different possible assignments - even if we can test 1000 assignments a second, it will still take more than <math>40,196,936</math> billion years to try all assignments.  Clearly, we'll need to come up with a better way of identifying promising community assignments.
 +
 
 +
In order to detect communities, we are faced with two fundamental challenges.  First, we must articulate a quantitative definition of what a community is.  Since the network primarily encodes the structure of the community, this is where we focus our attention.  ''Modularity'', discussed in greater detail next, is the operational definition of what a good community is and, therefore, serves as the basis for nearly all community detection algorithms.
  
 
=Modularity=
 
=Modularity=

Revision as of 07:15, 9 July 2011


Objectives

  • Understand the difference between graph partitioning and community detection
  • Survey the different methods available for community detection

Introduction

Algorithms for finding communities have many uses in network science, particularly in the social sciences. Since collective action is such a fundamental aspect of social and political behavior, identifying the groups to which individuals belong is an important task. Here we discuss a variety of methods for approaching this problem.

In network science, community detection amounts to assigning each node a community 'label'. For example, supposing that the network contains three communities, each node would be assigned a label of 1, 2, or 3 - indicating whether it belongs to community 1, 2, or 3. Community 1, then, consists of all the nodes that have been assigned the label '1', the same holds for communities 2 and 3.

Note that, given this setup, it's very easy to invent an arbitrary assignment of nodes to communities: for each node in the network, randomly select a community label (either 1, 2, or 3), and assign that label to the node. Of course, it's unlikely that any arbitrary assignment will be a good assignment, meaning that all the nodes labeled 1 actually correspond to a community in the real system. This means that we'll have to try a lot of different community assignments in order to find a good one - which is the crux of the community detection problem: generating different assignments of nodes to communities, evaluating how good the assignments are, and stopping when we find one we're happy with. As you can imagine, each of these steps is actually a non-trivial problem. Here we're going to look at ways of doing each.

Before delving into the different existing methods, let's first discuss why each of these problems is hard.

  • Assessing the quality of a community assignment:
  • Generating different assignments of nodes: as mentioned earlier, it's easy to generate a random assignment. However, we're looking for good assignments. The naive approach to generating different assignments is to simply look at every possible community assignment - if we do this, then we're guaranteed to find the right one. Unfortunately, this is computationally intractable. Consider a network of 100 nodes which we want to separate into 2 communities. There are exactly <math>2^{100}</math> different possible assignments - even if we can test 1000 assignments a second, it will still take more than <math>40,196,936</math> billion years to try all assignments. Clearly, we'll need to come up with a better way of identifying promising community assignments.

In order to detect communities, we are faced with two fundamental challenges. First, we must articulate a quantitative definition of what a community is. Since the network primarily encodes the structure of the community, this is where we focus our attention. Modularity, discussed in greater detail next, is the operational definition of what a good community is and, therefore, serves as the basis for nearly all community detection algorithms.

Modularity

Consider the patterns of connectivity within a community of people. What kinds of structural features might we expect to observe among these individuals? On plausible feature is that there is a higher degree of connectivity (more edges) among these individuals than we would observe for any random collection of people. This is the intuition that modularity captures.

Given an assignment of nodes to communities, modularity will return a value between -1 and 1 indicating whether there is less (< 0) or more (> 0) connectivity among the individuals belonging to the communities that would be expected at random. In community detection, we're seeking an assignment of nodes to communities that maximizes the modularity of the network - since an increase in modularity indicates that we have found labeling that increases the connectivity within the communities.

Graph Partitioning

The most basic way to do community detection is to (1) tell the computer how many communities it should look for and then (2) for the computer to go find a way of assigning nodes to that number of communities that maximizes the modularity of the network. A number of methods exist that do this including spectral partitioning and the Kernighan-Lin algorithm.

However, this approach has a serious conceptual problem: you must know the number of communities you are looking for. In general, this will not be the case when working with a social network. Furthermore, even if one knows that there are, for example, two large groups present, it still may not be advisable to restrict the algorithm to searching for these two communities only, since more interesting phenomena and social structure might exist on another, finer level.

As a result, while partitioning algorithms exist, other algorithms that can discover arbitrarily many communities are typically more useful. We turn our attention to these now.

Edge Betweenness

Recall that modularity tells us how good a particular division of the network into different communities is. What modularity cannot tell us, however, is whether we can improve that particular assignment of nodes to communities.

Despite having modularity optimization as our overall strategy for finding communities, we still must devise a way of identifying plausible assignments of nodes to communities.

This is non-obvious and, for networks of any appreciable size, trying all possible community assignments is prohibitively expensive in terms of computing time. This means that we need to devise a way of iteratively dividing nodes into communities to search for a good assignment.

One approach ... use of betweenness centrality.

Label Propagation

Conclusion

References

<references group=""></references>

Discussion questions

Problems

Glossary

  • [[Def: ]]
  • [[Def: ]]
  • [[Def: ]]