## Difference between revisions of "Networks Basic Statistics"

### From OPOSSEM

Derek Ruths (talk | contribs) (→Local Clustering Coefficient) |
Derek Ruths (talk | contribs) (→Local Clustering Coefficient) |
||

Line 58: | Line 58: | ||

where <math>\mathcal{N}(v)</math> is the set of nodes in node <math>v</math>'s neighborhood. The denominator is twice the number of "V"s (closed or unclosed) in the <math>v</math>'s neighborhood. The numerator equals the number of closed "V"s (triangles) in the node's neighborhood. The factor of <math>2</math> simply reduces the denominator by a factor of two (making it the count of the number of "V"'s). | where <math>\mathcal{N}(v)</math> is the set of nodes in node <math>v</math>'s neighborhood. The denominator is twice the number of "V"s (closed or unclosed) in the <math>v</math>'s neighborhood. The numerator equals the number of closed "V"s (triangles) in the node's neighborhood. The factor of <math>2</math> simply reduces the denominator by a factor of two (making it the count of the number of "V"'s). | ||

+ | As with the global clustering coefficient, <math>C_{lcc}</math> will range between 0 and 1, from no clustering to complete clustering. | ||

<!-- [[File:Clustering_coefficient_example.png]] --> | <!-- [[File:Clustering_coefficient_example.png]] --> |

## Revision as of 07:38, 8 July 2011

## Contents

# Objectives

- Compute the degree distribution for undirected and directed networks
- Quantify the local density of connectivity in the network
- Understand why motif distributions must be estimated in large networks

# Introduction

Fundamentally, a network is a body of data. As with other data sets, our first task it to identify significant features of that data. In much the same way as the mean and standard deviation are easy to compute but very valuable summary statistics, degree, clustering, and motif statistics are easy to compute attributes that can reveal a great deal about the structure of a network. We discuss these different basic statistics here.

# Degree-based Statistics

Given that networks are composed of nodes with relationships, an obvious first question to ask is how these relationships are distributed among the nodes in the network. The number of edges that are incident to a node is called the node's *degree*. In this section we discuss some ways in which the degree of nodes can help us understand the structure of the network and the mechanisms that formed it.

## Degree Density

A very easy-to-compute attribute we can consider is the average degree of nodes in the network: how many edges does each node have? This can be calculated using the ratio <math>\frac{2|E|}{|V|}</math> which indicates the number of edges on average that are incident to a node.

## Degree Distribution

The problem with using average degree is that, as with all statistical distributions, it can be produced by a wide array of distributions. In fact, the distribution of edges can tell us much more than just the average about the distribution of power and resources within the network. Consider, as an example, the campaign contribution network discussed earlier. A network in which all candidates have similar number of contributions (edges) would indicate a very different world from a network in which most contributions (edges) go to a handful of candidates with most receiving very few. Both could have the same average degree, but the former network would indicate that resources are fairly distributed among candidates; the latter would suggest that a handful of candidate receive a disproportionate number of resources. Determining what situation exists in the real world would help us in formulating hypotheses about how the campaign system works.

In order to study this distribution, we could look at the *degree distribution*. The degree distribution, <math>P(k)</math> is the fraction of nodes with degree <math>k</math>. Often the cumulative degree distribution is also considered, which is the fraction of nodes with degree less than or equal to <math>k</math>.

### In and Out Degrees

In directed networks, where edges point from one node to another, there are two distributions: the in-degree <math>P^+(k)</math> which is the fraction of nodes with in-degree of <math>k</math> and the out-degree <math>P^-(k)</math> which is the fraction of nodes with out-degree of <math>k</math>.

### Scale-free Distributions

One distribution which has been observed in a wide array of networks is called the scale-free distribution. This distribution follows a power-law, <math>P(k) \approx k^{-\gamma}</math>. On a qualitative level, in a network with such a distribution, the majority of the nodes have very low degree, but with a small, but non-trivial number of nodes have very high degree. The high degree nodes are called 'hubs' and are often found to hold positions of power in the network proportional to their degree.

# Clustering Coefficient

Consider, for a moment, two different ways your own personal friend circle could look. Assuming that you have <math>N</math> friends in your circle, on one extreme, all of your friends could know one another. On the other extreme none of your friends know one another. From a network perspective the former situation would have all your friends connected to each other by edges; in the latter situation you would be like a hub, connected to each friend by and edge, but none of those friends being connected to one another. When all an individual's friends (more generally when a node's neighbors) are connected to one another, we call that part of the network highly *clustered*.

Clustering can have very serious implications in social systems. Consider how different your birthday party would be if all your friends knew one another versus if nobody knew each other. In the latter situation, it would be hard to even get people to come due to a complete lack of social cohesion! This playful example is no different from clustering in more serious situations involving political actors. A well connected neighborhood creates a basis for social and political action, a pillar of political science. For this reason, being able to detect and quantify the degree of clustering in a network is a useful tool and can offer profound insights into the behavior we might expect from a social system. Here we examine several ways of measuring clustering in a network.

## Global Clustering Coefficient

Often when we just begin our analysis of a network, we don't have a specific part of the network in mind and would like to determine some global attributes. The global clustering coefficient quantifies the overall amount of clustering in the network.

Before delving into the formal definition of the global clustering coefficient, let's think about what such a measure should tell us. Going back to the example we started with, your friend circle, consider the simplest situation where you have just two friends. By definition they are connected to you by edges. However, they may or may not be connected to each other. Intuitively clustering must be higher in the network if two nodes that share a mutual friend (in this case, you) are also connected.

Consider that two node,<math>v_i</math> and <math>v_j</math>, that share a mutual neighbor, <math>v_k</math>, produce a very specific shape in the network: a "V". To see this, observe that for <math>v_i</math> and <math>v_j</math> to share a neighbor, edges <math>v_i-v_k</math> and <math>v_j-v_k</math> must both exist.

If that part of the network is clustered, then a third edge, <math>v_i-v_j</math>, must exist. This turns the "V" shape into a closed triangle - often referred to as *triadic closure*. This concept will occur in our other definitions of clustering as well.

Thus, one way of quantifying the amount of clustering in the network is to compute the fraction of all "V" shapes that are actually closed triangles. We can formally expres this as

<math>C_{gcc} = \frac{3 \times \mbox{number of triangles}}{\mbox{number of connected triples}}</math>

Note that the term *connected triples* is simply another way of describing the "V" shape since for three nodes, (*triple*) the "V" is the minimum connectivity required to make the nodes connected.

The value of <math>C_{gcc}</math> will vary between 0 (unclustered) and 1 (totally clustered) making clustering scores comparable across different network data sets. Such normalized measures are very useful when comparing networks constructed for the same system (e.g., the US senate) for different time periods - permitting the ability to see whether clustering is increasing, decreasing, or staying the same.

## Local Clustering Coefficient

Now let's turn our attention from the network as a whole to a single node. How can we measure the clustering of that one node's neighborhood? This maps directly back to the original motivation for clustering - understanding how tightly connected a person's friend circle (or, more formally, a node's neighborhood) is.

To measure this we can easily adapt our global clustering coefficient to this situation. Recall that the global clustering coefficient looked at the ratio of connected triples that were closed (were triangles). We can do exactly the same thing here, except rather than counting "V"s and triangles in the entire network, we'll consider only the one node's neighborhood.

<math>C_{lcc}(v) = \frac{2\times \mbox{triangles containing node } v}{|\mathcal{N}(v)||\mathcal{N}(v)-1|}</math>

where <math>\mathcal{N}(v)</math> is the set of nodes in node <math>v</math>'s neighborhood. The denominator is twice the number of "V"s (closed or unclosed) in the <math>v</math>'s neighborhood. The numerator equals the number of closed "V"s (triangles) in the node's neighborhood. The factor of <math>2</math> simply reduces the denominator by a factor of two (making it the count of the number of "V"'s).

As with the global clustering coefficient, <math>C_{lcc}</math> will range between 0 and 1, from no clustering to complete clustering.

## Network Average Clustering Coefficient

# Motif Distributions

# Conclusion

# References

# Discussion questions

# Problems

# Glossary

- [[Def: ]]
- [[Def: ]]
- [[Def: ]]