## Networks Introduction

### From OPOSSEM

## Contents

# Objectives

- Define what a network is
- Define the different kinds of networks
- Give examples of networks that arise in Political Science
- Survey different techniques involved in network analysis

# Introduction

In recent years network science has enjoyed significant and rising attention from a diverse array of fields including biology, economics, physics, sociology, ecology, and political science. While the subject matter of these fields can be very different, each discipline must deal with interacting groups of entities which, when formalized, we call a *network*. Biology deals with genes, proteins, and cells which come together to produce life itself; economics considers the transactions which give rise to social and economic systems; sociology investigates the fundamental questions of how people form and maintain relationships with one another and how these give rise to communities; and political science considers the wide array of questions that arise when a group of individuals create and interact with a form of governance.

In political science, networks play a number of important roles. For instance, political parties in the United States are not monolithic organizations, but rather a number of independent organizations with connections between them. The ways that these connections are organized can have an important impact on how the parties behave.
<ref>Kroger, Gregory; Noel, Hans (2009). "Partisan Webs: Information Exchange and Party Networks". *British Journal of Political Science* **39**: 633–653.</ref>
Non-governmental organizations are also embedded in complex networks, and the connections between them can affect their access to important resources like information.
<ref>Keck, Margaret; Sikkink, Kathryn (1998). *Activists Beyond Borders: Advocacy Networks in International Politics*. Cornell University Press.</ref>

There are two things shared in common by the examples above. First, in each instance, there is structure to the way in which the entities interacted. Second, the way in which the entities are interacting influences the phenomena of interest. These are the major components of network science: the structure is the network; we analyze the structure to better understand the phenomena it produces.

# What is a network?

A network is a mathematical way of expressing the relationships present within a set of entities as well as the attributes of those entities and relationships. Network science encapsulates the mathematical, computational, and statistical machinery by which we can analyze a network.

Before delving into the formalisms, let's consider several examples of networks that occur in political networks.

- Congressional committee membership:
**TODO** - Campaign contribution:
**TODO** - PACs:
**TODO**

## Nodes and Edges

A network consists of entities, which we call *nodes*, and the relationships among these entities, which we call *edges*. If analyzing the friendship network in the US senate, then each node will be a senator and each edge will be a friendship between two senators.

Formally, we express a network as <math>N = (V,E)</math> where the nodes are <math>V = \{v_1,v_2,...,v_m\}</math> and the edges are <math>E = \{e_1,e_2,...,e_n\}</math>. Since an edge connects two vertices, edges are often expressed as their endpoints. For example, if a edge <math>e_1</math> connects nodes <math>n_1</math> and <math>n_2</math>, then we could write <math>e_1 = (v_1,v_2)</math>.

It's important to keep in mind that nodes and edges are abstract entities - they can represent anything you choose. Thus, if studying the flow of money through PACs, you could construct a network where nodes are PACs and edge <math>(v_i,v_j)</math> indicates that money moved directly between PACs <math>v_i</math> and <math>v_j</math>.

### Attributes

Going back to our example above with the friendship network in the US senate, it's clear that if we just represent senators as nodes and friendships as edges, we're losing a lot of information. For example, we might want to capture the age and seniority of each senator, we might be interested in expressing how strong each friendship is or the age of the friendship. All of these features can be included in the network as node and edge attributes. Once they are incorporated, we can use these attributes are part of our analysis.

When building a network, there are no rules as to what kind of attributes can or cannot be used. Numbers, dates, and even unstructured text can be assigned to node and edge attributes. However, as in most areas of statistics, mathematics, and computer science, quantitative analysis requires numbers - so these are the kinds of attributes we will focus on here. For this reason, it is common practice to distill more expressive features into numerical values: for example, birthdates are converted into numerical ages, blocks of text can be scored by their mood, tone, relevance, etc...

There are several attributes that are commonly used in network analysis.

- Edge weights: an edge's weight, typically depicted as <math>w(e)</math> is a real number that expresses the strength or importance of the edge. The weight is important in a variety of analyses. Consider, for example, trying to identify friend circles in the US senate: it would be helpful to know which friendships were stronger in order to prioritize which individuals should be put into the same friend circles (stronger friends should generally go into the same friend circle).
- Node and edge types: often a network will consist of several kinds of entities or relationships. Consider the network of interactions among US senators (assuming such a network could be built). We might want to capture the different kinds of interactions present in the network. This would involve having a type attribute for each edge indicating whether it was a face-to-face meeting, a phone call, or an email.

## Directionality in networks

Consider the structural difference between a network of friendships among of senators vs. a network of emails among senators. In the friendship network, an edge simply indicates that the two senators it connects are friends. Whereas in the email network, we want an edge to represent one senator sending an email to a second senator. In this situation we want the edge to encode the sender-receiver relationship, a nuance that is not present in the friendship network.

In order to do this, we need to introduce the idea of edge *directionality*. If an edge is directed, it indicates that something is moving from the first node (called the *source*) to the second node (called the *target*). In the case of the email network, the edge <math>(v_1,v_2)</math> indicates that an email has moved from node <math>v_1</math> to <math>v_2</math> (this is sometimes written <math>v_1 \rightarrow v_2</math>). Since in the email network every edge indicates that an email, every edge will be directed. We call such a network *directed*. A network in which none of the edges are directed is called *undirected*. These two types, directed and undirected, are the most common types of networks used in network analysis.

Let's take a moment to examine how undirected vs. directed edges are notationally different. In an undirected graph, each edge represents a relationship in which there is no directionality: I can say "Bob is a friend of Sally" or "Sally is a friend of Bob". The order doesn't matter. Not surprisingly, then, we can formally write that edge as either *(Bob,Sally)* or *(Sally,Bob)* - the ordering of the node pair doesn't matter.

Directed edges, on the other hand, are sensitive to the ordering of the nodes. <math>(v_1,v_2)</math> means something different from <math>(v_2,v_1)</math>. In the first case, we are saying that something is moving from <math>v_1</math> to <math>v_2</math>; in the other case we're saying the opposite, that something is moving from <math>v_2</math> to <math>v_1</math>.

## Weighted networks

If we consider our email network again, there's another modification we can make to it which will allow it to express more of the underlying information. Consider that, so far, edge <math>(v_1,v_2)</math> represents an email from <math>v_1</math> to <math>v_2</math>. But what if <math>v_1</math> sends another email to <math>v_2</math>? We could introduce another edge to represent this email, but over time this would explode the number of edges in our network. Instead, we can use the edge weight property to express the number of emails that have been sent between two senators. So if <math>v_1</math> has sent 5 emails to <math>v_2</math>, then <math>w( (v_1,v_2) ) = 5</math>. We'll see that taking this approach can be used by a variety of analytical techniques.

Such a network in which the edge weight is being used is called a *weighted network*. The email network, then, would be called a *weighted, directed network*.

Undirected networks can also be weighted. Consider the friendship network of US senators. If we wanted to use the number of times two individuals go to lunch as a proxy for the strength of their friendship, we could encode this lunch count as the weight of the edge connecting them.

In later sections, we'll see a number of techniques for using these weights to analyze patterns of connectivity among entities in our network.

## Specialized Network Types

While most networks you will encounter will be either directed or undirected, following the formalisms described above, there are a number of more specialized network types. Because these can crop up from time to time, it is worth being aware of these network types.

### Multigraphs

Up to this point we have only considered networks in which each edge is uniquely identified by its endpoints. This means that in an undirected network there can be only one edge between <math>v_i</math> and <math>v_j</math>; in a directed network, there can be at most one edge <math>v_i \rightarrow v_j</math> and one edge from <math>v_j \rightarrow v_i</math>.

This was fine when working with friendships, since there can only be one friendship between two people, or emails, since we could keep a count of emails as part of the edge weight. However, in some circumstances, we may want to actually permit multiple edges to exist between the same endpoints in the network. Suppose, for example, we were expressing the interaction network among US senators by putting both phone call and emails as edges in the network. In situations where two senators had interacted using both, we would have multiple edges leading from <math>v_i</math> to <math>v_j</math>. A network in which such a situation arises is called a *multigraph*.

In general, such networks are harder to work with: fewer analytical methods exist that can handle them and their overall complexity is much higher. As a result, in most situations researchers will favor collapsing the parallel edges of the multigraph by redefining the meaning of an edge. For example, in the above example, parallel edges could occur when senators interacted using both phone and email. By redefining the edge as indicating *any* interaction, however, we can collapse the parallel edges into a single edge, indicating that an interaction of some sort occurred. To preserve interaction type or intensity information, we could set the edge weight to be the total number of phone and email interactions between those individuals.

Admittedly, there will be some circumstances where multigraphs cannot be collapsed without losing critical information. In such a situation, you should simply be aware that finding the appropriate tools could be difficult. In some cases, they may not exist at all and you will have to develop them yourself.

### Bipartite networks

Suppose we were interested in looking at the network structure of individual campaign contribution data. In such data, an citizen can give money to a candidate's committee. The directed network underlying such a process is clear: nodes are citizens and campaign committees, edges are individual contributions.

However, notice that there are two special features of the structure of such a network. First, there are exactly two types of nodes: citizens and committees. Furthermore, these types don't overlap: a citizen cannot be a committee. Second, an edge always connects nodes of different types: a citizen node to a committee node (for our purposes committee's can't give to one another and individuals can't give to one another either).

The conditions above define a bipartite network. The fact that nodes of a given type never are connected to nodes of the same type creates a very special structure shown in this figure.

In the figure, nodes of type U are citizens and type V are committees. Because edges only pass between the classes, we can visually separate them as shown. Realize that this cannot be done for other networks we've discussed - even if they contain exactly two types of nodes. Bipartite networks have two classes of nodes such that relationships are exclusively between classes.

### Hypergraphs

Consider the email network we've discussed previously. To this point we've only considered emails from one person to another. However, consider that it's possible for one person to specify multiple recipients for a given email message. In such a situation, we must consider how such an email should be entered into a network model.

Up to this point we have only one choice. Since an edge has two end points, we must add an individual edge from the sender to each recipient in the list. For some analysis, this will be sufficient. However, notice that by representing the multiple recipient email as several independent edges, we cannot distinguish one email to several people from several separate emails, one to each of the recipients. Sometimes this isn't important and we won't worry about the loss of information. However, in some situations, such omission can be problematic. In other situations, however, it can be useful to have such information.

Consider the problem of identifying groups of people who often work together. In such a situation, it would be useful to know whether a set of individuals had been emailed together on the same email or emailed separately - sharing an email correspondance would definitely strongly suggest working together in some capacity.

In order to permit this, however, we have to expand our definition of an edge from having only two endpoints to have an arbitrary number of endpoints. An edge which contains more than two endpoints is called a *hyperedge*. And a graph containing one or more hyperedges is called a *hypergraph*. All other notions such as edge weight and node attributes remain the same.

The diagram below depicts an undirected hypergraph.

Notice that different edges contain different numbers of endpoints. The diagram itself seems much more complicated than one of the more standard networks we've considered before. Not surprisingly, hypergraphs are quite difficult to work with both conceptually and technically. Most network analysis algorithms we will consider become extremely expensive to use on hypergraphs, making them undesirable to work with.

However, despite the difficulties, hypergraphs are powerful abstractions of a number of processes, particular in those domains where groups of individuals are involved (a group is quite naturally represented by a hyperedge). While it can be hard to work with hypergraphs, it can be helpful to try using them since modeling group processes is hard to approximate without the use of hyperedges.

## Networks vs. graphs

For individuals who have had some exposure to graph theory, the nomenclature of network science can be somewhat confusing. What is the difference between a network and a graph? This is a topic of some debate. However, what most practitioners of network science would agree is that graphs are abstract topological structures whereas networks are models of real-world processes. It's true that networks share all the theoretical features of graphs. However, networks are, themselves, organically produced by real-world processes.

From a practical stand point, the distinction isn't important. In you are coming to network science from graph theory, any distinction between the term network and graph is largely semantic.

# Overview of Network Analysis

Thus far we have concerned ourselves only with how to represent data as a network. The reason we do this, however, is to analyze this data. The rest of the Networks module is concerned with enabling this analysis. Here we briefly introduce some of the major analytical tools available which will be discussed in later modules.

## Basic Network Statistics

Given a group of individuals, companies, or political committees, some of the most immediate questions have to do with simply quantifying what the connectivity within the group looks like. Questions along these lines might include:

- How many relationships does each entity have on average?
- Do all entities have about the same number of relationships or is there a wide distribution?
- Do entities tend to cluster together forming subgroups or are they fairly well-connected across the group?

## Centrality

Often we would like to identify particularly important or essential entities in the network. This question is called *centrality* and can be defined in a number of different ways.

## Community Detection

Understanding the organization of a group of individuals involves recognizing the subgroups that are working together. These subgroups, called *communities* in network science, often can be approximated by looking for sets of individuals who are quite closely connected to one another. A number of algorithms exist for finding communities.

## Visualizing Networks

While network visualization is a very appealing part of the analysis process, it is largely not helpful, particularly in the early phases of analysis. Network data tends to be very complex and does not lend itself to visualization. As a result, one must have a good understanding of when visualization can be useful in order to be efficient and effective at network analysis.

# Conclusion

# References

# Discussion questions

# Problems

# Glossary

- [[Def: ]]
- [[Def: ]]
- [[Def: ]]