In this notebook we demonstrate a few of the `Python` ecosystem tools that enable **research** in areas that can be difficult to tackle using traditional, fit-for-purpose tools such as `Stata`.

The agility of a full programming language environment allows for a high degree of flexibility, and the Python ecosystem provides a vast toolkit to remain productive.

- The Product Space Network (Hidalgo, 2007)
- Quick introduction to Networks and Graphs
- Replicate Product Space Proximity Measure
- Compute Revealed Comparative Advantage and Proximity ($\phi$) and make this code run fast (**Tools: Pandas, NumPy, Numba, Dask**)
- (Extension) Building networks and plotting Product Space network diagrams, albeit not as fancy (**Tools: NetworkX**)


**Oil (3330)** has a large world export share, but is not strongly co-exported (i.e. connected in the network) with any other products (other than LNG).

**Machinery, Electronics, Garments** are all sectors that have a high degree of co-export potential with other related products and form part of a densely connected core of the network.

**Developing economies** typically occupy products in the **weakly connected periphery** of the network, and **new products** tend to emerge close to existing products in the network (established from analysis using the Product Space network). Middle-income countries manage to diffuse into the densely connected core of the product space.


Interest in studying networks is increasing within **Economics** with recent publications building network type features into their models, or using network analysis to uncover structural features of data that may otherwise go unexplored.

Many people who have interacted with tools from network analysis have done so via the idea of Social Network Analysis (SNA).

`A Graph is a way of specifying relationships among a collection of items`

They consist of a collection of **nodes** (or vertices) that are joined together by **edges**.
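As a minimal sketch of these concepts using `NetworkX` (the product names here are purely illustrative):

```python
import networkx as nx

# Build a small undirected graph: nodes joined by weighted edges
G = nx.Graph()
G.add_nodes_from(["shirts", "pants", "cars"])
G.add_edge("shirts", "pants", weight=0.8)  # strongly related products
G.add_edge("shirts", "cars", weight=0.1)   # weakly related products

print(G.number_of_nodes())  # 3
print(G.number_of_edges())  # 2
print(G["shirts"]["pants"]["weight"])  # 0.8
```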


One early example of Social Network Analysis was conducted by Zachary (1977) who set out to use network analysis to explain factional dynamics and to understand **fission in small groups.** A network of **friendship** was used to understand and identify how this Karate group eventually split due to an initial conflict between two members.

- **Nodes:** individuals
- **Edges:** connections were added between two individuals if they were consistently observed to interact outside the normal activities of the club.
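The Zachary karate club network ships with `NetworkX`, so this example can be explored directly:

```python
import networkx as nx

# Zachary's karate club network is bundled with NetworkX
G = nx.karate_club_graph()
print(G.number_of_nodes())  # 34 members
print(G.number_of_edges())  # 78 friendship ties

# Each node records which faction the member joined after the split
clubs = set(nx.get_node_attributes(G, "club").values())
print(clubs)  # the two factions: 'Mr. Hi' and 'Officer'
```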


The structure of these relationships can be exploited to uncover new insights into the data:

- Communities (through Clustering)
- Identification of main actors in Social Networks (Centrality Metrics)
- Identifying indirect relationships through shortest / longest paths
- Diffusion characteristics on temporal networks (such as disease transmission modeling)
- ... + many other applications across many different sciences

One visualization (Cao, 2013) demonstrates how algorithmic analysis can reveal meaningful structure, clearly identifying the roles played by certain individuals, based only on simple relational information about friendship between pairs of individuals.


Let's focus on an application of network analysis to international trade data, replicating some of the results contained in the Hidalgo (2007) paper and later in The Atlas of Complexity and The Observatory of Economic Complexity.

The Hidalgo (2007) paper is used as a **motivating example** to demonstrate various tools that are available in the `Python` ecosystem.

In this setting we characterise International Trade data by considering:

- **Nodes:** products
- **Edges:** the likelihood of two products being co-exported

**Assumption:** If products are highly co-exported across countries, then the products are *revealed* to be more likely to share similar factors of production (or capabilities) required to produce them. For example, Shirts and Pants require a set of similar skills that lend themselves to be co-exported, while shirts and cars are much more dissimilar.

This relational information between products can be represented by edge weights.

A **high value** means they have a **high likelihood of being co-exported**


We now want to compute the edge weights to explore the full product space network derived from product level international trade data by computing the proximity matrix:

$$\phi_{ij} = \min\{P(\text{RCA}_i \geq 1 \mid \text{RCA}_j \geq 1),\; P(\text{RCA}_j \geq 1 \mid \text{RCA}_i \geq 1)\}$$

**Proximity:** A **high** proximity value suggests any two products are exported by a similar set of countries.

The **tasks** involve:

- Compute Revealed Comparative Advantage and Proximity ($\phi$) and make this code run fast (**Tools: Pandas, NumPy, Numba, Dask**)
- Building networks and plotting Product Space network diagrams, *albeit not as fancy* (**Tools: NetworkX**)

---


International Trade Data is largely available in SITC and HS product classification systems.

In this notebook we will focus on SITC revision 2 Level 4 data with `786` defined products.

Classification | Level | Products |
---|---|---|
SITC | 4 | 786 |
HS | 6 | 5016 |

**Note:** We use `SITC` data in this seminar, but as you can see, performance of code becomes even more important when working with fully disaggregated `HS` international trade data.


**Hint:** There is a method named `unique()`, so you should get the array of years and then call `.unique()`.
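A minimal sketch of this hint, assuming the trade data sits in a `DataFrame` with a `year` column (the toy data below is hypothetical):

```python
import pandas as pd

# Hypothetical toy trade data with repeated years
data = pd.DataFrame({
    "year": [1962, 1962, 1963, 1963, 1964],
    "country": ["AUS", "USA", "AUS", "USA", "AUS"],
    "export": [10.0, 20.0, 12.0, 22.0, 14.0],
})

years = data["year"].unique()  # array of distinct years
print(years)  # [1962 1963 1964]
```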


The literature uses the standard Balassa definition for Revealed Comparative Advantage

$$RCA_{cpt} = \frac{x_{cpt} \big/ \sum_{p} x_{cpt}}{\sum_{c} x_{cpt} \big/ \sum_{c}\sum_{p} x_{cpt}}$$

where,

- $x_{cpt}$ is the value of exports of product $p$ by country $c$ in year $t$

**Reference:** Balassa, B. (1965), Trade Liberalisation and Revealed Comparative Advantage, The Manchester School, 33, 99-123.

To compute **RCA** we need to aggregate data at difference levels to obtain each component of the fraction defined above.

Let's break the equation down to figure out what needs to be computed:

- $x_{cpt}$: exports of product $p$ by country $c$ at time $t$ (the raw data)
- $\sum_{p} x_{cpt}$: total exports of country $c$ at time $t$
- $\sum_{c} x_{cpt}$: total world exports of product $p$ at time $t$
- $\sum_{c}\sum_{p} x_{cpt}$: total world exports at time $t$


This gives us a `pandas.DataFrame` that is indexed by a `MultiIndex`. This can be very useful, but we would like this data aligned with the original data table for each product exported at time $t$ by each country. Using this new object, we could either:

- `merge` the data back into the original `DataFrame`, or
- use `transform` to obtain an object that has the same shape as the original `DataFrame`.
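A sketch of the `transform` approach; the column names (`year`, `country`, `productcode`, `export`) and toy values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical single-year trade data
data = pd.DataFrame({
    "year": [2000] * 4,
    "country": ["AUS", "AUS", "USA", "USA"],
    "productcode": ["3330", "8421", "3330", "8421"],
    "export": [100.0, 50.0, 20.0, 80.0],
})

# Each component of the Balassa fraction, broadcast back to the original shape
country_total = data.groupby(["year", "country"])["export"].transform("sum")
product_total = data.groupby(["year", "productcode"])["export"].transform("sum")
world_total = data.groupby("year")["export"].transform("sum")

data["rca"] = (data["export"] / country_total) / (product_total / world_total)
```

Because `transform` returns a `Series` aligned with the original index, no `merge` is needed before computing the final ratio.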

Now that the components of the equation have been computed, we can simply calculate $RCA_{cpt}$ as expressed by the original fraction.


$RCA_{cpt} \geq 1$ is where country $c$ has a revealed comparative advantage in product $p$ at time $t$.

Therefore we can define the matrix $M_{cp}$:

$$M_{cp} = \begin{cases} 1 & \text{if } RCA_{cp} \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

We can first construct $RCA$ matrices and then compute $M_{cp}$ using a conditional map
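A minimal sketch of the conditional map, assuming `rca` is a countries × products matrix (the values here are hypothetical):

```python
import numpy as np

# Hypothetical RCA matrix: rows = countries, columns = products
rca = np.array([
    [1.5, 0.2, 3.0],
    [0.8, 1.1, 0.0],
])

# Mcp = 1 where RCA >= 1, else 0
mcp = np.where(rca >= 1, 1, 0)
print(mcp)
# [[1 0 1]
#  [0 1 0]]
```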


Can we use `rca` to compute the `mcp` matrix?

**Proximity:** A **high** proximity value suggests any two products are exported by a similar set of countries.

$$\phi_{ij} = \min\{P(\text{RCA}_i \geq 1 \mid \text{RCA}_j \geq 1),\; P(\text{RCA}_j \geq 1 \mid \text{RCA}_i \geq 1)\}$$

The minimum **conditional probability of coexport** can be computed from the $M_{cp}$ matrix:

$$\phi_{ij} = \frac{\sum_{c} M_{ci} M_{cj}}{\max(k_i, k_j)}$$

where,

- $k_p = \sum_{c} M_{cp}$ is the number of countries that export product $p$ with revealed comparative advantage

The proximity ($\phi$) matrix is therefore computed through all pairwise combinations of column vectors, which is computationally intensive.

Hidalgo (2007) suggests that 32% of proximity values are < 0.1 and 65% are < 0.2.
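A naive reference sketch of the pairwise computation, assuming `mcp` is a countries × products binary matrix (the toy matrix is hypothetical):

```python
import numpy as np

def proximity_naive(mcp):
    """Pairwise min conditional probability of co-export (slow reference version)."""
    n_products = mcp.shape[1]
    kp = mcp.sum(axis=0)  # countries exporting each product with RCA >= 1
    phi = np.zeros((n_products, n_products))
    for i in range(n_products):
        for j in range(n_products):
            coexport = (mcp[:, i] * mcp[:, j]).sum()
            denom = max(kp[i], kp[j])
            phi[i, j] = coexport / denom if denom > 0 else 0.0
    return phi

mcp = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 1]])
phi = proximity_naive(mcp)
```

The double loop over product pairs is what makes this implementation scale poorly as the number of products grows.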


At **~1 minute** per year, this takes a reasonably long time to compute; computing all 50 years would take around an hour, which makes working with this data in an agile way problematic. While this was easy to implement, it isn't very fast!

Let's **profile** this code to get an understanding where we spend most of our time

For this to run you will need to install `line_profiler` by running:

`conda install line_profiler`

Most of the time you will want to conduct **numerical** type computing in NumPy.

The code actually looks pretty similar: the main difference is conducting operations on pure NumPy arrays.
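A vectorized sketch, assuming `mcp` is a binary NumPy array with countries in rows and products in columns:

```python
import numpy as np

def proximity_numpy(mcp):
    """Vectorized proximity: one matrix product replaces the pairwise loops."""
    kp = mcp.sum(axis=0)              # countries exporting each product
    coexport = mcp.T @ mcp            # pairwise co-export counts
    denom = np.maximum.outer(kp, kp)  # max(k_i, k_j) for every pair
    return np.divide(coexport, denom,
                     out=np.zeros_like(coexport, dtype=float),
                     where=denom > 0)

mcp = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 1]])
phi = proximity_numpy(mcp)
```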


**Numba** is a package you can use to accelerate your code using a technique called **just-in-time (JIT)** compilation. It converts your high-level Python code to low-level LLVM code so that it runs closer to the raw machine level.

`nopython=True` ensures that `jit` compiles without any `python` objects. If it cannot achieve this it will throw an error.

**Numba** now supports a large part of the `NumPy` API, which can be checked in the Numba documentation.


Now that we have a fast single-year computation, we can compute all cross-sections serially using a loop.

Alternatively, we can parallelize these operations using `Dask` to delay computation and then ask the `Dask` scheduler to coordinate the computation over the number of **cores** available to you. This is particularly useful when using `HS` data.

**Note:** This simple approach to parallelization does have some overhead to coordinate the computations so you won't get a full 4 x speed up when using a 4-core machine.
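A sketch of the per-year parallelization with `dask.delayed` (assuming `dask` is installed; `compute_proximity` and the random yearly matrices are hypothetical stand-ins):

```python
import dask
import numpy as np

def compute_proximity(mcp):
    """Hypothetical single-year proximity computation (stand-in for the fast version)."""
    kp = mcp.sum(axis=0)
    denom = np.maximum.outer(kp, kp)
    co = mcp.T @ mcp
    return np.divide(co, denom, out=np.zeros_like(co, dtype=float),
                     where=denom > 0)

# One Mcp matrix per year; here two tiny random examples
rng = np.random.default_rng(0)
mcp_by_year = {y: (rng.random((5, 4)) > 0.5).astype(float) for y in (2000, 2001)}

# Build delayed tasks, then let the scheduler run them across cores
tasks = {y: dask.delayed(compute_proximity)(m) for y, m in mcp_by_year.items()}
results = dask.compute(tasks)[0]  # dict of year -> proximity matrix
```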


**Note:** Dask does a lot more than this and is worth looking into for medium- to large-scale computations.


For **SITC** data (786 products, 229 countries, 52 years):

Function | Time/Year | Total Time | Speedup |
---|---|---|---|
pandas | 220 seconds | ~177 minutes | - |
pandas_symmetric | 104 seconds | ~84 minutes | BASE |
numpy | 2.5 seconds | 120 seconds | ~41x |
numba | 124 milliseconds | 6 seconds | ~800x |
numba + dask | N/A | 5 seconds | - |

For **HS** data (5016 products, 222 countries, 20 years):

Function | Time/Year | Total Time | Speedup |
---|---|---|---|
pandas | 1 hour 25 minutes | - | - |
pandas_symmetric | 43 minutes | - | BASE |
numpy | 1 min 37 seconds | - | ~28x |
numba | 5 seconds | 1 min 45 seconds | ~516x |
numba + dask | N/A | 45 seconds | - |

These were run on the following **machine:**

Item | Details |
---|---|
Processor | Xeon E5 @ 3.6GHz |
Cores | 8 |
RAM | 32GB |
Python | Python 3.6 |

---

Here we will use `NetworkX` to construct our version of the Product Space using Python.


We now want to construct a maximum spanning tree and then add in all edges that are connected above a threshold value of `0.5`.
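A sketch of this construction with `NetworkX`; the toy graph and its proximity weights are hypothetical:

```python
import networkx as nx

# Toy proximity graph: nodes are products, weights are proximity values
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 0.9), ("B", "C", 0.3),
    ("A", "C", 0.6), ("C", "D", 0.7), ("B", "D", 0.2),
])

# Backbone: the maximum spanning tree keeps every node connected
mst = nx.maximum_spanning_tree(G, weight="weight")

# Add back any strong edges above the threshold
network = nx.Graph(mst)
threshold = 0.5
for u, v, d in G.edges(data=True):
    if d["weight"] > threshold:
        network.add_edge(u, v, **d)
```

Starting from the spanning tree guarantees the final network has no isolated nodes, even for products with uniformly low proximity.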


**Note:** The file is saved locally to view the network in greater detail.


1. Zachary, W. (1977), "An Information Flow Model for Conflict and Fission in Small Groups", Journal of Anthropological Research, Vol. 33, No. 4 (Winter, 1977), pp. 452-473.

2. Cao, X., Wang, X., Jin, D., Cao, Y. & He, D. (2013), "Identifying overlapping communities as well as hubs and outliers via nonnegative matrix factorization", Scientific Reports, Vol. 3, Article 2993.

3. Hidalgo, C.A., Klinger, B., Barabasi, A.-L. & Hausmann, R. (2007), "The Product Space Conditions the Development of Nations", Science, Vol. 317, pp. 482-487.

4. Atlas of Complexity (http://atlas.cid.harvard.edu/)

5. The Observatory of Economic Complexity (http://atlas.media.mit.edu/en/)

6. Atlas of Complexity grid points for nodes sourced from http://www.michelecoscia.com/?page_id=223

7. Balassa, B. (1965), "Trade Liberalisation and Revealed Comparative Advantage", The Manchester School, 33, 99-123.