Advanced Data Analysis using Python

Matthew McKay

In this notebook we demonstrate a few tools from the Python ecosystem that support research which can be difficult to carry out with traditional fit-for-purpose tools such as Stata.

The agility of a full programming language environment allows for a high degree of flexibility and the Python ecosystem provides a vast toolkit to remain productive.

Table of Contents

  • The Product Space Network (Hidalgo, 2007)
  • Quick introduction to Networks and Graphs
  • Replicate Product Space Proximity Measure

Atlas of Complexity Product Space Map

Some Initial Observations

Oil (3330) has a large world export share, but it is not strongly co-exported (i.e. connected in the network) with any other product except LNG.

Machinery, electronics, and garments are sectors with a high degree of co-export potential with related products, and they form part of the densely connected core of the network.

Analysis using the Product Space network established that developing economies typically occupy products in the weakly connected periphery of the network, and that new products tend to emerge close to existing products in the network. Middle-income countries manage to diffuse into the densely connected core of the product space.

Network Analysis

Interest in studying networks is increasing within Economics with recent publications building network type features into their models, or using network analysis to uncover structural features of data that may otherwise go unexplored.

What is a Network (Graph)?

Many people who have interacted with tools from network analysis have done so via the idea of Social Network Analysis (SNA).

A graph is a way of specifying relationships among a collection of items.

A graph consists of a collection of nodes (or vertices) that are joined together by edges.
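
As a minimal sketch of this definition, the snippet below builds a small undirected graph with NetworkX. The node labels and edges are hypothetical and chosen only for illustration.

```python
import networkx as nx

# Hypothetical toy graph: five items (nodes) and five relationships (edges).
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

nodes = sorted(G.nodes())                 # the collection of items
n_edges = G.number_of_edges()             # the relationships among them
neighbours_of_a = sorted(G.neighbors("A"))  # nodes joined to "A" by an edge

print(nodes, n_edges, neighbours_of_a)
```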

You can use network metrics to learn more about the structure. What is the most central node?
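
One way to answer this question is with centrality metrics. The sketch below uses a hypothetical five-node graph and two of NetworkX's built-in measures; which measure is appropriate depends on the question being asked.

```python
import networkx as nx

# Hypothetical toy graph used only for illustration.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

# Degree centrality: share of the other nodes each node is directly connected to.
degree = nx.degree_centrality(G)

# Betweenness centrality: share of shortest paths that pass through each node.
betweenness = nx.betweenness_centrality(G)

most_central_by_degree = max(degree, key=degree.get)
most_central_by_betweenness = max(betweenness, key=betweenness.get)
print(most_central_by_degree, most_central_by_betweenness)
```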

What can we learn from Networks?

Social Network Example: Karate Club (Zachary, 1977)

One early example of Social Network Analysis was conducted by Zachary (1977), who set out to use network analysis to explain factional dynamics and to understand fission in small groups. A friendship network was used to understand and identify how this karate club eventually split following an initial conflict between two members.

  • Nodes: Individuals
  • Edges: Connections were added between two individuals if they were consistently observed to interact outside the normal activities of the club.
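
NetworkX ships a copy of this dataset, so the network can be inspected directly. The sketch below only loads the graph and reads basic properties; the `club` node attribute is how NetworkX records the faction each member joined after the split.

```python
import networkx as nx

# Zachary's karate club network as bundled with NetworkX.
G = nx.karate_club_graph()

n_members = G.number_of_nodes()   # individuals
n_ties = G.number_of_edges()      # observed interactions outside the club

# Each node carries a "club" attribute naming the faction it joined.
factions = {data["club"] for _, data in G.nodes(data=True)}
print(n_members, n_ties, factions)
```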

We can learn things by considering the structure of these networks

The structure of these relationships can be exploited to uncover new insights into the data:

  • Communities (through Clustering)
  • Identification of main actors in Social Networks (Centrality Metrics)
  • Identifying indirect relationships through shortest / longest paths
  • Diffusion characteristics on temporal networks (such as disease transmission modeling)
  • ... + many other applications across many different sciences
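
A few of the analyses listed above can be sketched on the karate club network. The choice of greedy modularity maximisation for communities and betweenness for centrality is an assumption made for illustration; other algorithms could equally be used.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()

# Communities through clustering (greedy modularity maximisation).
communities = greedy_modularity_communities(G)
n_assigned = sum(len(c) for c in communities)

# Main actors through a centrality metric (betweenness).
betweenness = nx.betweenness_centrality(G)
main_actor = max(betweenness, key=betweenness.get)

# Indirect relationships through a shortest path between two members.
path = nx.shortest_path(G, source=0, target=33)
print(len(communities), main_actor, path)
```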

One visualization (Cao, 2013) demonstrates how algorithmic analysis of simple relational information, here friendship between pairs of individuals, can reveal meaningful structure that clearly identifies the roles played by certain individuals.

Replicating the Product Space Network using International Trade Data (Hidalgo, 2007)

Let's focus on an application of network analysis to international trade data, replicating some of the results in the Hidalgo (2007) paper and later in The Atlas of Complexity and The Observatory of Economic Complexity.

The Hidalgo (2007) paper is used as a motivating example to demonstrate various tools that are available in the Python ecosystem.

In this setting we want to look at a characterisation of international trade data by considering:

  • Nodes: Products
  • Edges: the likelihood of two products being co-exported

Assumption: If products are highly co-exported across countries, then they are revealed to be more likely to share the factors of production (or capabilities) required to produce them. For example, shirts and pants require a similar set of skills, which lends them to being co-exported, while shirts and cars are much more dissimilar.

This relational information between products can be represented by edge weights.

A high value means two products have a high likelihood of being co-exported.

Let's work with a Toy Example with 8 products

Scale Up to Full set of Products (SITC R2 L4)

We now want to compute the edge weights to explore the full product space network derived from product level international trade data by computing the proximity matrix:

@@0@@

Proximity: A high proximity value suggests any two products are exported by a similar set of countries.

The tasks involve:

--------------------

Computing Proximity

Data

International Trade Data is largely available in SITC and HS product classification systems.

In this notebook we will focus on SITC revision 2 Level 4 data with 786 defined products.

| Classification | Level | Products |
|----------------|-------|----------|
| SITC           | 4     | 786      |
| HS             | 6     | 5016     |

Note:

We use SITC data in this seminar but, as you can see, code performance becomes even more important when working with fully disaggregated HS international trade data.

Question 1: What years are available in this dataset?

Hint: There is a method named unique(), so you should get the array of years and then call .unique()
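
To illustrate the hint, here is how `unique()` behaves on a small stand-in table. The DataFrame and its column names (`year`, `country`, `productcode`, `export`) are hypothetical; the real dataset may use different names.

```python
import pandas as pd

# Hypothetical stand-in for the trade dataset; column names are assumed.
data = pd.DataFrame({
    "year": [1998, 1998, 1999, 2000, 2000],
    "country": ["AUS", "USA", "AUS", "USA", "CHN"],
    "productcode": ["3330", "8411", "3330", "8411", "8411"],
    "export": [120.0, 80.0, 0.0, 95.0, 40.0],
})

# Select the column of years, then keep each distinct value once.
years = data["year"].unique()
print(years)
```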

Question 2: How many non-zero trade flow values are in this dataset?

Question 3: What countries are available in this dataset?

Computing Revealed Comparative Advantage

The literature uses the standard Balassa definition of Revealed Comparative Advantage (RCA):

@@0@@

where,

  • @@1@@

Reference: Balassa, B. (1965), Trade Liberalisation and Revealed Comparative Advantage, The Manchester School, 33, 99-123.

To compute RCA we need to aggregate the data at different levels to obtain each component of the fraction defined above.

Let's break the equation down to figure out what needs to be computed:

@@0@@

This gives us a pandas.DataFrame that is indexed by a multi-index object. This can be very useful but we would like to use this data in the original data table for each product exported at time t by each country. We could use this new object and:

  • merge the data back into the original data DataFrame
  • use transform to request an object that is of the same shape as the original data DataFrame.
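
The second option can be sketched as follows. The Balassa index compares a product's share in a country's exports with that product's share in world exports, so three aggregates are needed. The toy data and the column names (`year`, `country`, `productcode`, `export`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical two-country, two-product world in a single year.
data = pd.DataFrame({
    "year": [1998, 1998, 1998, 1998],
    "country": ["AUS", "AUS", "USA", "USA"],
    "productcode": ["wheat", "cars", "wheat", "cars"],
    "export": [80.0, 20.0, 20.0, 80.0],
})

# transform("sum") returns a Series aligned with the original rows,
# so no merge back into the original table is required.
country_total = data.groupby(["year", "country"])["export"].transform("sum")
product_total = data.groupby(["year", "productcode"])["export"].transform("sum")
world_total = data.groupby("year")["export"].transform("sum")

# (product share in country exports) / (product share in world exports)
data["rca"] = (data["export"] / country_total) / (product_total / world_total)
print(data)
```

Here AUS exports wheat at 80% of its basket while wheat is 50% of world trade, giving an RCA of 1.6.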

Now that the components of the equation have been computed, we can simply calculate @@0@@ as expressed by the original fraction.

Computing @@0@@ Matrix: Who Exports What Products and When?

@@1@@ is where country @@2@@ has a revealed comparative advantage in product @@3@@ at time @@4@@

Therefore we can define the matrix @@5@@:

@@6@@

We can first construct @@7@@ matrices and then compute @@8@@ using a conditional map
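
A minimal sketch of this step for a single year is shown below. The column names, the toy values, and the threshold `RCA >= 1` are assumptions (some references use a strict inequality), as is the treatment of missing country-product pairs.

```python
import pandas as pd

# Hypothetical long-format RCA values for one year; column names are assumed.
rca = pd.DataFrame({
    "country": ["AUS", "AUS", "USA", "USA", "CHN"],
    "productcode": ["wheat", "cars", "wheat", "cars", "cars"],
    "rca": [1.6, 0.4, 0.7, 1.2, 1.0],
})

# Countries as rows, products as columns.
rca_wide = rca.pivot(index="country", columns="productcode", values="rca")

# Assumption: a missing country-product pair is treated as "no RCA".
# Conditional map: 1 where the country has RCA in the product, 0 otherwise.
mcp = (rca_wide.fillna(0.0) >= 1.0).astype(int)
print(mcp)
```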

Question: What is the key assumption implied by the above code?

Question 6: What products did Australia ("AUS") export with RCA in 1998?

Computing Proximity Matrix @@0@@

Proximity: A high proximity value suggests any two products are exported by a similar set of countries.

@@1@@

The minimum conditional probability of coexport can be computed:

@@2@@

where,

  • @@3@@

The @@4@@ matrix is therefore computed through all pairwise combinations of column vectors which is computationally intensive.
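
The pairwise computation can be sketched on a toy binary matrix. This is not the notebook's implementation; it relies on the observation that the minimum of the two conditional probabilities equals the number of countries exporting both products divided by the larger of the two products' exporter counts. All names and data are hypothetical.

```python
import pandas as pd

# Hypothetical binary matrix: rows are countries, columns are products,
# 1 means the country exports the product with RCA.
mcp = pd.DataFrame(
    {"shirts": [1, 1, 1, 0], "pants": [1, 1, 0, 0], "cars": [0, 0, 1, 1]},
    index=["A", "B", "C", "D"],
)

products = mcp.columns
ubiquity = mcp.sum(axis=0)  # number of countries exporting each product
proximity = pd.DataFrame(0.0, index=products, columns=products)

# Loop over all ordered pairs of product columns (diagonal left at 0 here).
for p1 in products:
    for p2 in products:
        if p1 == p2:
            continue
        coexports = (mcp[p1] * mcp[p2]).sum()  # countries exporting both
        denom = max(ubiquity[p1], ubiquity[p2])
        if denom > 0:
            proximity.loc[p1, p2] = coexports / denom

print(proximity)
```

With 786 products this double loop visits over 600,000 pairs, which is why the approach becomes slow at full scale.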

Step 1: Compute Proximity Matrix using Pandas

Check the Data (simple stats and visualizations)

Hidalgo (2007) suggests that 32% of values are < 0.1 and 65% of values are < 0.2

But Wait - Problem!

At ~1 minute, this takes a reasonably long time to compute for a single year. This makes working with the data in an agile way problematic, and computing 50 years would take about an hour. While this was easy to implement, it isn't very fast!

Let's profile this code to understand where we spend most of our time.

For this line to run you will need to install line_profiler by running:

    conda install line_profiler

Step 2: Consider other Python Tools (NumPy)

Most of the time you will want to conduct numerical type computing in NumPy.

The code actually looks pretty similar; the main difference is that the operations are conducted on pure NumPy arrays.
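
As an alternative to looping over array columns, the same quantity can be written with matrix operations. This vectorised formulation is a sketch and not necessarily the version used in the notebook; the toy matrix is hypothetical.

```python
import numpy as np

# Hypothetical binary matrix: rows are countries, columns are products.
mcp = np.array(
    [[1, 1, 0],
     [1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]],
    dtype=float,
)

# Entry (i, j) counts the countries that export both product i and product j.
coexports = mcp.T @ mcp

# Number of exporters per product, and the pairwise maximum of those counts.
ubiquity = mcp.sum(axis=0)
max_ubiquity = np.maximum.outer(ubiquity, ubiquity)

# Divide only where the denominator is non-zero.
proximity = np.divide(
    coexports, max_ubiquity,
    out=np.zeros_like(coexports), where=max_ubiquity > 0,
)
np.fill_diagonal(proximity, 0.0)  # self-proximity is not of interest here
print(proximity)
```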

Step 3: Just in Time Compilation (Numba)

Numba is a package you can use to accelerate your code using a technique called just-in-time (JIT) compilation. It translates your high-level Python code to LLVM intermediate representation, which is then compiled to machine code.

nopython=True ensures the JIT compiles the function without falling back to Python objects. If it cannot achieve this, it raises an error.

Numba now supports much of the NumPy API; the supported features are listed in the Numba documentation.
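
A minimal sketch of a JIT-compiled proximity function is given below, using explicit loops, which is the style Numba compiles well. The function name and toy data are hypothetical, and the `try`/`except` fallback is included only so the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # njit is shorthand for jit(nopython=True)
except ImportError:
    # Fallback for illustration only: run the plain Python function.
    def njit(func):
        return func


@njit
def proximity_matrix(mcp):
    n_countries, n_products = mcp.shape

    # Number of countries exporting each product.
    ubiquity = np.zeros(n_products)
    for p in range(n_products):
        for c in range(n_countries):
            ubiquity[p] += mcp[c, p]

    # The matrix is symmetric, so only the upper triangle is computed.
    proximity = np.zeros((n_products, n_products))
    for i in range(n_products):
        for j in range(i + 1, n_products):
            coexports = 0.0
            for c in range(n_countries):
                coexports += mcp[c, i] * mcp[c, j]
            denom = max(ubiquity[i], ubiquity[j])
            if denom > 0:
                proximity[i, j] = coexports / denom
                proximity[j, i] = proximity[i, j]
    return proximity


mcp = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]], dtype=np.float64)
proximity = proximity_matrix(mcp)
print(proximity)
```

The first call includes compilation time, so timings are more representative from the second call onwards.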

Computing All Years

Using Dask to Compute all Years in Parallel

NOTE: THIS WON'T WORK ON DEMO DOCKER ENVIRONMENT

Now that we have a fast single year computation, we can compute all cross-sections serially using a loop.

Alternatively, we can parallelize these operations using Dask to delay computation and then ask the Dask scheduler to coordinate the computation over the number of cores available to you. This is particularly useful when using HS data.

Note: This simple approach to parallelization has some overhead for coordinating the computations, so you won't get a full 4x speedup on a 4-core machine.
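
The pattern described above can be sketched with `dask.delayed`: each year's computation is wrapped as a lazy task, and the scheduler runs the tasks when `dask.compute` is called. The per-year function, the random data, and the year range are hypothetical stand-ins.

```python
import dask
import numpy as np


def proximity_for_year(mcp):
    """Proximity matrix for one cross-section (vectorised sketch)."""
    coexports = mcp.T @ mcp
    ubiquity = mcp.sum(axis=0)
    denom = np.maximum.outer(ubiquity, ubiquity)
    proximity = np.divide(
        coexports, denom, out=np.zeros_like(coexports), where=denom > 0
    )
    np.fill_diagonal(proximity, 0.0)
    return proximity


# Hypothetical data: one random binary country-by-product matrix per year.
rng = np.random.default_rng(0)
mcp_by_year = {
    year: (rng.random((20, 6)) > 0.5).astype(float)
    for year in range(1998, 2002)
}

# Build the task graph; nothing is computed yet.
tasks = {
    year: dask.delayed(proximity_for_year)(mcp)
    for year, mcp in mcp_by_year.items()
}

# Ask the scheduler to run all tasks, then pair results with their years.
values = dask.compute(*tasks.values())
results = dict(zip(tasks.keys(), values))
print(sorted(results))
```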

Note: Dask does a lot more than this and is worth looking into for medium to large scale computations.

Performance Comparison (SITC and HS Data)

For SITC Data: (786 Products, 229 Countries, 52 Years)

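| Function         | Time/Year        | Total Time   | Speedup |
|------------------|------------------|--------------|---------|
| pandas           | 220 seconds      | ~177 minutes | -       |
| pandas_symmetric | 104 seconds      | ~84 minutes  | BASE    |
| numpy            | 2.5 seconds      | 120 seconds  | ~41x    |
| numba            | 124 milliseconds | 6 seconds    | ~800x   |
| numba + dask     | N/A              | 5 seconds    | -       |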

For HS Data: (5016 Products, 222 Countries, 20 Years)

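| Function         | Time/Year         | Total Time       | Speedup |
|------------------|-------------------|------------------|---------|
| pandas           | 1 hour 25 minutes | -                | -       |
| pandas_symmetric | 43 minutes        | -                | BASE    |
| numpy            | 1 min 37 seconds  | -                | ~28x    |
| numba            | 5 seconds         | 1 min 45 seconds | ~516x   |
| numba + dask     | N/A               | 45 seconds       | -       |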

These were run on the following machine:

| Item      | Details           |
|-----------|-------------------|
| Processor | Xeon E5 @ 3.6 GHz |
| Cores     | 8                 |
| RAM       | 32 GB             |
| Python    | Python 3.6        |

------------------

(Extension) Preparing Graph Data: Product Space Network

Here we will use NetworkX to construct our version of the Product Space using Python.

Use pandas to construct an edge list
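
One way to do this, sketched on a hypothetical 3-product proximity matrix, is to reshape the square matrix into long format and keep one row per unordered pair. The column names `source`, `target`, and `weight` are assumptions.

```python
import pandas as pd

# Hypothetical symmetric proximity matrix.
products = ["shirts", "pants", "cars"]
proximity = pd.DataFrame(
    [[0.0, 2 / 3, 1 / 3],
     [2 / 3, 0.0, 0.0],
     [1 / 3, 0.0, 0.0]],
    index=products, columns=products,
)

# Square matrix -> long format with one row per (source, target) pair.
edges = (
    proximity.rename_axis("source")
    .reset_index()
    .melt(id_vars="source", var_name="target", value_name="weight")
)

# Keep each unordered pair once and drop pairs with zero proximity.
edges = edges[(edges["source"] < edges["target"]) & (edges["weight"] > 0)]
edges = edges.reset_index(drop=True)
print(edges)
```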

We would like to construct the maximum_spanning_tree, but the version of networkx used here only supports minimum_spanning_tree, so we need to add inv_weight for this computation.
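
The trick can be sketched as follows. How `inv_weight` is defined in the notebook is not shown, so `1 - weight` is an assumption here; any strictly decreasing transform of the weights yields the same tree, because a spanning tree algorithm depends only on the ordering of the edge weights. The edge list is hypothetical.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list with proximity as the edge weight.
edges = pd.DataFrame({
    "source": ["a", "a", "b", "b", "c"],
    "target": ["b", "c", "c", "d", "d"],
    "weight": [0.9, 0.2, 0.6, 0.3, 0.7],
})

# Assumed definition: reverse the ordering so strong links become "cheap".
edges["inv_weight"] = 1.0 - edges["weight"]

G = nx.from_pandas_edgelist(
    edges, "source", "target", edge_attr=["weight", "inv_weight"]
)

# Minimising inv_weight is equivalent to maximising weight.
tree = nx.minimum_spanning_tree(G, weight="inv_weight")
tree_edges = {frozenset(edge) for edge in tree.edges()}
print(sorted(tuple(sorted(e)) for e in tree_edges))
```

More recent NetworkX releases also provide `maximum_spanning_tree`, which can be called on `weight` directly.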

Network Tools

We now want to construct a maximum_spanning_tree and then add in all edges whose weight is above a threshold value of 0.5.
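
A minimal sketch of this construction on a hypothetical four-node network is shown below. The spanning tree keeps every node connected, and the threshold step then restores the strong links the tree left out. The `inv_weight = 1 - weight` transform is an assumption.

```python
import networkx as nx

THRESHOLD = 0.5

# Hypothetical weighted edges (proximity values).
weighted_edges = [
    ("a", "b", 0.9), ("a", "c", 0.2), ("b", "c", 0.6),
    ("b", "d", 0.3), ("c", "d", 0.7), ("a", "d", 0.55),
]

G = nx.Graph()
for u, v, w in weighted_edges:
    G.add_edge(u, v, weight=w, inv_weight=1.0 - w)

# Step 1: maximum spanning tree via the minimum spanning tree of inv_weight.
network = nx.minimum_spanning_tree(G, weight="inv_weight")

# Step 2: add back every edge whose proximity exceeds the threshold.
strong_edges = [
    (u, v, d) for u, v, d in G.edges(data=True) if d["weight"] > THRESHOLD
]
network.add_edges_from(strong_edges)

print(sorted(tuple(sorted(e)) for e in network.edges()))
```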

Visualizations

Note: The file is saved locally so that the network can be viewed in greater detail.

References

1 Zachary, W. (1977), "An Information Flow Model for Conflict and Fission in Small Groups", Journal of Anthropological Research, Vol. 33, No. 4 (Winter, 1977), pp. 452-473

2 Cao, X., Wang, X., Jin, D., Cao, Y. & He, D. (2013), "Identifying overlapping communities as well as hubs and outliers via nonnegative matrix factorization", Scientific Reports, Vol 3, Issue 2993

3 Hidalgo, C.A., Klinger, B., Barabasi, A.-L., Hausmann, R. (2007), "The Product Space Conditions the Development of Nations", Science, Vol 317, pp 482-487

4 Atlas of Complexity (http://atlas.cid.harvard.edu/)

5 The Observatory of Economic Complexity (http://atlas.media.mit.edu/en/)

6 Atlas of Complexity Grid Points for Nodes sourced from http://www.michelecoscia.com/?page_id=223

7 Balassa, B. (1965), "Trade Liberalisation and Revealed Comparative Advantage", The Manchester School, 33, 99-123.