EE-558 A Network Tour of Data Science, EPFL
Joël M. Fonseca, Nelson Antunes, Enguerrand Granoux and Hedi Driss
The aim of this project is to understand terrorist attacks from different perspectives. The data scientist's toolkit provides a panel of useful methods that allow us to highlight the main characteristics of the data we are investigating. The project structure is defined as follows:
I. Data Acquisition
In this part, the goal is to understand the data. What information do we have? How is it represented? Can missing values be recovered without making wrong assumptions? Which variables are relevant for this project? These are the kinds of questions we try to answer.
II. Data Exploration:
Here, we need to bring the data alive, to let it express itself. We present a map that dynamically illustrates the evolution of all the terrorist attacks available in the database. We also plot some basic facts, like the most affected countries and the most used types of weapon. Finally, we propose alternative roads that can be further developed, like the time series.
III. Data Exploitation:
In this section, we use the power of mathematics to represent the data at a more abstract level, in order to find the hidden structure of the attacks. In particular, Principal Component Analysis (PCA) is used for a specific terrorist group to highlight patterns of similarity. Then, a spectral embedding with the Laplacian Eigenmaps algorithm is applied to compare the two techniques. In a second part, we will remove the group labels and see whether we can recover them with different clustering algorithms.
Finally, the conclusions drawn from the results found in the two previous sections are summarized.
In this project, we will work on the Global Terrorism Database, an open-source database containing information about more than 170,000 terrorist attacks around the world from 1970 through 2016, provided by the START consortium. It can be downloaded here. Make sure you download it before running this notebook.
Let's first of all make the data usable for our analysis.
We will first analyze the database to understand its content, and then restrict ourselves to the most significant features in order to carry out our task in the best possible way.
We see that there are 135 different columns. We won't be using all of them. A complete description of each column can be found here. After some consideration, we made a first selection of the most important features we will consider. Furthermore, we renamed all the columns of interest.
Let's look at each of the columns where there is missing information and see what we can do in order to complete it, if possible:
City: if we know the coordinates we can retrieve the name of the city.
Longitude: if we know the city and the country, then we can retrieve the approximate geographic coordinates.
Wounded: we cannot correctly retrieve this information without making wrong assumptions.
Motive: these fields are impossible to infer from the data we have, so we leave them as is.
The only field worth completing for the rest of our analysis is the one for the latitude and longitude coordinates. Indeed, as we will be plotting a world map with all the attacks, it makes sense to be as complete as possible. Knowing the city won't influence our further analysis. For the two last fields, as discussed, we can't reliably retrieve this information.
To retrieve the latitude and longitude coordinates, the Google Maps Geocoding API seems to be the most appropriate tool. In order to minimize the number of requests, we first create a mask identifying which coordinates can be recovered.
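The recovery step can be sketched as follows. The column names (`city`, `country`, `latitude`, `longitude`) and the `geocode` helper are illustrative assumptions, not the notebook's actual code, and a real API key is needed before any request is made:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd

# Toy rows with assumed column names mirroring the renamed dataset.
df = pd.DataFrame({
    'city':      ['Kabul', 'Unknown', 'Baghdad'],
    'country':   ['Afghanistan', 'Iraq', 'Iraq'],
    'latitude':  [34.53, None, None],
    'longitude': [69.17, None, None],
})

# Mask: coordinates are missing but a usable city name is available.
recoverable = df['latitude'].isna() & (df['city'] != 'Unknown')

def geocode(city, country, api_key):
    """Query the Google Maps Geocoding API for one 'city, country' address."""
    url = ('https://maps.googleapis.com/maps/api/geocode/json?'
           + urlencode({'address': f'{city}, {country}', 'key': api_key}))
    resp = json.load(urlopen(url))
    if resp['status'] == 'OK':
        loc = resp['results'][0]['geometry']['location']
        return loc['lat'], loc['lng']
    return None, None

print(recoverable.sum())  # number of rows we can try to recover -> 1 here
```

Only the rows flagged by the mask are sent to the API, which keeps the number of (billable) requests minimal.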
We see that 2503 attacks are concerned. Let's see how many we can recover.
We have been able to retrieve about 1600 coordinates (64%) using this API. Although it is provided by Google, we see that there are still some cities that are not correctly recognized. It was still worth the try.
As we have the coordinates for all the attacks of each year, we can dynamically represent the activity of terrorism on a world map. To do so, we will create an animated plot using the Matplotlib Animation and Basemap libraries.
Essentially, the markers represent the attacks for a specific year. Besides its geographic coordinates, each marker $i$ has a corresponding size $s_i$ and color $c_i$ defined as follows:
where $k_i$ corresponds to the number of people killed in the attack represented by marker $i$, and $\tilde{k}$, $k_{\min}$ and $k_{\max}$ correspond respectively to the median, the minimum and the maximum number of people killed for the specific year.
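One plausible normalization in this spirit can be sketched as below; the scaling constants and the exact formulas are assumptions for illustration, not necessarily those used in the notebook:

```python
import numpy as np

# Toy kill counts for the attacks of one year.
kills = np.array([0, 2, 5, 50, 3000])

k_med, k_min, k_max = np.median(kills), kills.min(), kills.max()

# Marker size: grows with the kill count relative to the yearly median
# (the +1 avoids division by zero when the median is 0).
sizes = 10 + 40 * kills / (k_med + 1)

# Marker color in [0, 1]: min-max normalization of the kill count
# within the year (the small epsilon avoids division by zero).
colors = (kills - k_min) / (k_max - k_min + 1e-9)
```

Normalizing per year rather than globally keeps extreme events (such as 9/11) from flattening the color scale of every other year.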
These two indicators combined will give us a much more comprehensive situation of the gravity of the attacks based on the number of killed people.
This visualization allows us to grasp very well the geographic context of the terrorist attacks. One can realize the degree of severity of the 9/11 attacks and perceive the intense activity in the Middle East over the last few years in particular.
We can also of course look at a particular group and see more precisely the location where it attacked or even the distribution of the attack types. The two following plots show the idea for the
We see a clear increase in the number of attacks over the last 5 years. The following plots will respectively give us information about:
As they are very self-explanatory, we decided not to comment on each of these plots explicitly. All of them give a good overview of the situation and the context in which terrorist attacks have evolved over the last decades.
As we also have the date of each data point, one could try some forecasting, or merge this series with other time series to see whether one affects the other. Let's see how we can construct such a time series for informational purposes.
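A monthly time series of attack counts can be built along these lines; the `year` and `month` column names are assumptions standing in for the renamed dataset:

```python
import pandas as pd

# Toy records with assumed 'year' and 'month' column names.
attacks = pd.DataFrame({
    'year':  [2008, 2008, 2009, 2009, 2009, 2010],
    'month': [1, 3, 1, 1, 7, 12],
})

# Assemble a proper datetime, count attacks per month,
# and fill months with no recorded attack with 0.
attacks['date'] = pd.to_datetime(attacks[['year', 'month']].assign(day=1))
ts = attacks.groupby('date').size().asfreq('MS', fill_value=0)

print(ts.loc['2009-01-01'])  # 2 attacks in January 2009
```

The zero-filled months matter: a forecasting model fed only the non-empty months would see a biased, irregularly sampled series.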
The plot above shows the activity of the Taliban from 2008 to 2011.
We will perform a Principal Component Analysis (PCA) in order to represent the attack datapoints in a low-dimensional space that we can comprehend.
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 10 most active terrorist groups. We then tried with only one group to see if we obtained a better representation, and it was actually the case. Hence, in the following cells, we analyze all the attacks of the Taliban group only.
Another problem that needs to be handled is that we are dealing with both numerical and categorical variables. In order to convert the latter without making wrong assumptions, we will use the one-hot encoding already provided by the pandas library. In words, it is equivalent to representing a categorical variable as a vector of $n$ dimensions, where $n$ is the number of categories available for the said variable. Then, all vectors representing a category are orthogonal to each other, avoiding hazardous mathematical assumptions.
It is also always better to normalize all our data, to give all the features the same importance.
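The encoding and normalization steps described above, followed by the projection onto two principal components, can be sketched with `pandas.get_dummies` and scikit-learn; the toy table and its column names are assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy attack table with assumed column names.
df = pd.DataFrame({
    'killed':      [0, 2, 15, 1, 30, 4],
    'wounded':     [1, 0, 40, 2, 12, 7],
    'attack_type': ['Bombing', 'Assault', 'Bombing',
                    'Assassination', 'Bombing', 'Assault'],
})

# One-hot encode the categorical column: one orthogonal 0/1 vector per category.
X = pd.get_dummies(df, columns=['attack_type'])

# Standardize so every feature weighs the same, then project to 2 components.
X_std = StandardScaler().fit_transform(X)
coords = PCA(n_components=2).fit_transform(X_std)

print(coords.shape)  # (6, 2): one 2-D point per attack
```

Without the standardization step, the raw casualty counts would dominate the 0/1 dummy columns and the principal components would mostly reflect them.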
We see that the number of killed people increases as the second component decreases. For the number of wounded people, it depends on both components, but the dependence is more difficult to explain. Hence, we can say that the second component sufficiently explains the number of killed people.
The PCA for the different types of data shows interesting facts. For the first two plots, we are able to distinguish the clusters representing each of the categories. In particular, we notice the following results:
Bombing/Explosion is very well clustered. It is also apart from the remaining datapoints. The Armed Assault is also apart, although it shares some characteristics with the Assassination. A portion of the Unarmed Assault has its own distinctive location. The remaining categories are grouped in the same fashion, but with some dense clustering in places.
Chemical are the most significant clusters.
target_type plot: Here we see that PCA fails to explain the target categories, except possibly that, above a specific threshold, the second component clearly identifies the
Coupled with the two previous plots, we can add to our analysis that firearms and explosives are the most murderous weapons. We can also say that there is no obvious type of weapon or attack that corresponds to a particular target, except for the Educational Institution type.
In this manner, we just demonstrated that PCA allows us to represent high-dimensional, inter-correlated information in a much lower dimension while preserving patterns of similarity. But when we want to do some clustering, a more suitable approach is the Laplacian Eigenmaps algorithm. Indeed, where PCA is generally used for linear dimensionality reduction, Laplacian Eigenmaps is a spectral embedding for non-linear dimensionality reduction, well suited to clustering representation. Let's see what we get.
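A minimal sketch of this step uses scikit-learn's `SpectralEmbedding`, which implements Laplacian Eigenmaps: a nearest-neighbour graph is built over the points and the embedding is given by eigenvectors of its graph Laplacian. The toy data below stands in for the preprocessed attack features:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
# Two toy clusters standing in for the encoded, normalized attacks.
X = np.vstack([rng.normal(0, 0.5, (30, 5)),
               rng.normal(4, 0.5, (30, 5))])

# Laplacian Eigenmaps: k-nearest-neighbour affinity graph, then embed
# with the bottom non-trivial eigenvectors of the graph Laplacian.
emb = SpectralEmbedding(n_components=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0).fit_transform(X)

print(emb.shape)  # (60, 2)
```

Because the embedding only sees the neighbourhood graph, well-separated groups end up far apart in the embedded space, which is what makes it attractive for clustering visualizations.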
First, notice that the numbers of killed and wounded people are dealt with in the same way in both cases, i.e., the datapoints with high values lie approximately in the center. Remember that for the PCA, the two cases were not treated the exact same way.
Then, with this method, we see a more organized representation. We can clearly distinguish three straight lines, each representing a category, with some condensed areas where these three lines join. With the first two plots, we can clearly see a link, as the variables they represent are, in a way, related: if we look at the Explosives/Bombs/Dynamite weapon type, we see without surprise that it was used in Bombing/Explosion attacks, but also in Assassination cases and some Armed Assault raids. For the last plot, we saw with PCA that it was difficult to read any information except for the Educational Institution target type. However, with the spectral embedding method, we see that the Police target type is heavily represented at the end of each straight line, which was not the case with PCA. The remaining target types are rather mixed in the center.
In this way, we saw that, due to the different conceptual approaches of each technique, the two methods lead to distinctive representations. In the context of this project, spectral embedding with the Laplacian Eigenmaps algorithm provides a better clustering representation. Nevertheless, both techniques allow us to draw interesting conclusions.
In this analysis, we will try to cluster our data. In particular, we will remove the group label and see whether clustering allows us to recover the groups. This will tell us whether there are some group characteristics, and whether the groups naturally cluster our dataset.
Again, we will drop features that might not give insightful information for the next analysis.
We will add a new feature isCapital, which is 1 if the city is a capital and 0 otherwise. To do this, as there is no Python module for that, we record the capital of each country in our list of countries.
In order to match the capital and city names, we will homogenize both of them. First, let's convert the city names to lower case.
Now, we will use .get_close_matches from the difflib library in order to see whether some capitals are not recognized due to spelling differences.
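The fuzzy matching works as follows; the capital list is a toy stand-in for the one built above:

```python
from difflib import get_close_matches

# A few lower-cased capital names, as produced by the previous step.
capitals = ['kabul', 'baghdad', 'mogadishu', 'paris']

# For misspelled or variant city names, get_close_matches returns the
# closest candidates whose similarity ratio exceeds the cutoff.
for city in ['bagdad', 'mogadiscio']:
    print(city, '->', get_close_matches(city, capitals, n=1, cutoff=0.6))
```

Matches below the cutoff come back as an empty list, so genuinely different city names are not silently mapped to a capital.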
We check manually which matches correspond to capitals and which correspond to other cities, and simply remove the latter. Then, we concatenate the two lists.
Now, we apply a lambda function over the dataset to determine whether a city is a capital.
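The flagging step can be sketched like this; the per-country capital mapping and the column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical country-to-capital mapping built in the previous steps (lower-cased).
capitals = {'afghanistan': 'kabul', 'iraq': 'baghdad'}

df = pd.DataFrame({
    'country': ['afghanistan', 'iraq', 'iraq'],
    'city':    ['kabul', 'mosul', 'baghdad'],
})

# Flag each row whose city is the capital of its country.
df['isCapital'] = df.apply(
    lambda row: 1 if capitals.get(row['country']) == row['city'] else 0,
    axis=1)

print(df['isCapital'].tolist())  # [1, 0, 1]
```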
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 5 most active terrorist groups since 2010.
As before, we use one-hot (binary) encoding for the categorical features Attack type, Region, Target type and Weapon type.
Now we can drop features that we don't need for our analysis. We just save the group labels to compare them with the results at the end of our analysis.
Then we normalize our features as before.
Let's try the K-Means clustering algorithm. First, we will try to find the best k for our data. To do this, we try a range of values of k and then analyze the result graphically. We decide to try k between 1 and 20.
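The elbow search described above can be sketched as follows; the random matrix stands in for the encoded, normalized attack features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # stand-in for the preprocessed attacks

# Fit K-Means for k = 1..20 and record the inertia (within-cluster sum
# of squared distances); an "elbow" in this curve, if one exists,
# suggests a natural number of clusters.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 21)]

print(len(inertias))  # 20 values, one per candidate k
```

Plotting `inertias` against `range(1, 21)` then gives the curve analyzed in the next cell.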
We expected a discontinuity (an "elbow") in the shape of the curve, which would indicate a natural way to cluster our dataset with K-Means, but none appears. This result can be explained by the nature of the algorithm: K-Means assumes spherical clusters, each with roughly an equal number of observations.
We have no way to know whether our data can be split into spherical clusters, but let's check whether all the groups have roughly the same number of observations.
We see that the groups do not have the same number of observations; this can explain our result.
One can think of Gaussian Mixture Models as generalizing K-Means clustering to incorporate information about the covariance structure of the data. Let's see if we can reach a better result.
Here we use the BIC criterion in order to select the number of components in an efficient way.
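The BIC-based selection can be sketched like this; the toy data below has five well-separated blobs, echoing the "top 5 groups" setting, and the candidate range of components is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Five toy Gaussian blobs standing in for the preprocessed attacks.
X = np.vstack([rng.normal(4 * i, 0.5, (60, 2)) for i in range(5)])

# Fit a GMM for each candidate component count and record the Bayesian
# Information Criterion; the lowest BIC indicates the preferred model.
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 11)]

best_k = int(np.argmin(bics)) + 1
print(best_k)  # expected to be close to 5 for these well-separated blobs
```

BIC balances the data likelihood against the number of parameters, which is what makes it "efficient": it penalizes models that only fit better because they have more components.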
Here we can clearly identify a discontinuity in the shape of the curve, which suggests that the best clustering is obtained with k around 5. This is interesting, because we decided to focus only on the 5 most active groups. Let's focus our analysis on the GMM clustering with k = 5.
Let's now see if the GMM with k = 5 is close to the group labels.
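One way to quantify the agreement between the GMM clusters and the held-out group labels is the adjusted Rand index; this metric is an illustrative choice, not necessarily the notebook's, and the data below is synthetic:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy features and toy "group labels" standing in for the real dataset.
X = rng.normal(size=(150, 6))
labels = rng.integers(0, 5, size=150)

clusters = GaussianMixture(n_components=5, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means identical partitions,
# values near 0 mean no better than chance.
ari = adjusted_rand_score(labels, clusters)
print(round(ari, 3))
```

A cross-tabulation of `labels` against `clusters` (e.g. with `pandas.crosstab`) complements the single score by showing which groups, if any, dominate each cluster.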
As we can see, the clustering did not recover the original groups. Even if the number of clusters is around 5, it is not the group label that dominates the clustering. It must be a combination of different factors, for example the number of victims or the attack types, together with the group, that explains this clustering.
All along the data science pipeline, we found that there are multiple perspectives from which we can look at the data we have. Each one proposes a different approach with different results. In particular, we found that:
Educational Institution. Indeed, the analysis showed that the main Educational Institution attacks were made using Chemical material. Hence, looking at the coordinates of these attacks, one can take the necessary measures to protect such facilities, if deemed necessary. While developing this notebook, we also noticed that integrating the coordinates into the PCA did not improve the clustering, meaning that no correlation can be drawn between the coordinates and the remaining variables.
We restricted our attention to the Taliban group, mainly because it contains a large number of datapoints, but any other group that is sufficiently represented could be chosen as well. We could then compare groups to see whether the conclusions we draw are specific to one group or shared across them.