
Read training data


First we read the CSV file and store it as a sequence of maps. Every element in the sequence is then a map, with one key per column.

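A minimal sketch of this step, assuming the `org.clojure/data.csv` dependency and a hypothetical file name `train.csv` with a header row:

```clojure
;; Sketch only: the dependency and the file name "train.csv" are assumptions.
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn csv->maps
  "Reads a CSV file and returns a sequence of maps, one per data row,
   keyed by the column names from the header row."
  [path]
  (with-open [reader (io/reader path)]
    (let [[header & rows] (csv/read-csv reader)]
      ;; doall realizes the lazy sequence before the reader is closed
      (doall (map #(zipmap header %) rows)))))

(def train-data (csv->maps "train.csv"))
```

Using `zipmap` with the header row turns each data row into a map, which gives exactly the sequence-of-maps shape described above.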

We can print the column names by taking the keys of the first row.

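With the sequence of maps from above (the name `train-data` is my assumption), this is a one-liner:

```clojure
;; The column names are the keys of any row map, e.g. the first one.
(keys (first train-data))
```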

To display the table nicely in the notebook, we convert it into hiccup and render it to HTML.

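A sketch of the hiccup conversion, assuming the `hiccup` dependency and the `train-data` sequence from above:

```clojure
;; Sketch: build a hiccup table from row maps, then render it to an
;; HTML string. Dependency and data names are assumptions.
(require '[hiccup.core :as hiccup])

(defn rows->hiccup-table
  "Turns a seq of row maps into a hiccup table; the header row
   comes from the keys of the first map."
  [rows]
  (let [header (keys (first rows))]
    [:table
     [:tr (for [h header] [:th h])]
     (for [row rows]
       [:tr (for [h header] [:td (get row h)])])]))

;; Render only the first few rows to keep the notebook responsive.
(hiccup/html (rows->hiccup-table (take 5 train-data)))
```

Hiccup expands nested seqs in place, so the `for` comprehensions produce the table rows and cells directly.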

So we have around 155,000 training cases.

Let's look at the distribution of the 5 different values of "Sentiment": how many training cases do we have for each?

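Counting the cases per sentiment value is essentially one `frequencies` call; a sketch, assuming the `train-data` sequence and the column name "Sentiment" from above:

```clojure
;; Count how many training cases exist for each "Sentiment" value,
;; sorted by the sentiment label.
(->> train-data
     (map #(get % "Sentiment"))
     frequencies
     (sort-by key))
```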

Exploratory data analysis


Word clouds


Word clouds give a first glimpse into the text data by showing the distribution of words. First we do this for all texts together, and then separately for each sentiment value.

The more often a word appears, the larger it is drawn. Very common stopwords are excluded via a stopword list.

I use the oz library, which draws plots from Vega/Vega-Lite specifications. Such a spec can describe a word cloud built from a sequence of texts.

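A sketch of such a spec as a Clojure map (the `wordcloud` transform is part of Vega, not Vega-Lite, and the sizes, fonts, and field names here are all assumptions):

```clojure
;; Sketch: build a Vega spec that lays out the words of `texts`
;; as a word cloud, sizing each word by its frequency.
(defn wordcloud-spec
  [words]
  {:$schema "https://vega.github.io/schema/vega/v5.json"
   :width  600
   :height 400
   :data [{:name "words"
           :values (for [[w n] (frequencies words)]
                     {:text w :count n})
           :transform [{:type "wordcloud"
                        :size [600 400]
                        :text {:field "text"}
                        :fontSize {:expr "datum.count"}
                        :fontSizeRange [12 56]}]}]
   :marks [{:type "text"
            :from {:data "words"}
            :encode {:enter  {:text {:field "text"}
                              :align {:value "center"}
                              :baseline {:value "alphabetic"}}
                     :update {:x {:field "x"}
                              :y {:field "y"}
                              :angle {:field "angle"}
                              :fontSize {:field "fontSize"}}}}]})
```

The resulting map can then be handed to oz for rendering in the notebook.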

All text word cloud


Word clouds for each sentiment


In order to create the vocabulary, we first need to tokenize the text and get overall counts for each token.

These counts can then be used to filter out rare or very frequent tokens.
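The steps above can be sketched as follows; the tokenization rule (lower-case, split on non-letters) and the count thresholds are my assumptions:

```clojure
;; Sketch: tokenize, count tokens over the whole corpus, and keep
;; only tokens whose overall count lies between two thresholds.
(require '[clojure.string :as str])

(defn tokenize
  "Lower-cases the text and splits it on runs of non-letters."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z]+")
       (remove str/blank?)))

(defn token-counts
  "Overall count of each token across all texts."
  [texts]
  (frequencies (mapcat tokenize texts)))

(defn vocabulary
  "Tokens occurring at least min-count and at most max-count times."
  [texts min-count max-count]
  (->> (token-counts texts)
       (filter (fn [[_ n]] (<= min-count n max-count)))
       (map key)
       set))
```

Raising `min-count` drops rare tokens (often typos), while lowering `max-count` drops near-stopwords that survived the stopword list.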
