The next line reads the CSV file and stores it as a sequence of maps, so every element in the sequence is a map with one key per column.
We can print the column names by taking the keys of the first row.
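As a minimal sketch, assuming the `clojure.data.csv` library and a tab-separated training file named `train.tsv` (the notebook's actual reader and file name may differ), the two steps could look like:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn csv->maps
  "Reads a delimited file and returns a sequence of maps,
   one map per row, keyed by the column names in the header."
  [path]
  (with-open [reader (io/reader path)]
    (let [[header & rows] (doall (csv/read-csv reader :separator \tab))]
      (mapv #(zipmap header %) rows))))

(def data (csv->maps "train.tsv"))

;; the column names are the keys of the first row-map
(keys (first data))
```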
To view the table nicely in the notebook, we convert it to hiccup and render it as HTML.
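One way this conversion could be sketched, assuming the hiccup library (the notebook's own rendering helper may differ):

```clojure
(require '[hiccup.core :as hiccup])

(defn table->hiccup
  "Turns a sequence of row-maps into a hiccup table:
   one header cell per column, one row per map."
  [rows]
  [:table
   [:tr (for [k (keys (first rows))] [:th k])]
   (for [row rows]
     [:tr (for [v (vals row)] [:td v])])])

;; render a small sample to an HTML string
(hiccup/html (table->hiccup [{"Phrase" "a movie" "Sentiment" "1"}]))
```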
So we have around 155,000 training cases.
Let's look at the distribution over the 5 possible values of "Sentiment": how many rows do we have for each?
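This count can be computed directly with `frequencies`; here is a sketch on toy data standing in for the parsed CSV:

```clojure
;; toy stand-in for the parsed training data
(def data [{"Sentiment" "0"} {"Sentiment" "1"} {"Sentiment" "1"}])

;; number of rows per sentiment value
(frequencies (map #(get % "Sentiment") data))
;; => {"0" 1, "1" 2}
```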
Word clouds give a first glimpse into the text data and show the distribution of words. First we do this for all texts, and then separately for each sentiment value.
A word is drawn larger the more often it appears. Very common stopwords are excluded via a stopword list.
I use the oz library, which uses the vega/vega-lite specification to draw plots. The following is such a spec to draw a word cloud given a sequence of texts.
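As a rough sketch of what such a spec could look like as a Clojure map, using vega's `countpattern` and `wordcloud` transforms (the sizes, token pattern, and stopword list here are assumptions; the notebook's actual spec may differ):

```clojure
(defn wordcloud-spec
  "Builds a vega spec that tokenizes `texts` with countpattern,
   drops a few common stopwords, and sizes each word by its count."
  [texts]
  {:$schema "https://vega.github.io/schema/vega/v5.json"
   :width 600 :height 400
   :data [{:name "table"
           :values (map (fn [t] {:text t}) texts)
           :transform [{:type "countpattern" :field "text"
                        :case "lower"
                        :pattern "[\\w']{3,}"
                        ;; assumed stopword regex; the real list is longer
                        :stopwords "(the|and|for|with|this|that)"}]}]
   :marks [{:type "text"
            :from {:data "table"}
            :encode {:enter {:text {:field "text"}
                             :align {:value "center"}
                             :baseline {:value "alphabetic"}}}
            :transform [{:type "wordcloud"
                         :size [600 400]
                         :text {:field "text"}
                         :fontSize {:field "datum.count"}
                         :fontSizeRange [12 56]
                         :padding 2}]}]})
```

The resulting map can then be handed to oz's view function to render the plot in the notebook.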
In order to create the vocabulary, we first need to tokenize the texts and get overall counts for each token.
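A simple regex-based tokenizer plus `frequencies` is enough for this step; this is a sketch, and the notebook's actual tokenizer may differ:

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lower-cases the text and extracts runs of letters/apostrophes."
  [text]
  (re-seq #"[a-z']+" (str/lower-case text)))

;; overall counts across a toy corpus
(def token-counts
  (frequencies (mapcat tokenize ["A good movie" "a bad movie"])))
;; => {"a" 2, "good" 1, "movie" 2, "bad" 1}
```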
These counts can then be used to filter out rare or very frequent tokens.
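Filtering the vocabulary by count bounds could be sketched like this (the bounds are placeholder values, not the notebook's actual thresholds):

```clojure
(defn filter-vocab
  "Keeps only tokens whose count lies in [min-count, max-count]."
  [token-counts min-count max-count]
  (into {} (filter (fn [[_ c]] (<= min-count c max-count)) token-counts)))

;; drops the very frequent "the" and the rare "rare"
(filter-vocab {"the" 10000 "rare" 1 "movie" 50} 2 1000)
;; => {"movie" 50}
```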