Introduction to Machine Learning


This lab introduces some basic concepts of machine learning with Python. In this lab you will use the K-Nearest Neighbor (KNN) algorithm to classify the species of iris flowers, given measurements of flower characteristics. By completing this lab you will have an overview of an end-to-end machine learning modeling process.

By the completion of this lab, you will:

  1. Follow and understand a complete end-to-end machine learning process including data exploration, data preparation, modeling, and model evaluation.
  2. Develop a basic understanding of the principles of machine learning and associated terminology.
  3. Understand the basic process for evaluating machine learning models.

Overview of KNN classification


Before discussing a specific algorithm, it helps to know a bit of machine learning terminology. In supervised machine learning a set of cases is used to train, test and evaluate the model. Each case consists of the values of one or more features and a label value. The features are variables used by the model to predict the value of the label. Minimizing the errors between the true value of the label and the prediction supervises the training of this model. Once the model is trained and tested, it can be evaluated based on its accuracy in predicting the labels of a new set of cases.

In this lab you will use randomly selected cases to first train and then evaluate a k-nearest-neighbor (KNN) machine learning model. The goal is to predict the type or class of the label, which makes the machine learning model a classification model.

The k-nearest-neighbor algorithm is conceptually simple. In fact, there is no formal training step. Given a known set of cases, a new case is classified by majority vote of the K (where @@0@@, etc.) points nearest to the values of the new case; that is, the nearest neighbors of the new case.

The schematic figure below illustrates the basic concepts of a KNN classifier. In this case there are two features, the values of one shown on the horizontal axis and the values of the other shown on the vertical axis. The cases are shown on the diagram as one of two classes, red triangles and blue circles. To summarize, each case has a value for the two features, and a class. The goal of the KNN algorithm is to classify cases with unknown labels.

Continuing with the example, on the left side of the diagram the @@1@@ case is illustrated. The nearest neighbor is a red triangle. Therefore, this KNN algorithm will classify the unknown case, '?', as a red triangle. On the right side of the diagram, the @@2@@ case is illustrated. There are three near neighbors within the circle. The majority of nearest neighbors for @@3@@ are the blue circles, so the algorithm classifies the unknown case, '?', as a blue circle. Notice that the class predicted for the unknown case changes as K changes. This behavior is inherent in the KNN method.

**KNN for k = 1 and k = 3**
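
The majority vote described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the lab: the function name knn_predict, the four sample points, and the use of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest known cases."""
    # Euclidean distance from the query to every known case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest cases and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature cases, loosely mirroring the schematic figure.
points = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (5.0, 5.0)]
labels = ["triangle", "circle", "circle", "triangle"]
query = (1.5, 1.5)

print(knn_predict(points, labels, query, k=1))  # triangle
print(knn_predict(points, labels, query, k=3))  # circle
```

With these made-up points the single nearest neighbor is a triangle, but two of the three nearest neighbors are circles, so the predicted class changes with K just as in the figure.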

There are some additional considerations in creating a robust KNN algorithm. These will be addressed later in this course.

Examine the data set


In this lab you will work with the Iris data set. This data set is famous in the history of statistics. The pioneering statistician Ronald A. Fisher first published an analysis of these data in 1936. Fisher proposed an algorithm to classify the species of iris flowers from physical measurements of their characteristics. The data set has been used as a teaching example ever since.

Now, you will load and examine these data, which are available through the statsmodels.api package. Execute the code in the cell below and examine the first few rows of the data frame.
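
As a rough sketch of this loading step, the snippet below builds an equivalent data frame. It is an assumption-laden stand-in: it uses the copy of the Iris data bundled with scikit-learn (load_iris) instead of the statsmodels loader so that it runs offline, and the column names are chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the statsmodels loader.
raw = load_iris()

# Hypothetical column names for the four measurements.
iris = pd.DataFrame(
    raw.data,
    columns=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"],
)
# Map the integer targets back to species names to get a string label column.
iris["Species"] = raw.target_names[raw.target]

print(iris.head())
```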


There are four features, containing the dimensions of parts of the iris flower structures. The label column is the Species of the flower. The goal is to create and test a KNN algorithm to correctly classify the species.

Next, you will execute the code in the cell below to show the data types of each column.


The features are all numeric, and the label is a categorical string variable.

Next, you will determine the number of unique categories, and number of cases for each category, for the label variable, Species. Execute the code in the cell below and examine the results.
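
One way to get these counts with pandas is sketched below. The data are again taken from scikit-learn's bundled copy of Iris, which is an assumption of this example and not necessarily the loader used in the lab.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the label column is rebuilt from scikit-learn's bundled Iris data.
raw = load_iris()
species = pd.Series(raw.target_names[raw.target], name="Species")

print(species.nunique())       # number of unique categories
print(species.value_counts())  # number of cases in each category
```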


You can see there are three species of iris, each with 50 cases.

Next, you will create some plots to see how well the classes are separated by the values of the features. In an ideal case, the label classes will be perfectly separated by one or more of the feature pairs. In the real world this ideal situation will rarely, if ever, be the case.

There are six possible pair-wise scatter plots of these four features. For now, we will just create scatter plots of two variable pairs. Execute the code in the cell below and examine the resulting plots.

Note: Data visualization and the Seaborn package are covered in another lesson.
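
A possible version of these scatter plots is sketched below. It swaps in plain matplotlib for Seaborn to keep the example small, and the two feature pairs shown are an arbitrary choice, not necessarily the pairs used in the lab.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, with features as array columns.
raw = load_iris()
X, y = raw.data, raw.target

# Two of the six possible feature pairs, given as column indices into X.
pairs = [(0, 1), (2, 3)]
fig, axes = plt.subplots(1, len(pairs), figsize=(10, 4))
for ax, (i, j) in zip(axes, pairs):
    # Draw one scatter layer per species so each class gets its own color.
    for code, name in enumerate(raw.target_names):
        mask = y == code
        ax.scatter(X[mask, i], X[mask, j], label=name, alpha=0.7)
    ax.set_xlabel(raw.feature_names[i])
    ax.set_ylabel(raw.feature_names[j])
    ax.legend()
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call plt.show() afterwards.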


Examine these results, noticing the separation, or overlap, of the label values.

Then, answer Question 1 on the course page.

Prepare the data set


Data preparation is an important step before training any machine learning model. These data require only two preparation steps:

  • Scale the numeric values of the features. It is important that numeric features used to train machine learning models have a similar range of values. Otherwise, features which happen to have large numeric values may dominate model training, even if other features with smaller numeric values are more informative. In this case Z-score normalization is used. This normalization process scales each feature so that the mean is 0 and the variance is 1.0.
  • Split the dataset into randomly sampled training and evaluation data sets. The random selection of cases seeks to limit the leakage of information between the training and evaluation cases.

The code in the cell below normalizes the features by these steps:

  • The scale function from sklearn.preprocessing (part of scikit-learn) is used to normalize the features.
  • Column names are assigned to the resulting data frame.
  • A statistical summary of the data frame is then printed.

Note: Data preparation with scikit-learn is covered in another lesson.

Execute this code and examine the results.
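
The normalization steps listed above might look like the following sketch. The column names and the use of scikit-learn's bundled Iris data are assumptions of the example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

# Hypothetical column names for the four numeric features.
num_cols = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
features = pd.DataFrame(load_iris().data, columns=num_cols)

# scale() subtracts each column's mean and divides by its standard deviation.
# It returns a numpy array, so the column names are assigned again.
iris_scaled = pd.DataFrame(scale(features), columns=num_cols)

print(iris_scaled.describe().round(3))
```

Note that scale divides by the population standard deviation, while describe reports the sample standard deviation, which is why the printed value is close to, but not exactly, 1.0.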


Examine these results. The mean of each column is zero and the standard deviation is approximately 1.0.

The methods in the scikit-learn package require numeric numpy arrays as arguments. Therefore, the strings indicating species must be re-coded as numbers. The code in the cell below does this using a dictionary lookup. Execute this code and examine the head of the data frame.
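
A dictionary lookup of this kind can be sketched as follows. The four-row series and the particular numeric codes are made up for illustration; any consistent mapping would do.

```python
import pandas as pd

# A tiny stand-in for the Species column of the data frame.
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="Species")

# Hypothetical coding of each species string as an integer.
levels = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Look up each string in the dictionary to get its numeric code.
species_codes = species.map(levels)

print(species_codes.tolist())  # [0, 1, 2, 0]
```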


Now, you will split the dataset into training and testing subsets. The code in the cell below makes a random split, then separates the features and labels into numpy arrays. The dimensions of each array are printed as a check. Execute this code to create these subsets.

Note: Splitting data sets for machine learning with scikit-learn is discussed in another lesson.
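
A random split of this kind could be written as in the sketch below. The split size of 75 test cases and the random seed are assumptions of the example, since the lab's actual values are not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features and numeric labels as numpy arrays (scikit-learn's bundled Iris data).
X, y = load_iris(return_X_y=True)

# Assumption: hold out 75 of the 150 cases for testing, with a fixed seed
# so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=75, random_state=3456
)

# Print the dimensions of each array as a check.
for name, arr in [("X_train", X_train), ("y_train", y_train),
                  ("X_test", X_test), ("y_test", y_test)]:
    print(name, arr.shape)
```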


Train and evaluate the KNN model


With some understanding of the relationships between the features and the label, and with preparation of the data completed, you will now train and evaluate a @@0@@ model. The code in the cell below does the following:

  • The KNN model is defined as having @@1@@.
  • The model is trained using the fit method with the feature and label numpy arrays as arguments.
  • A summary of the model is displayed.

Execute this code and examine the summary of these results.

Note: Constructing machine learning models with scikit-learn is covered in another lesson.
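
The steps listed above might be sketched with scikit-learn's KNeighborsClassifier as follows. The value of n_neighbors, the split size, and the seed are assumptions chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

# Prepare the data as described earlier: scale the features, then split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    scale(X), y, test_size=75, random_state=0
)

# Assumption: K = 3 is used here purely as an example value.
knn_model = KNeighborsClassifier(n_neighbors=3)

# "Training" a KNN model amounts to storing the training cases.
knn_model.fit(X_train, y_train)

print(knn_model)
```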


Next, you will evaluate this model using the accuracy statistic and a set of plots. The following steps create model predictions and compute accuracy:

  • The predict method is used to compute KNN predictions from the model using the test features as an argument.
  • The predictions are scored as correct or not using a list comprehension.
  • Accuracy is computed as the percentage of the test cases correctly classified.

Execute this code, examine the results, and answer Question 2 on the course page.
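
The scoring steps above could be sketched as shown below. The model settings, split size, and seed are assumptions of the example, so the accuracy it prints will not necessarily match the lab's result.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

# Assumed setup: scaled features, a 75/75 random split, and K = 3.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    scale(X), y, test_size=75, random_state=0
)
knn_model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Compute predictions for the test features.
predicted = knn_model.predict(X_test)

# Score each test case as 1 (correct) or 0 (incorrect) with a list comprehension.
correct = [1 if p == t else 0 for p, t in zip(predicted, y_test)]

# Accuracy is the percentage of test cases classified correctly.
accuracy = 100.0 * sum(correct) / len(correct)
print(f"Accuracy: {accuracy:.1f}%")
```

The same figure, as a fraction, is available from knn_model.score(X_test, y_test).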


The accuracy is pretty good.

Now, execute the code in the cell below and examine plots of the classifications of the iris species.


In the plots above color is used to show the predicted class. Correctly classified cases are shown by triangles and incorrectly classified cases are shown by circles.

Examine the plot and answer Question 3 on the course page.



Summary


In this lab you have created and evaluated a KNN machine learning classification model. Specifically you have:

  1. Loaded and explored the data using visualization to determine if the features separate the classes.
  2. Prepared the data by normalizing the numeric features and randomly sampling into training and testing subsets.
  3. Constructed and evaluated the machine learning model. Evaluation was performed statistically, with the accuracy metric, and with visualization.