Let's add some labels after importing too.
This looks promising...
Our data can be represented as @@0@@, for all @@1@@.
We can use a linear model to represent our data. A single point would be represented as follows (@@2@@ is a random variable representing the error): @@3@@
What we want to do is find constants for @@4@@ that will best predict @@5@@ for @@6@@ (minimise the error).
Our model, error, squared error:
We want to minimise our total error (loss), so we define an error function that sums over all of our data and returns the total squared error:
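As a rough sketch, the total squared error can be written as a small Python function. The names `total_squared_error`, `xs` and `ys`, and the toy data, are assumptions made for illustration; `m` and `c` are the slope and intercept of our line.

```python
def total_squared_error(m, c, xs, ys):
    """Sum of squared differences between the targets and the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# At the true slope and intercept there is no error at all
perfect_fit = total_squared_error(2.0, 1.0, xs, ys)

# Any other line gives a strictly larger loss on this data
worse_fit = total_squared_error(1.0, 0.0, xs, ys)
```

The loss is a function of `m` and `c` only: the data stays fixed, and we search over the two parameters.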
Then, we minimise this function. We take partial derivatives with respect to @@9@@ and @@10@@, set them equal to zero and solve (using the chain rule).
Setting these equal to zero and solving, we get:
So, to minimise our total squared error, and thus determine the coefficients for our model, we choose m and c as given by the above equations.
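A minimal sketch of this step in Python, assuming the standard closed-form result of the minimisation: the slope is the sum of products of the x and y deviations from their means divided by the sum of squared x deviations, and the intercept follows from the means. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, c) minimising the total squared error of y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)

    m = sxy / sxx              # slope
    c = y_mean - m * x_mean    # intercept: the fitted line passes through the means
    return m, c


# Hypothetical toy data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
```

On noise-free data like this the fit recovers the generating slope and intercept; with noisy data it returns the line with the smallest total squared error.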
What if we want to fit more than one variable? What if our data isn't randomly generated to be perfectly linear? It turns out that a polynomial of any degree is linear with respect to its coefficients, so a similar method will work! We just need to reformulate the problem slightly.
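To see why a polynomial is still a linear problem, here is a small sketch: the powers of x become the columns of a matrix, and only the coefficients are unknown. It uses NumPy's built-in least-squares solver, `np.linalg.lstsq`, as a stand-in for the derivation; the data and variable names are assumptions.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Each column is a fixed function of x (1, x, x^2); the model is linear in the coefficients
X = np.vander(x, N=3, increasing=True)

# Least-squares solve for the coefficients of 1, x and x^2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The curve is not a straight line in x, yet the fitting problem has the same linear structure as before.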
Let's say we have some data with n features. We can represent the features as a vector of feature vectors, @@0@@, and their coefficients as @@1@@:
We can express our problem (generally) as finding the vector @@3@@ such that @@4@@ is fitted to our data.
The second formula is the general form for @@6@@ observations. We can see that, with @@7@@, it strongly resembles our model from earlier:
Note that the explicit equivalent of @@9@@ has disappeared; this is because each @@10@@ is a vector containing the data for that feature. The parameters, @@11@@, are constants.
@@13@@ and @@14@@ are of length @@15@@ (we have @@16@@ data points). @@17@@ is defined as all ones; we will need to add this to our data manually. This ensures we keep our y-intercept (@@18@@).
We can express the above succinctly using the dot product of the matrices @@19@@. This is our model.
Each row of the dot product corresponds to one data point (the m'th row to the m'th observation), combining all of its features.
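A minimal NumPy sketch of the model, assuming the data is held as one row per observation and one column per feature. The names `features`, `X`, `beta` and `predictions`, and all of the values, are illustrative.

```python
import numpy as np

# Hypothetical data: 4 observations (rows) of 2 features (columns)
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])

# Prepend a column of ones so the first coefficient acts as the y-intercept
X = np.column_stack([np.ones(len(features)), features])

# Illustrative parameters: intercept first, then one coefficient per feature
beta = np.array([0.5, 2.0, -1.0])

# The model: one predicted value per observation
predictions = X @ beta
```

Without the column of ones, every fitted model would be forced through the origin.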
We can then rewrite our input data as observations and target data:
Because we are using a dataframe, we can simply note that our target data is labeled 'y'.
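As a sketch of that split with pandas, assuming a dataframe whose target column is labeled 'y': the feature column names `x1` and `x2`, the intercept column name `x0` and the values are all made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe: two feature columns and the target column 'y'
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [0.5, 1.5, 2.5],
    "y": [2.0, 4.0, 6.0],
})

# Target data: just the column labeled 'y'
y = df["y"].to_numpy()

# Observations: every other column, plus a leading column of ones for the intercept
X = df.drop(columns="y")
X.insert(0, "x0", 1.0)
X = X.to_numpy()
```

`drop` returns a new dataframe, so the original `df` is left unchanged.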
Next, our loss (error) function, defined for parameters @@1@@, is calculated from the difference between the response of our model and the target data:
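In code, the loss might look like the sketch below. It assumes the same total squared error as in the single-variable case, with no averaging or scaling factor; the names `loss`, `X`, `y` and `beta` and the data are assumptions.

```python
import numpy as np


def loss(beta, X, y):
    """Total squared error between the model response X @ beta and the targets y."""
    residuals = X @ beta - y
    return residuals @ residuals  # dot product of a vector with itself: sum of squares


# Hypothetical data on the line y = 1 + 2x; the first column of X is all ones
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

loss_at_truth = loss(np.array([1.0, 2.0]), X, y)  # the generating parameters
loss_at_zero = loss(np.zeros(2), X, y)            # all parameters set to zero
```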
In the same manner as before, we take the partial derivative with respect to @@3@@ (chain rule, product rule).
This gives us the gradient at the m'th data point for all observations
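A sketch of the gradient in NumPy, assuming the unscaled sum-of-squares loss, for which the derivative works out to twice the transposed data matrix times the residuals. A central finite-difference estimate is included as a sanity check. All names and data are illustrative.

```python
import numpy as np


def loss(beta, X, y):
    """Total squared error between X @ beta and y."""
    residuals = X @ beta - y
    return residuals @ residuals


def gradient(beta, X, y):
    """Gradient of the total squared error with respect to beta."""
    return 2.0 * X.T @ (X @ beta - y)


# Hypothetical data on the line y = 1 + 2x; the first column of X is all ones
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
beta = np.zeros(2)

analytic = gradient(beta, X, y)

# Numerical check: nudge each parameter up and down and measure the change in loss
eps = 1e-6
numeric = np.array([
    (loss(beta + eps * e, X, y) - loss(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(beta))
])

gradient_at_truth = gradient(np.array([1.0, 2.0]), X, y)
```

The gradient vanishes at the parameters that generated the data, which is exactly the condition we set when minimising the loss.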