#########################

WARNING

PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features (this count includes the degree-0 bias feature x^0 == 1, which is dropped when include_bias=False), where n! is the factorial of n, equal to 1 × 2 × 3 × ... × n. Beware of the combinatorial explosion of the number of features!

One feature a with degree=3 ==> PolynomialFeatures / poly_features.transform() ==> (1+3)! / (1! × 3!) == 4 features: 1, a, a^2, a^3

One feature a with degree=2 ==> (1+2)! / (1! × 2!) == 3 features: [1, a, a^2]

==Hiding a^0 = 1 (*include_bias=False*)==> drops the constant feature, so

Two features a, b with degree=2 ==> (2+2)! / (2! × 2!) == 6 features: **1**, a, **b**, a*b, **a^2**, b^2
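The feature counts above follow directly from the combination formula. A small sketch (`n_output_features` is my own helper, not scikit-learn API, but it matches what PolynomialFeatures produces):

```python
from math import comb

def n_output_features(n_features: int, degree: int, include_bias: bool = True) -> int:
    # number of monomials of total degree <= d in n variables: C(n + d, d) = (n + d)! / (d! n!)
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1  # drop the constant feature x^0 == 1

# one feature a, degree 3 -> 1, a, a^2, a^3
print(n_output_features(1, 3))                       # 4
# two features a, b, degree 2 -> 1, a, b, a^2, ab, b^2
print(n_output_features(2, 2))                       # 6
# same, with include_bias=False -> the constant 1 is hidden
print(n_output_features(2, 2, include_bias=False))   # 5
```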

#########################

### Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. For example, Figure 4-14 applies a 300-degree polynomial model to the preceding training data, and compares the result with a pure linear model and a quadratic model (2nd-degree polynomial). Notice how the 300-degree polynomial model wiggles around to get as close as possible to the training instances.

```
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# X, y, X_new (and matplotlib.pyplot as plt) come from the earlier cells

for style, width, degree in (("y-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline([
        ("poly_features", polybig_features),  # adds new features based on degree and the specified features
        ("std_scaler", std_scaler),
        ("lin_reg", lin_reg),
    ])
    # https://blog.csdn.net/Linli522362242/article/details/103387527
    # When you call the pipeline's fit() method, it calls fit_transform() sequentially on all
    # transformers, passing the output of each call as the parameter to the next call, until it
    # reaches the final estimator, for which it just calls the fit() method.
    polynomial_regression.fit(X, y)                  # fit_transform(X), then lin_reg.fit(X_poly, y)
    y_newbig = polynomial_regression.predict(X_new)  # transform(X_new), then lin_reg.predict
    plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

plt.plot(X, y, "b.", linewidth=3)
plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.title("Figure 4-14. High-degree Polynomial Regression")
plt.show()
```

Of course, this high-degree Polynomial Regression model is severely **overfitting** the training data, while the linear model is **underfitting** it. **The model that will generalize best in this case is the quadratic model.** That makes sense, since the data was generated using a quadratic model, but in general you won't know what function generated the data, so how can you decide how complex your model should be? **How can you tell that your model is overfitting or underfitting the data?**

In (https://blog.csdn.net/Linli522362242/article/details/103587172) you **used cross-validation to get an estimate of a model's generalization performance. If a model performs well on the training data (RMSE is 0.0 for the DecisionTreeRegressor) but generalizes poorly according to the cross-validation metrics (RMSE is 70644.94463282847 ± 2939 for the DecisionTreeRegressor with cross-validation, cv=10), then your model is overfitting.**

**################################DecisionTreeRegressor**

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# housing_prepared and housing_label come from the earlier housing notebook

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_label)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_label, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # 0.0 -- the training error
```

Does the model seem perfect? Cross-validation tells a different story:

```
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_label,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
```

**overfitting** ################################# DecisionTreeRegressor & Cross-Validation

**If it performs poorly on both, then it is underfitting. This is one way to tell when a model is too simple or too complex**.
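The rule of thumb above can be sketched as a tiny helper (plain Python; the function name and thresholds are mine, purely illustrative, not from the book):

```python
def diagnose(train_rmse, cv_rmse, acceptable_rmse, gap_factor=2.0):
    # good on training but much worse under cross-validation -> overfitting
    if train_rmse <= acceptable_rmse and cv_rmse > gap_factor * max(train_rmse, 1e-9):
        return "overfitting"
    # poor on both training and cross-validation -> underfitting (model too simple)
    if train_rmse > acceptable_rmse and cv_rmse > acceptable_rmse:
        return "underfitting"
    return "reasonable fit"

# the DecisionTreeRegressor numbers from the text
print(diagnose(train_rmse=0.0, cv_rmse=70644.9, acceptable_rmse=50000))      # overfitting
# the LinearRegression numbers from the text
print(diagnose(train_rmse=68628.0, cv_rmse=69052.0, acceptable_rmse=50000))  # underfitting
```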

**##########################################LinearRegression**

```
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_label)
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_label, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
```

```
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_label, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
```

**underfitting**

Okay, this is better than nothing but clearly not a great score: most districts' **median_housing_values range between $120,000 and $264,000, so a typical prediction error of $68,628 is not very satisfying**. This is an example of a model **underfitting** the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to **select a more powerful model**, to feed the training algorithm with better features, or **to reduce the constraints on the model**. **This model is not regularized**, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let's try a more complex model to see how it does.

**################################ LinearRegression & Cross-Validation**

**Another way is to look at the learning curves**: these are plots of the model's performance on the training set and the validation set as a function of the training set size. **To generate the plots, simply train the model several times on different sized subsets of the training set**. The following code defines a function that plots the learning curves of a model given some training data:

```
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_cures(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):  # different sizes of the training set
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    # the x-axis is simply the index of train_errors, i.e. the training set size
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="Training set")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)

lin_reg = LinearRegression()
plot_learning_cures(lin_reg, X, y)
plt.axis([0, 80, 0, 3])
plt.title("Figure 4-15. Learning curves")
plt.show()
```

This deserves a bit of explanation. First, let's look at the performance on the training data: when there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because **the data is noisy and because it is not linear at all**. So **the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn't make the average error much better or worse.** Now let's look at the performance of the model on the validation data. When the model is trained on very few training instances, **it is incapable of generalizing properly**, which is why **the validation error is initially quite big**. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again **a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve**.

These learning curves are typical of an **underfitting** model. Both curves have reached a plateau; **they are close together and fairly high (RMSE)**.

###############################

TIP: **If your model is underfitting the training data**, adding more training examples will not help. You need to use a more complex model or **come up with** better features.

###############################

Now let’s look at the learning curves of a 10th-degree polynomial model on the same data (Figure 4-16):

```
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_cures(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
plt.title("Figure 4-16. Learning curves for the polynomial model")
plt.show()
```

These learning curves look a bit like the previous ones, but there are two very important differences:

- The **error (RMSE)** on the training data is **much lower than** with the Linear Regression model.
- There is a **gap (overfitting)** between the curves. This means that the model performs significantly better on the training data than on the validation data, which is **the hallmark of an overfitting model**. However, if you used a **much larger training set**, the two curves would continue to get closer.

###################

TIP: **One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.**

###################

######################################**THE BIAS/VARIANCE TRADEOFF**

An important theoretical result of statistics and Machine Learning is the fact that **a model’s generalization error** can be expressed as the sum of three very different errors:

**Bias**: This part of the generalization error is due to wrong assumptions, such as **assuming that the data is linear when it is actually quadratic**. A **high-bias** model is most likely to **underfit** the training data.

**Variance**: This part is due to the model's excessive sensitivity to small variations in the training data. A model with **many degrees of freedom** (such as a **high-degree** polynomial model) is likely to have **high variance**, and thus to **overfit** the training data.

**Irreducible error**: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

**Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance.** This is why it is called a tradeoff.
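For a model $\hat{f}$ trained on a random training set, the decomposition above can be written out (this is the standard textbook result, not an equation from this chapter's text): with true function $f$ and noise variance $\sigma^2$,

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(\mathbb{E}\left[\hat{f}(x)\right] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$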

######################################

*****************************************************************************extra materials

**Regularization** is one approach to tackle the problem of **overfitting** by **adding additional information**, and thereby **shrinking the parameter values of the model** to **induce a penalty against complexity**.

**Ridge regression** is an L2 penalized model where we simply add the **squared sum of the weights** to our **least-squares cost function** (which differs from the MSE (Mean Squared Error) only by the 1/m averaging factor):

https://towardsdatascience.com/ridge-regression-for-better-usage-2f19b3a202db

We've seen the equation of multiple linear regression both in general terms and in matrix form. It can also be written as:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \mathrm{RSS}(\beta) = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2$$

Here *argmin* means 'Argument of the Minimum': the value that makes the function attain its minimum. In this context, it finds the *β*'s that **minimize the RSS (Residual Sum of Squares)**. And we know how to get the *β*'s from the matrix formula ($\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$). Now, the question becomes: "What does this have to do with ridge regression?"

Ridge regression adds a squared 'L2 norm' penalty to the RSS: $\hat{\beta}^{\,ridge} = \underset{\beta}{\operatorname{argmin}} \bigl( \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \bigr)$

https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

In equation (1.1) above we have the linear model based on **p features**:

$$\hat{y} = w_1 x_1 + \dots + w_p x_p + b \qquad (1.1)$$

Considering only a single feature, as you have probably already understood, *w₁* is the slope and *b* represents the intercept. Linear regression looks for the *w* and *b* that minimize the cost function. Assuming the data set has **M instances and p features**, the cost function can be written as

$$\sum_{i=1}^{M} \Bigl( y^{(i)} - \bigl( b + \sum_{j=1}^{p} w_j x_j^{(i)} \bigr) \Bigr)^2 \qquad (1.2)$$

**Note:** defining a constant feature $x_0 = 1$ lets us absorb the intercept into the weights ($w_0 = b$, since $w_0 x_0 = w_0$), so the inner sum can equivalently start at j = 0.

**Ridge Regression:** In ridge regression, the cost function is altered by **adding a penalty** equivalent to the **square of the magnitude of the coefficients**:

$$\sum_{i=1}^{M} \Bigl( y^{(i)} - \bigl( b + \sum_{j=1}^{p} w_j x_j^{(i)} \bigr) \Bigr)^2 + \alpha \sum_{j=1}^{p} w_j^2 \qquad (1.3)$$

The **hyperparameter α controls how much you want to regularize the model**. If **α = 0**, Ridge Regression is just **Linear Regression**. If **α is very large**, then **all weights end up very close to zero** and **the result is a flat line going through the data's mean.**

This is equivalent to minimizing the cost function in equation 1.2 under the condition $\sum_{j=1}^{p} w_j^2 \le C$, where C is a constant.

By **increasing** the value of the **hyperparameter α**, we **increase** the **regularization strength** (for **tackling the problem of overfitting**) and **shrink the weights of the model** (reduce the values of the $w_j$), since we want the whole function, including the penalty $\alpha \sum_j w_j^2$, to attain its minimum. Note that without regularization the values of the $w_j$ can be unstable or grow very large.

Please note that we don't regularize the intercept term *b*: the penalty sum in equation 1.3 starts at j = 1, not j = 0.

*****************************************************************************

**Regularized Linear Models**

A good way to **reduce overfitting is to regularize** the model (i.e., to **constrain** it): **the fewer degrees of freedom it has (lower variance), the harder it will be for it to overfit the data.** For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees. **For a linear model**, regularization is typically achieved by **constraining the weights of the model** (the weights are the coefficients, written $\beta_j$, $w_j$, or $\theta_j$ depending on the source). We will now look at **Ridge Regression, Lasso Regression, and Elastic Net**, which implement three different ways to constrain the weights.

### Ridge Regression

Ridge Regression (also called *Tikhonov regularization*) **is a regularized version of Linear Regression**: a **regularization term** equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function (MSE). **This forces the learning algorithm to not only fit the data but also keep the model weights *as small as possible*. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model's performance using the unregularized performance measure.**
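As a minimal sketch of that distinction (plain Python; the helper names are mine): the training objective includes the penalty, while the performance measure you report is the plain, unregularized MSE:

```python
def mse(y_true, y_pred):
    # plain (unregularized) performance measure -- use this for evaluation
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ridge_training_cost(y_true, y_pred, weights, alpha):
    # training objective: MSE plus the l2 penalty (bias excluded, 1/2 factor omitted)
    return mse(y_true, y_pred) + alpha * sum(w * w for w in weights)

y_true, y_pred, weights = [1.0, 2.0], [1.0, 1.0], [0.5]
print(mse(y_true, y_pred))                                       # 0.5  -- evaluate with this
print(ridge_training_cost(y_true, y_pred, weights, alpha=1.0))   # 0.75 -- train with this
```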

#############################################################

NOTE

It is quite common for the **cost function** used during **training** to **be different from the performance measure** used for **testing**. Apart from regularization, another reason why they might be different is that **a good training cost function** should have **optimization-friendly derivatives**, while the performance measure used for testing should be **as close as possible to the final objective**. A good example of this is a classifier trained using a cost function such as **the log loss** (discussed in a moment) but **evaluated using precision/recall**.

#############################################################

The **hyperparameter α** controls **how much you want to regularize the model**. If **α = 0**, then Ridge Regression is just **Linear Regression**. If **α is very large**, then **all weights end up very close to zero** and **the result is a flat line going through the data's mean.** Equation 4-8 presents the Ridge Regression cost function:

Equation 4-8. Ridge Regression cost function

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$$

**Note: the ½ factor is only there for convenient computation (it cancels when differentiating); it can be dropped in practical applications.**

Note that the bias term θ₀ **is not regularized** (the sum starts at i = 1, not 0). If we define **w** as the vector of feature weights (θ₁ to θₙ), then **the regularization term is simply equal to** $\frac{1}{2}\left(\|\mathbf{w}\|_2\right)^2$, where $\|\cdot\|_2$ represents the ℓ2 norm of the weight vector. https://blog.csdn.net/Linli522362242/article/details/103387527

For Gradient Descent, just add 2**αw** to the MSE gradient vector (Equation 4-6); this corresponds to the penalty written without the ½ factor (with the ½ factor, the added term is simply **αw**). The same penalty term can equally be added to RSS (Residual Sum of Squares) instead of MSE.

########################################

WARNING

It is important to scale the data (e.g., using a **StandardScaler**) **before** performing **Ridge Regression, as it is sensitive to the scale of the input features.** This is true of most regularized models.

########################################

```
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
m = 20                                      # number of instances
X = 3 * np.random.rand(m, 1)                # one feature, in the range (0, 3)
y = 1 + 0.5*X + np.random.randn(m, 1)/1.5   # linear target plus noise
X_new = np.linspace(0, 3, 100).reshape(100, 1)

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "y--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline([
                ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                ("std_scaler", StandardScaler()),
                ("regul_reg", model),  # regularized regression
            ])
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 5 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8, 4))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)  # plain Ridge models -> linear predictions
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
plt.title("Figure 4-17. Ridge Regression")
plt.show()
```

If **α = 0 **then Ridge Regression is just **Linear Regression**.

If **α is very large**, then **all weights end up**** very close to zero**** **and **the result is a flat line going through the data’s mean**

Figure 4-17 shows several Ridge models trained on some linear data using different α value. On the left, **plain Ridge models** are used, leading to **linear predictions**. On the right, the data is first expanded using **PolynomialFeatures**(degree=10), then it is scaled using a **StandardScaler**, and finally the **Ridge models are applied** to the resulting features: this is Polynomial Regression with Ridge regularization. Note how **increasing α leads to flatter **(i.e., less extreme, more reasonable) predictions; this **reduces the model’s variance but increases its bias**.

As with **Linear Regression**, we can perform Ridge Regression either by **computing a closed-form equation** or by performing **Gradient Descent**. The pros and cons are the same. Equation 4-9 shows the closed-form solution, where **A** is the (n + 1) × (n + 1) **identity matrix** (a square matrix full of 0s except for 1s on the main diagonal, top-left to bottom-right) except with a **0 in the top-left cell**, **corresponding to the bias term**.

Equation 4-9. Ridge Regression closed-form solution

$$\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^T \mathbf{X} + \alpha \mathbf{A} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

Note: this is the regularized form of the **Normal Equation** (**a closed-form equation**).
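To see the shrinkage concretely, here is a toy sketch of Equation 4-9 (my own example, not from the book): for a single centered feature with no bias term, the equation reduces to the scalar θ = Σxᵢyᵢ / (Σxᵢ² + α), so larger α pulls θ toward 0:

```python
def ridge_scalar(xs, ys, alpha):
    # scalar specialization of (X^T X + alpha*A)^(-1) X^T y, single feature, no bias
    return sum(x*y for x, y in zip(xs, ys)) / (sum(x*x for x in xs) + alpha)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # y = 2x exactly
print(ridge_scalar(xs, ys, alpha=0.0))      # 2.0 -- alpha = 0 gives plain Linear Regression
print(ridge_scalar(xs, ys, alpha=14.0))     # 1.0 -- a large alpha shrinks the weight toward 0
```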

Here is how to perform Ridge Regression with Scikit-Learn using a **closed-form solution** (a variant of Equation 4-9 using a matrix factorization technique by André-Louis Cholesky):

```
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X,y)
ridge_reg.predict([[1.5]])
```
Ridge can also be solved iteratively; here is the same model using the "sag" solver (Stochastic Average Gradient descent):

```
ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X,y)
ridge_reg.predict([[1.5]])
```

And using **Stochastic Gradient Descent**. **Note**: to be future-proof, we set `max_iter=1000` and `tol=1e-3`, because these will be the default values in Scikit-Learn 0.21.

```
from sklearn.linear_model import SGDRegressor

# note the lowercase "l" in "l2"
sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
```

`sgd_reg.predict([[1.5]])`

The **penalty** hyperparameter sets **the type of regularization term** to use. Specifying "**l2**" indicates that you want SGD to add a regularization term to the cost function equal to **half the square of the ℓ2 norm of the weight vector**: this is simply Ridge Regression.

**This forces the learning algorithm to not only fit the data but also keep the model weights *as small as possible*.**

### Lasso Regression

*Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression)* is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but** it uses the ℓ1 norm** of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).

Equation 4-10. Lasso Regression cost function

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \left| \theta_i \right|$$

Equivalently, RSS (Residual Sum of Squares) can be used instead of MSE, with the same ℓ1 penalty term added.

Figure 4-18 shows the same thing as Figure 4-17 but replaces Ridge models with Lasso models and uses **smaller α values**.

```
from sklearn.linear_model import Lasso
plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)
plt.title("Figure 4-18. Lasso Regression")
plt.show()
```

An important characteristic of **Lasso Regression** is that it **tends to completely eliminate the weights of the least important features (i.e., set them to zero)**. For example, the dashed line in the right plot of Figure 4-18 (**with α = 10⁻⁷**) looks roughly quadratic, almost linear (compare Figure 4-17): **all the weights for the high-degree polynomial features are equal to zero**. In other words, Lasso Regression **automatically performs feature selection and outputs a sparse model** (i.e., **with *few* nonzero feature weights**).
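A minimal sketch of *why* the ℓ1 penalty zeroes out small weights (this is the standard soft-thresholding view of the coordinate-wise lasso update, not code from the book; the helper name is mine): for a weight whose unregularized value is z and penalty strength α, the update is sign(z)·max(|z| − α, 0):

```python
def soft_threshold(z, alpha):
    # coordinate-wise lasso update: weights smaller than alpha are clipped exactly to zero
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

print(soft_threshold(2.0, 0.5))    # 1.5  -- a large weight is shrunk but kept
print(soft_threshold(0.3, 0.5))    # 0.0  -- a small weight is eliminated entirely
print(soft_threshold(-2.0, 0.5))   # -1.5 -- shrinkage is symmetric
```

Ridge's quadratic penalty, by contrast, only scales weights down and never sets them exactly to zero.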

In the code below:

- `J` is the unregularized MSE cost evaluated at each point (θ₁, θ₂) of a grid;
- `N1` is the ℓ1 norm at each point: |θ₁| + |θ₂|;
- `N2` is the ℓ2 norm at each point: sqrt(θ₁² + θ₂²);
- `t_init` is the initial value of θ, the starting point of the Gradient Descent paths.
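The two norms can be sketched in plain Python (helper names are mine) before the vectorized NumPy version used below:

```python
def l1_norm(v):
    # sum of absolute values: |v1| + |v2| + ...
    return sum(abs(x) for x in v)

def l2_norm(v):
    # Euclidean length: sqrt(v1^2 + v2^2 + ...)
    return sum(x * x for x in v) ** 0.5

print(l1_norm([3.0, -4.0]))  # 7.0
print(l2_norm([3.0, -4.0]))  # 5.0
```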

```
t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5
# ignoring the bias term
theta1s = np.linspace(t1a, t1b, 500)  # 500 points
theta2s = np.linspace(t2a, t2b, 500)  # 500 points
theta1, theta2 = np.meshgrid(theta1s, theta2s)
# theta1: shape (500, 500), theta1s copied along the theta2 axis
# theta2: shape (500, 500), theta2s copied along the theta1 axis
theta = np.c_[theta1.ravel(), theta2.ravel()]  # shape (250000, 2): (theta1, theta2) pairs
Xr = np.array([
    [-1, 1],     # [feature1, feature2]
    [-0.3, -1],  # [feature1, feature2]
    [1, 0.1],    # [feature1, feature2]
])
# labels: yi = 2 * Xi1 + 0.5 * Xi2
yr = 2*Xr[:, :1] + 0.5*Xr[:, 1:]  # labels stored as a column vector, shape (3, 1)
# MSE at every grid point: shape (250000, 3) minus shape (1, 3), averaged over the 3 instances
J = (1/len(Xr) * np.sum((theta.dot(Xr.T) - yr.T)**2, axis=1)).reshape(theta1.shape)  # MSE on the (500, 500) grid
N1 = np.linalg.norm(theta, ord=1, axis=1).reshape(theta1.shape)  # L1 norm: |theta_1| + |theta_2|
N2 = np.linalg.norm(theta, ord=2, axis=1).reshape(theta1.shape)  # L2 norm: sqrt(theta_1**2 + theta_2**2)
# np.argmin(J) gives the minimum's index in the flattened J
theta_min_idx = np.unravel_index(np.argmin(J), J.shape)  # the index in the original (unflattened) J
theta1_min, theta2_min = theta1[theta_min_idx], theta2[theta_min_idx]  # the minimizing theta values
# theta1_min, theta2_min  # (1.9979959919839678, 0.5020040080160322), close to the true (2, 0.5)
t_init = np.array([[0.25],  # theta1
                   [-1]])   # theta2: the starting point
```

**Lasso (sub)gradient vector:** $\nabla_{\boldsymbol{\theta}} J = \nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) + \alpha\,\mathrm{sign}(\boldsymbol{\theta})$

**Ridge gradient vector:** $\nabla_{\boldsymbol{\theta}} J = \nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) + 2\alpha\boldsymbol{\theta}$

Here the penalty terms are written without the ½ factor that is sometimes included for convenient computation. As with Ridge, it is important to scale the data **before** performing Lasso Regression, as it is sensitive to the scale of the input features.

```
# batch gradient descent path, with optional L1 and L2 penalties
def bgd_path(theta, X, y, l1_alpha, l2_alpha, core=1, eta=0.1, n_iterations=50):
    path = [theta]
    for iteration in range(n_iterations):
        # plain MSE gradient: 2/m * X.T.dot(X.dot(theta) - y), plus the penalty gradients
        gradients = core * 2/len(X) * X.T.dot(X.dot(theta) - y) \
                    + l1_alpha*np.sign(theta) + 2*l2_alpha*theta
        theta = theta - eta*gradients
        path.append(theta)
    return np.array(path)
```
```

Recall Equation 4-8, the Ridge Regression cost function: $J(\theta) = \mathrm{MSE}(\theta) + \alpha\sum_{i=1}^{n}\theta_i^2$

**Note: the ½ factor is only for convenient computation; it is dropped here, which is why the gradient in `bgd_path` uses `2*l2_alpha*theta`.**

```
plt.figure(figsize=(12, 8))
# Lasso starts from an l1 penalty, Ridge from an l2 penalty
for i, N, l1_alpha, l2_alpha, title in ((0, N1, 0.5, 0, "Lasso"), (1, N2, 0, 0.1, "Ridge")):
    # J: unregularized MSE cost function; JR: regularized MSE cost function
    JR = J + l1_alpha*N1 + l2_alpha*N2**2
    # the minimum of (theta1, theta2) for the regularized cost function
    thetaR_min_idx = np.unravel_index(np.argmin(JR), JR.shape)
    theta1R_min, theta2R_min = theta1[thetaR_min_idx], theta2[thetaR_min_idx]
    # contour levels, spaced exponentially between min and max
    # (inverse of min-max scaling: x = x_new*(max - min) + min)
    levelsJ = (np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(J) - np.min(J)) + np.min(J)
    levelsJR = (np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(JR) - np.min(JR)) + np.min(JR)
    levelsN = np.linspace(0, np.max(N), 10)
    path_J = bgd_path(t_init, Xr, yr, l1_alpha=0, l2_alpha=0)  # unregularized MSE cost function (alpha = 0)
    path_JR = bgd_path(t_init, Xr, yr, l1_alpha, l2_alpha)     # regularized MSE cost function
    path_N = bgd_path(t_init, Xr, yr, np.sign(l1_alpha)/3, np.sign(l2_alpha), core=0)  # penalty only
    plt.subplot(221 + i*2)
    plt.grid(True)
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    # background contours (ellipses): the unregularized MSE cost function (alpha = 0)
    plt.contourf(theta1, theta2, J, levels=levelsJ, alpha=0.9)
    # foreground contours: the penalty (diamonds for l1, circles for l2)
    plt.contour(theta1, theta2, N, levels=levelsN)
    plt.plot(path_J[:, 0], path_J[:, 1], "w-o")  # white circles: the BGD path for the cost function
    plt.plot(path_N[:, 0], path_N[:, 1], "y-^")  # yellow triangles: the BGD path for the penalty only (alpha -> inf)
    plt.plot(theta1_min, theta2_min, "bs")       # the unregularized minimum of (theta1, theta2)
    plt.title(r"$\ell_{}$ penalty".format(i+1), fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])
    if i == 1:
        plt.xlabel(r"$\theta_1$", fontsize=20)
    plt.ylabel(r"$\theta_2$", fontsize=20, rotation=0)
    plt.subplot(222 + i*2)
    plt.grid(True)
    plt.axhline(y=0, color="k")
    plt.axvline(x=0, color="k")
    # JR: the regularized MSE cost function
    plt.contourf(theta1, theta2, JR, levels=levelsJR, alpha=0.9)
    plt.plot(path_JR[:, 0], path_JR[:, 1], "w-o")  # white circles: BGD path for the regularized cost function
    plt.plot(theta1R_min, theta2R_min, "bs")       # the regularized minimum of (theta1, theta2)
    plt.title(title, fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])
    if i == 1:
        plt.xlabel(r"$\theta_1$", fontsize=20)
plt.show()
```

You can get a sense of why this is the case by looking at above Figure: on the **top-left** plot, the background contours (ellipses) represent an unregularized MSE cost function (α = 0), and the **white circles show the Batch Gradient Descent path with that cost function. **The foreground contours (diamonds) represent the ℓ1 penalty, and the yellow triangles show the BGD path for this penalty only (α → ∞). Notice how the path first reaches θ1 = 0, then rolls down a gutter until it reaches θ2 = 0.

On **the top-right** plot, the contours represent the same cost function plus an ℓ1 penalty with α = 0.5. **The global minimum is on the θ2 = 0 axis** (this is how Lasso Regression **automatically performs feature selection and outputs a sparse model**, i.e., **with few nonzero feature weights**). **BGD first reaches θ2 = 0**, then rolls down the gutter until it reaches the global minimum.

The two bottom plots show the same thing but use an ℓ2 penalty (with α = 0.1) instead. The regularized minimum is closer to **θ** = **0** than the unregularized minimum, **but (with Ridge) the weights never get fully eliminated**.

################################

TIP

On the **Lasso cost function**, the **BGD path** tends to **bounce** across the gutter toward the end. This is because the slope changes abruptly at **θ2 = 0**. You **need to gradually reduce the learning rate in order to actually converge to the global minimum.**

################################

The** Lasso cost function is not differentiable无法进行微分运算 at θi = 0** (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g instead when any θi = 0. Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent with the Lasso cost function.

Equation 4-11. Lasso Regression subgradient vector

$$g(\boldsymbol{\theta}, J) = \nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) + \alpha \begin{pmatrix} \mathrm{sign}(\theta_1) \\ \mathrm{sign}(\theta_2) \\ \vdots \\ \mathrm{sign}(\theta_n) \end{pmatrix} \quad \text{where } \mathrm{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$$

Here is a small Scikit-Learn example using the Lasso class. Note that you could instead use an SGDRegressor(penalty="l1").

```
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X,y)
```

```
lasso_reg.predict([[1.5]])
# lasso_reg.intercept_, lasso_reg.coef_  # (array([1.14537356]), array([0.26167212]))
```

`# 1.14537356 + 0.26167212*1.5 = 1.53788174`

```
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l1", eta0=0.5, random_state=42)
sgd_reg.fit(X,y)
```

`sgd_reg.predict([[1.5]])#sgd_reg.intercept_, sgd_reg.coef_ #(array([0.94102563]), array([0.4133915]))`

`# 0.94102563 + 0.4133915 * 1.5 = 1.56111288`

### Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. **When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression** (see Equation 4-12).

Equation 4-12. Elastic Net cost function

$$J(\theta) = \mathrm{MSE}(\theta) + r\alpha \sum_{i=1}^{n} \left| \theta_i \right| + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2$$
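As a quick numeric check of the mix ratio in Equation 4-12's penalty term (plain Python; the helper name is mine): with r = 1 only the ℓ1 term remains (Lasso), and with r = 0 only the ℓ2 term remains (Ridge):

```python
def elastic_net_penalty(weights, alpha, r):
    # penalty term of Equation 4-12: r*alpha*l1 + (1-r)/2*alpha*l2_squared
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return r * alpha * l1 + (1 - r) / 2 * alpha * l2_squared

weights = [1.0, -2.0]                            # l1 = 3, l2_squared = 5
print(elastic_net_penalty(weights, 1.0, 1.0))    # 3.0 -- pure Lasso penalty
print(elastic_net_penalty(weights, 1.0, 0.0))    # 2.5 -- pure Ridge penalty
```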

So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? **It is almost always preferable to have at least a little bit of regularization**, so generally you should avoid plain Linear Regression. Ridge is a good default, but **if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net, since they tend to reduce the useless features' weights down to zero, as we have discussed**. In general, **Elastic Net is preferred over Lasso**, since **Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated**.

Here is a short example using Scikit-Learn’s ElasticNet (**l1_ratio corresponds to the mix ratio r**):

```
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X,y)
```

`elastic_net.predict([[1.5]])#elastic_net.intercept_, elastic_net.coef_ # (array([1.08639303]), array([0.30462619]))`
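As a quick sanity check of the r = 1 case mentioned above, ElasticNet with l1_ratio=1.0 should match Lasso at the same alpha. A sketch on synthetic data (the chapter's X and y are defined earlier):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

# l1_ratio=1.0 makes the Elastic Net penalty purely L1, i.e., a Lasso
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
assert np.allclose(enet.coef_, lasso.coef_, atol=1e-6)
assert np.isclose(enet.intercept_, lasso.intercept_, atol=1e-6)
```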

### Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to **stop training as soon as the validation error reaches a minimum**. This is called **early stopping**. Figure 4-20 shows a complex model (in this case a high-degree Polynomial Regression model) being trained using Batch Gradient Descent. As the epochs go by, the algorithm learns, and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, **after a while the validation error stops decreasing and actually starts to go back up**. This indicates that **the model has started to overfit the training data**. With early stopping you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch."

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)
X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
# transform() reuses the mean and variance learned by StandardScaler during fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

# Note that with warm_start=True, when the fit() method is called, it just continues
# training where it left off instead of restarting from scratch.
sgd_reg = SGDRegressor(max_iter=1, tol=-np.inf, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)
n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_train_predict = sgd_reg.predict(X_train_poly_scaled)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    train_errors.append(mean_squared_error(y_train, y_train_predict))
    val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])
plt.annotate("Best model",
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor="black", shrink=0.05),
             fontsize=16,
             )
best_val_rmse -= 0.03  # just to make the graph look better (shift the horizontal black line down a bit)
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)  # horizontal black line
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
plt.title("Figure 4-20. Early stopping regularization")
plt.show()
```

#################################

TIP: With Stochastic and Mini-batch Gradient Descent the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. **One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.**

#################################

Here is a basic implementation of early stopping:

```
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
# transform() reuses the mean and variance learned by StandardScaler during fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

# Note that with warm_start=True, when the fit() method is called, it just continues
# training where it left off instead of restarting from scratch.
sgd_reg = SGDRegressor(max_iter=1, tol=-np.inf, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the fitted weights; sklearn.base.clone would
        # copy only the hyperparameters, returning an unfitted model
        best_model = deepcopy(sgd_reg)
```

`best_epoch, best_model`
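The rollback idea from the earlier tip (stop only once the validation error has stayed above its minimum for a while) can be sketched as a small patience rule; the function name and the error values below are made up for illustration:

```python
def best_epoch_with_patience(val_errors, patience=5):
    """Track the best validation error seen so far and stop once it has
    not improved for `patience` consecutive epochs; return the epoch to
    roll the model parameters back to, plus its error."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # confident the model will not do any better
    return best_epoch, best_error

# Noisy validation curve: minimum at epoch 3, then it drifts back up
errors = [1.0, 0.8, 0.7, 0.65, 0.66, 0.7, 0.68, 0.72, 0.75, 0.8]
print(best_epoch_with_patience(errors))  # (3, 0.65)
```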

### Logistic Regression

Some regression algorithms can be used for classification as well (and vice versa). Logistic Regression (also called Logit Regression) is commonly used to **estimate the probability that an instance belongs to a particular class** (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled "1"); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.
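The 50% threshold described above can be observed directly in Scikit-Learn: predict() agrees with thresholding predict_proba() at 0.5. A minimal sketch on made-up one-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data (labels from an arbitrary threshold)
np.random.seed(42)
X = np.random.randn(100, 1)
y = (X[:, 0] > 0.2).astype(int)

log_reg = LogisticRegression().fit(X, y)
proba = log_reg.predict_proba(X)[:, 1]  # estimated P(y=1 | x)
# The predicted class is 1 exactly when that probability is >= 0.5
assert np.array_equal(log_reg.predict(X), (proba >= 0.5).astype(int))
```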

### Estimating Probabilities

So how does it work? Just like a Linear Regression model, a **Logistic Regression model computes a weighted sum of the input features (plus a bias term)**, but instead of outputting the result directly like the Linear Regression model does, **it outputs the logistic of this result** (see Equation 4-13).

Equation 4-13. Logistic Regression model estimated probability (vectorized form)

$\hat{p} = h_{\theta}(\mathbf{x}) = \sigma(\mathbf{\theta}^T \mathbf{x})$

The logistic, noted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. It is defined as shown in Equation 4-14 and Figure 4-21. (The logit, log(p / (1 − p)), is actually the inverse of the logistic function, not another name for it.)

Equation 4-14. Logistic function

$\sigma(t) = \dfrac{1}{1 + e^{-t}}$

```
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-10, 10, 100)
sigmoid = 1/( 1+np.exp(-t) )
plt.figure(figsize=(9,3))
plt.plot([-10, 10], [0,0], 'k-') #horizontal line/axis
plt.plot([-10, 10], [0.5, 0.5], "k:") #horizontal dotted line y=0.5
plt.plot([-10, 10], [1,1], "k:") #horizontal dotted line y=1.0
plt.plot([0,0], [-1.1, 1.1], "k-") #vertical line x=0.0
plt.plot(t, sigmoid, "b-", linewidth=2, label=r"$\sigma(t)=\frac{1}{1+e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10,10, -0.1,1.1])
plt.title("Figure 4-21. Logistic function")
plt.show()
```

Once the Logistic Regression model has estimated the probability that an instance x belongs to the positive class, it can make its prediction ŷ easily (see Equation 4-15).

Equation 4-15. Logistic Regression model prediction

$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$

Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if the score θᵀx is positive, and 0 if it is negative.

Note: $\hat{p} = \sigma(t)$, where $t = \mathbf{\theta}^T \mathbf{x}$.
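The sign argument above can be checked numerically; this sketch just evaluates the logistic function on a grid:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

t = np.linspace(-10, 10, 1001)  # grid that includes t = 0 exactly
# sigma(t) >= 0.5 holds exactly when t >= 0, so the model predicts the
# positive class iff the score theta^T x is non-negative
assert sigmoid(0) == 0.5
assert np.array_equal(sigmoid(t) >= 0.5, t >= 0)
```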

### Training and Cost Function

Good, now you know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to **set the parameter vector θ** so that the model **estimates high probabilities** for positive instances (y =1) and **low probabilities **for negative instances (y = 0).** This idea is captured by the cost function** shown in Equation 4-16 for a single training instance x.

Equation 4-16. Cost function of a single training instance

$c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$

OR, equivalently:

$c(\theta) = -\left[ y \log(\hat{p}) + (1 - y) \log(1 - \hat{p}) \right]$

This cost function makes sense because **−log(t) grows very large when t approaches 0** (e.g., with base-10 logs, −log(0.1) = 1 and −log(0.0000000001) = 10), so the cost will be large if the model estimates a probability *close to 0* for a positive instance (y = 1), and it will also be very large if the model estimates a probability *close to 1* for a negative instance (y = 0). On the other hand, **−log(t) is close to 0 when t is close to 1**, so **the cost will be close to 0 if the estimated probability is close to 0 for a negative instance (y = 0), or close to 1 for a positive instance (y = 1)**, which is precisely what we want.
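This behavior can be verified with the per-instance cost from Equation 4-16 (natural logs here; the probability values are arbitrary):

```python
import numpy as np

def instance_cost(p_hat, y):
    # Equation 4-16: -log(p_hat) if y == 1, else -log(1 - p_hat)
    return -np.log(p_hat) if y == 1 else -np.log(1 - p_hat)

# Positive instance: the cost blows up as the estimated probability -> 0
assert instance_cost(0.99, y=1) < 0.02   # -ln(0.99) ~ 0.01
assert instance_cost(0.01, y=1) > 4.6    # -ln(0.01) ~ 4.61
# Negative instance: the cost blows up as the estimated probability -> 1
assert instance_cost(0.01, y=0) < 0.02
assert instance_cost(0.99, y=0) > 4.6
```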

The cost function over the whole training set is simply **the average cost over all training instances**. It can be written in a single expression (as you can verify easily), called the **log loss**, shown in Equation 4-17.

Equation 4-17. Logistic Regression cost function (log loss)

$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$
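Averaging the per-instance costs gives the log loss; it can be checked against Scikit-Learn's log_loss metric (the labels and probabilities below are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # estimated P(y=1) per instance

# Equation 4-17: the average per-instance cost over the training set
manual = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
assert np.isclose(manual, log_loss(y, p_hat))
```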

The bad news is that there is **no known closed-form equation** to compute the value of **θ** that **minimizes this cost function** (there is no equivalent of the Normal Equation). But the good news is that this **cost function is convex**, so **Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum** (if the learning rate is not too large and you wait long enough). The partial derivative of the cost function with regard to the jth model parameter θⱼ is given by Equation 4-18.

Equation 4-18. Logistic cost function partial derivatives

$\dfrac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m}\sum_{i=1}^{m}\left( \sigma\left(\mathbf{\theta}^T \mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$


This equation (Equation 4-18) looks very much like Equation 4-5: for each instance it computes **the prediction error** and multiplies it by the **jth feature value**, and then it computes the average over all training instances. Once you have the gradient vector containing **all the partial derivatives**, you can use it in the Batch Gradient Descent algorithm. That's it: you now know how to train a Logistic Regression model. For Stochastic GD you would of course take one instance at a time, and for Mini-batch GD you would use a mini-batch at a time.
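Equation 4-18 can be verified against a numerical gradient of the log loss; a minimal sketch with arbitrary data and parameters:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def log_loss_cost(theta, X, y):
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # Equation 4-18: (1/m) * sum_i (sigma(theta^T x_i) - y_i) * x_ij
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

np.random.seed(42)
X = np.c_[np.ones(20), np.random.randn(20, 2)]  # bias column + 2 features
y = (np.random.rand(20) > 0.5).astype(float)
theta = np.random.randn(3)

# Each partial derivative should match a central finite difference
eps = 1e-6
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    numeric = (log_loss_cost(theta + d, X, y) - log_loss_cost(theta - d, X, y)) / (2 * eps)
    assert np.isclose(gradient(theta, X, y)[j], numeric, atol=1e-5)
```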

### Decision Boundaries
