残差可视化

简介

在计算模型中残差对于鉴别总括模型的神秘模型是一个拾叁分第1的因素。比如:线性回归模型就须求残差是同方差的,假诺不是,则阐明该模型不太准确,有大概是非线性数据。本文基于回归模型使用不一致的艺术对残差举办可视化。

关系知识点

  • 熟练残差的定义和求解进度;
  • 会采用汉兰达语言进行回归(线性和逻辑)分析,并且利用ggplot2进展可视化,会利用dplyr和tidyr包进行多少操作。

已能展现的可视化消息

在奇骏中能够直接动用plot()绘制合适的回归模型。上面是用mtcars数据集进行操作,对每英里加仑数(miles per gallon for each
car (mpg))和力气(horsepower
(hp))进行回归分析并可视化模型和残差的音信。

代码:

fit <- lm(mpg ~ hp, data = mtcars)  # 拟合模型
summary(fit)  # 结果报告
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.7121 -2.1122 -0.8854  1.5819  8.2360 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
#> hp          -0.06823    0.01012  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit)  # Plot the model information

图片 1

par(mfrow = c(1, 1))  # Return plotting panel to 1 section

地点的模子使用的古板的点子来表达残差项并解释模型是还是不是享有失常态。未来,我们着想对图片做一些转移(和更俱视觉魅力)来展开填空。

步骤

  1. 找到二个回归模型来预测变量Y;
  2. 取得预测值和残差值与Y的各个观测值的关联;
  3. 对Y
    的实际值和预测值举行可视化,以至于他们可分别,但又有所联系;
  4. 运用残差做美学调整(例如:用海螺红代表残差格外高的值)来高亮模型的较差预测的点。

不难线性回归

Step1:fit the model拟合模型

率先,我们血必要拟合大家的模型。在本实例中,复制mtcars数据集到新的目的d中,如下边的操作:

d <- mtcars
fit <- lm(mpg ~ hp, data = d)
summary(fit)

## Call:
##   lm(formula = mpg ~ hp, data = d)
## 
## Residuals:
##   Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
##   hp          -0.06823    0.01012  -6.742 1.79e-07 ***
##   ---
##   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Step2:obtain predicted and residual calues拿到预测值和残差值

d$predicted <- predict(fit)   # 保存预测值
d$residuals <- residuals(fit) # 保存残差值

# 使用dplyr包快速查看实际值、预测值和残差值
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#>                    mpg predicted  residuals
#> Mazda RX4         21.0  22.59375 -1.5937500
#> Mazda RX4 Wag     21.0  22.59375 -1.5937500
#> Datsun 710        22.8  23.75363 -0.9536307
#> Hornet 4 Drive    21.4  22.59375 -1.1937500
#> Hornet Sportabout 18.7  18.15891  0.5410881
#> Valiant           18.1  22.93489 -4.8348913

Step3:plot the actual and predicted calues可视化实际值和预测值

绘制实际值散点图:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +  # 设置画布与结果变量y轴
  geom_point()  # 绘实际值散点

图片 2

绘图预测值散点图:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)  # Add the predicted values

图片 3

上边的图很难看出实际值和预测值关联,使用geom_segment()可以将实际值点和他么的预测值进行相应:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)

图片 4

按照ggplot2的部分参数举行末了的调整,包含:

  • 宗旨,清除全部外观theme_bw();
  • 经过调整alpha光滑度淡出连接线;
  • 使用geom_smooth()添加回归斜率。

    library(ggplot2)
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) + # Plot regression slope
    geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # alpha to fade lines
    geom_point() +
    geom_point(aes(y = predicted), shape = 1) +
    theme_bw() # Add theme for cleaner look

图片 5

Setp4:use residuals to adjust使用残差调整

终极,大家盼望经过残差的值来高亮其值得大小。有很各种措施可以直达目标。上边的代码给出了解决方案:

# ALPHA
# 同过残差的绝对值来改变实际值得alpha值
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Alpha adjustments made here...
  geom_point(aes(alpha = abs(residuals))) +  # Alpha mapped to abs(residuals)
  guides(alpha = FALSE) +  # Alpha legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 6

 

可以看来,残差值越小,散点就越透明。

 

# COLOR
# High residuals (in abolsute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = abs(residuals))) + # Color mapped to abs(residuals)
  scale_color_continuous(low = "black", high = "red") +  # Colors to use here
  guides(color = FALSE) +  # Color legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 7

# SIZE AND COLOR
# Same coloring as above, size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color AND size adjustments made here...
  geom_point(aes(color = abs(residuals), size = abs(residuals))) + # size also mapped
  scale_color_continuous(low = "black", high = "red") +
  guides(color = FALSE, size = FALSE) +  # Size legend also removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 8

# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = residuals)) +  # Color mapped here
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
  guides(color = FALSE) +

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 9

最后多个实例格外实用,因为颜料可以协助鲜明数据的非线性。如例子中所示,从图中可以窥见hp较多的深象牙白的极值(实际值比预测值大)。而在hp的中等值部分却有越多的铅灰,声明那些点处的实际值小于预测值。总的来说,申明七个变量间的关联是非线性的,在回归方程中添加1次项大概会更好的建模。

多元回归

该实例中,咱们接纳马力(hp)、重量(wt)和移动(disp)对每英里加仑数据(mpg)进行回归:

# 选择感兴趣的数据
library(dplyr)
d <- mtcars %>% select(mpg, hp, wt, disp)

# 利用多元回归拟合模型
fit <- lm(mpg ~ hp + wt+ disp, data = d)

# 获得预测值和残差值
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

head(d)
#>                    mpg  hp    wt disp predicted  residuals
#> Mazda RX4         21.0 110 2.620  160  23.57003 -2.5700299
#> Mazda RX4 Wag     21.0 110 2.875  160  22.60080 -1.6008028
#> Datsun 710        22.8  93 2.320  108  25.28868 -2.4886829
#> Hornet 4 Drive    21.4 110 3.215  258  21.21667  0.1833269
#> Hornet Sportabout 18.7 175 3.440  360  18.24072  0.4592780
#> Valiant           18.1 105 3.460  225  20.47216 -2.3721590

绘图实际值和预测值得散点图,并免除背景和将其总是:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +  # Lines to connect points
  geom_point() +  # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +  # Points of predicted values
  theme_bw()

图片 10

相同,我们可以利用残差对图纸进行优化,以便于了然:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 11

到近来为止,代码同前基本没有多大的更改,不过,由于是多元回归,预测值不再是线性整齐的排列了。

一回使用七个预测椅子绘图

绘制三个独门的变量是毋庸置疑的,然而,多元回归的目标是讨论两个变量。为了可以同时使用多少个预测因子,那里有个技巧,就是运用tidyr包中的gather()来将宽数据变成长数据,多列变为一列,然后选择ggplot中的facet_*()分面来将面板分开显示。

万一想要了解越来越多地点的学识可以点击上边的接连举办阅读:

Plotting background data for groups with
ggplot2

Plot some variables against many others with tidyr and
ggplot2

Quick plot of all
variables

OK,未来大家来兑现地点的渴求:

library(tidyr)
d %>% 
  gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>%  # 数据整形,将hp wt disp变为一列
  ggplot(aes(x = x, y = mpg)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()

图片 12

现行大家可以试一下其余的数据集,比如,iris数据集,利用除了Sepal.Width变量的别样再三再四型变量来回归Sepal.Width变量:

head(iris)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

d <- iris %>% select(-Species)

# Fit the model
fit <- lm(Sepal.Width ~ ., data = iris)

# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

# Create plot
d %>% 
  gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = Sepal.Width)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +
  theme_bw()

图片 13

通过上边的图可以比较变量的实际值和预测值。值得注意的是有颜色的点为实际值,灰度标记的则为预测值。毫无意外,比较于前方单一的回归,多元回归的预测值相对于实际值的变迁收缩了。

逻辑回归

现行扩大到逻辑回归,须要部分为主的工作流(前面讲过的步调),然后要求领取相应变量的预测值和残差值。上面的实例利用hp来预测V/S(vs),vs为二值成分,唯有0和1:

# Step 1: Fit the data
d <- mtcars
head(d)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  predicted
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.69769525
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.69769525
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.88099419
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.69769525
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.02608156
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.76479499
##                     residuals
## Mazda RX4         -0.69769525
## Mazda RX4 Wag     -0.69769525
## Datsun 710         0.11900581
## Hornet 4 Drive     0.30230475
## Hornet Sportabout -0.02608156
## Valiant            0.23520501

fit <- glm(vs ~ hp, family = binomial(), data = d)

# Step 2: Obtain predicted and residuals
d$predicted <- predict(fit, type="response")
d$residuals <- residuals(fit, type = "response")

# Steps 3 and 4: plot the results
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 14

一旦大家只是想要标记那么些预测错误的档次,能够应用dplyr中的filter()函数已毕:

ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point() +

  # > This plots large red circle on misclassified points
  geom_point(data = d %>% filter(vs != round(predicted)),
             color = "red", size = 2) +

  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 15

扩展:使用broom包

选拔broom包能够简化以上步骤,将步骤1和步骤2合并。broom包的功效是可以将计算分析对象转换到Odyssey中整齐的数据框结构。在本实例中,函数augment()的职能是将拟合好的回归模型转换到为可用的预测值和残差值得数据框,然而,必要小心以下几点:

  • 预测值和残差值的命名变为.fitted和.resid
  • 大家还会获取额外的其余的变量,这么些变量须要删除,可以动用gather()函数和facet_*()函数实图片 16

    # install.packages(‘broom’)
    library(broom)
    library(dplyr)

    # Steps 1 and 2
    d <- lm(mpg ~ hp, data = mtcars) %>%
    augment()

    head(d)
    #> .rownames mpg hp .fitted .se.fit .resid .hat
    #> 1 Mazda RX4 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 2 Mazda RX4 Wag 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 3 Datsun 710 22.8 93 23.75363 0.8726286 -0.9536307 0.05102911
    #> 4 Hornet 4 Drive 21.4 110 22.59375 0.7772744 -1.1937500 0.04048627
    #> 5 Hornet Sportabout 18.7 175 18.15891 0.7405479 0.5410881 0.03675069
    #> 6 Valiant 18.1 105 22.93489 0.8026728 -4.8348913 0.04317538
    #> .sigma .cooksd .std.resid
    #> 1 3.917367 0.0037426122 -0.4211862
    #> 2 3.917367 0.0037426122 -0.4211862
    #> 3 3.924793 0.0017266396 -0.2534156
    #> 4 3.922478 0.0020997190 -0.3154767
    #> 5 3.927667 0.0003885555 0.1427178
    #> 6 3.820288 0.0369380489 -1.2795289

    # Steps 3 and 4
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) +
    geom_segment(aes(xend = hp, yend = .fitted), alpha = .2) + # Note .fitted
    geom_point(aes(alpha = abs(.resid))) + # Note .resid
    guides(alpha = FALSE) +
    geom_point(aes(y = .fitted), shape = 1) + # Note .fitted
    theme_bw()

图片 17

正文链接:http://www.cnblogs.com/homewch/p/5806405.html

初稿链接:https://drsimonj.svbtle.com/visualising-residuals