残差可视化

简介

当统计模型中残差对于鉴别统计模型的黑模型是一个分外重中之重之素。比如:线性回归模型就要求残差是同方差的,如果非是,则表明该模型不极端可靠,有或是非线性数据。本文基于回归模型使用不同之法子对残差进行可视化。

关系知识点

  • 深谙残差的定义与求解过程;
  • 见面使R语言进行回归(线性和逻辑)分析,并且使ggplot2展开可视化,会下dplyr和tidyr包进行数量操作。

一度能显示的可视化信息

以R中好直接动用plot()绘制合适的回归模型。下面是因此mtcars数据集进行操作,对每公里加仑数(miles per gallon for each
car (mpg))和力气(horsepower
(hp))进行回归分析并可视化模型和残差的消息。

代码:

fit <- lm(mpg ~ hp, data = mtcars)  # 拟合模型
summary(fit)  # 结果报告
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.7121 -2.1122 -0.8854  1.5819  8.2360 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
#> hp          -0.06823    0.01012  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit)  # Plot the model information

图片 1

par(mfrow = c(1, 1))  # Return plotting panel to 1 section

方的范使用的风俗的法来解释残差项并解释模型是否持有问题。现在,我们着想针对图片做有转移(和另行俱视觉吸引力)来展开补缺。

步骤

  1. 找到一个回归模型来预测变量Y;
  2. 取预测值和残差值与Y的每个观测值的关系;
  3. 针对Y
    的实际值和预测值进行可视化,以至于他们而别,但同时兼备联系;
  4. 应用残差做美学调整(例如:用红代表残差非常强之价值)来高亮模型的比差预测的接触。

简单线性回归

Step1:fit the model拟合模型

首先,我们血得拟合我们的模子。在本实例中,复制mtcars数据集至新的靶子d中,如下面的操作:

d <- mtcars
fit <- lm(mpg ~ hp, data = d)
summary(fit)

## Call:
##   lm(formula = mpg ~ hp, data = d)
## 
## Residuals:
##   Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
##   hp          -0.06823    0.01012  -6.742 1.79e-07 ***
##   ---
##   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Step2:obtain predicted and residual calues获得预测值和残差值

d$predicted <- predict(fit)   # 保存预测值
d$residuals <- residuals(fit) # 保存残差值

# 使用dplyr包快速查看实际值、预测值和残差值
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#>                    mpg predicted  residuals
#> Mazda RX4         21.0  22.59375 -1.5937500
#> Mazda RX4 Wag     21.0  22.59375 -1.5937500
#> Datsun 710        22.8  23.75363 -0.9536307
#> Hornet 4 Drive    21.4  22.59375 -1.1937500
#> Hornet Sportabout 18.7  18.15891  0.5410881
#> Valiant           18.1  22.93489 -4.8348913

Step3:plot the actual and predicted calues可视化实际值和预测值

制图实际值散点图:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +  # 设置画布与结果变量y轴
  geom_point()  # 绘实际值散点

图片 2

绘图预测值散点图:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)  # Add the predicted values

图片 3

面的图很不便看起实际值和展望值关联,使用geom_segment()可以用实际值点和他么的展望值进行相应:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)

图片 4

据悉ggplot2的局部参数进行末之调,包括:

  • 主题,清除全部外观theme_bw();
  • 透过调整alpha透明度淡出连接线;
  • 使用geom_smooth()添加回归斜率。

    library(ggplot2)
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) + # Plot regression slope
    geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # alpha to fade lines
    geom_point() +
    geom_point(aes(y = predicted), shape = 1) +
    theme_bw() # Add theme for cleaner look

图片 5

Setp4:use residuals to adjust使用残差调整

末尾,我们盼望通过残差的价来高显示其值得大小。有坏多种方法可以达成目的。下面的代码给出了缓解方案:

# ALPHA
# 同过残差的绝对值来改变实际值得alpha值
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Alpha adjustments made here...
  geom_point(aes(alpha = abs(residuals))) +  # Alpha mapped to abs(residuals)
  guides(alpha = FALSE) +  # Alpha legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 6

 

可以看出,残差值越小,散点就更是透明。

 

# COLOR
# High residuals (in abolsute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = abs(residuals))) + # Color mapped to abs(residuals)
  scale_color_continuous(low = "black", high = "red") +  # Colors to use here
  guides(color = FALSE) +  # Color legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 7

# SIZE AND COLOR
# Same coloring as above, size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color AND size adjustments made here...
  geom_point(aes(color = abs(residuals), size = abs(residuals))) + # size also mapped
  scale_color_continuous(low = "black", high = "red") +
  guides(color = FALSE, size = FALSE) +  # Size legend also removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 8

# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = residuals)) +  # Color mapped here
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
  guides(color = FALSE) +

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 9

终极一个实例非常实用,因为颜料会助确定数据的非线性。如例子中所显示,从图中可以窥见hp较多的红色的极值(实际值比预计值大)。而以hp的中游价值部分可有双重多之蓝色,表明这些点处的实际值小于预测值。总的来说,表明两个变量间的关系是非线性的,在回归方程中上加二蹩脚项或会见再次好之建模。

多元回归

该实例中,我们采取马力(hp)、重量(wt)和动(disp)对各个公里加仑数据(mpg)进行回归:

# 选择感兴趣的数据
library(dplyr)
d <- mtcars %>% select(mpg, hp, wt, disp)

# 利用多元回归拟合模型
fit <- lm(mpg ~ hp + wt+ disp, data = d)

# 获得预测值和残差值
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

head(d)
#>                    mpg  hp    wt disp predicted  residuals
#> Mazda RX4         21.0 110 2.620  160  23.57003 -2.5700299
#> Mazda RX4 Wag     21.0 110 2.875  160  22.60080 -1.6008028
#> Datsun 710        22.8  93 2.320  108  25.28868 -2.4886829
#> Hornet 4 Drive    21.4 110 3.215  258  21.21667  0.1833269
#> Hornet Sportabout 18.7 175 3.440  360  18.24072  0.4592780
#> Valiant           18.1 105 3.460  225  20.47216 -2.3721590

制图实际值和预测值得散点图,并排除背景以及以那个连续:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +  # Lines to connect points
  geom_point() +  # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +  # Points of predicted values
  theme_bw()

图片 10

无异于,我们可以下残差对图片进行优化,以便为明:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 11

至目前为止,代码和前中心无多异常的改变,可是,由于是多元回归,预测值不再是线性整齐的排了。

平等涂鸦利用多个预计椅子绘图

绘图一个单身的变量是毋庸置疑的,但是,多元回归的目的是钻多只变量。为了能够以采用多个预计因子,这里来只技术,就是运tidyr包中的gather()来拿富有数据化长数,多排成一列,然后以ggplot中之facet_*()分面来拿面板分开显示。

如果想要询问又多面的文化可以点击下面的连进行阅读:

Plotting background data for groups with
ggplot2

Plot some variables against many others with tidyr and
ggplot2

Quick plot of all
variables

OK,现在咱们来贯彻者的要求:

library(tidyr)
d %>% 
  gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>%  # 数据整形,将hp wt disp变为一列
  ggplot(aes(x = x, y = mpg)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()

图片 12

本咱们可尝试一下其他的数据集,比如,iris数据集,利用除了Sepal.Width变量的其余连续型变量来回归Sepal.Width变量:

head(iris)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

d <- iris %>% select(-Species)

# Fit the model
fit <- lm(Sepal.Width ~ ., data = iris)

# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

# Create plot
d %>% 
  gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = Sepal.Width)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +
  theme_bw()

图片 13

经过者的图可以于变量的实际值和预测值。值得注意的凡起颜色之点为实际值,灰度标记的则为预测值。不发生所预期,相比叫前单一的回归,多元回归的预测值相对于实际值的生成减少了。

逻辑回归

今昔扩张及逻辑回归,要求有骨干的工作流(前面说了之手续),然后要取相应变量的预测值和残差值。下面的实例利用hp来预测V/S(vs),vs为二值元素,只有0和1:

# Step 1: Fit the data
d <- mtcars
head(d)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  predicted
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.69769525
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.69769525
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.88099419
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.69769525
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.02608156
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.76479499
##                     residuals
## Mazda RX4         -0.69769525
## Mazda RX4 Wag     -0.69769525
## Datsun 710         0.11900581
## Hornet 4 Drive     0.30230475
## Hornet Sportabout -0.02608156
## Valiant            0.23520501

fit <- glm(vs ~ hp, family = binomial(), data = d)

# Step 2: Obtain predicted and residuals
d$predicted <- predict(fit, type="response")
d$residuals <- residuals(fit, type = "response")

# Steps 3 and 4: plot the results
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 14

假定我们只是怀念使记那些预测错误的项目,可以使用dplyr中的filter()函数实现:

ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point() +

  # > This plots large red circle on misclassified points
  geom_point(data = d %>% filter(vs != round(predicted)),
             color = "red", size = 2) +

  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 15

扩展:使用broom包

使broom包可简化以上步骤,将步骤1及步骤2联。broom包之意向是可以以统计分析对象转换成R中整的数据框结构。在本实例中,函数augment()的意是拿拟合好的回归模型转换成为可用的预测值和残差值得数据框,但是,需要留意以下几点:

  • 预测值和残差值的命名变为.fitted和.resid
  • 咱俩尚会获取额外的外的变量,这些变量需要去,可以以gather()函数和facet_*()函数实图片 16

    # install.packages(‘broom’)
    library(broom)
    library(dplyr)

    # Steps 1 and 2
    d <- lm(mpg ~ hp, data = mtcars) %>%
    augment()

    head(d)
    #> .rownames mpg hp .fitted .se.fit .resid .hat
    #> 1 Mazda RX4 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 2 Mazda RX4 Wag 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 3 Datsun 710 22.8 93 23.75363 0.8726286 -0.9536307 0.05102911
    #> 4 Hornet 4 Drive 21.4 110 22.59375 0.7772744 -1.1937500 0.04048627
    #> 5 Hornet Sportabout 18.7 175 18.15891 0.7405479 0.5410881 0.03675069
    #> 6 Valiant 18.1 105 22.93489 0.8026728 -4.8348913 0.04317538
    #> .sigma .cooksd .std.resid
    #> 1 3.917367 0.0037426122 -0.4211862
    #> 2 3.917367 0.0037426122 -0.4211862
    #> 3 3.924793 0.0017266396 -0.2534156
    #> 4 3.922478 0.0020997190 -0.3154767
    #> 5 3.927667 0.0003885555 0.1427178
    #> 6 3.820288 0.0369380489 -1.2795289

    # Steps 3 and 4
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) +
    geom_segment(aes(xend = hp, yend = .fitted), alpha = .2) + # Note .fitted
    geom_point(aes(alpha = abs(.resid))) + # Note .resid
    guides(alpha = FALSE) +
    geom_point(aes(y = .fitted), shape = 1) + # Note .fitted
    theme_bw()

图片 17

正文链接:http://www.cnblogs.com/homewch/p/5806405.html

原稿链接:https://drsimonj.svbtle.com/visualising-residuals