残差可视化

简介

当统计模型中残差对于鉴别统计模型的秘闻模型是一个那个重要的因素。比如:线性回归模型就要求残差是同方差的,如果未是,则表明该模型不顶准确,有或是非线性数据。本文基于回归模型使用不同之办法对残差进行可视化。

事关知识点

  • 熟识残差的概念与求解过程;
  • 见面利用R语言进行回归(线性和逻辑)分析,并且采用ggplot2展开可视化,会用dplyr和tidyr包进行数量操作。

就能显示的可视化信息

以R中可一直采用plot()绘制合适的回归模型。下面是故mtcars数据集进行操作,对每公里加仑数(miles per gallon for each
car (mpg))和力气(horsepower
(hp))进行回归分析并可视化模型与残差的音。

代码:

fit <- lm(mpg ~ hp, data = mtcars)  # 拟合模型
summary(fit)  # 结果报告
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.7121 -2.1122 -0.8854  1.5819  8.2360 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
#> hp          -0.06823    0.01012  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit)  # Plot the model information

图片 1

par(mfrow = c(1, 1))  # Return plotting panel to 1 section

方的模型使用的风俗人情的道来分解残差项并解释模型是否富有问题。现在,我们考虑针对图片做有变动(和重俱视觉吸引力)来展开补给。

步骤

  1. 找到一个回归模型来预测变量Y;
  2. 获取预测值和残差值与Y的每个观测值的涉嫌;
  3. 针对Y
    的实际值和展望值进行可视化,以至于他们只是分别,但又有联系;
  4. 采取残差做美学调整(例如:用革命表示残差非常高之价)来高亮模型的比较差预测的触发。

简单线性回归

Step1:fit the model拟合模型

第一,我们血得拟合我们的范。在本实例中,复制mtcars数据集及新的对象d中,如下面的操作:

d <- mtcars
fit <- lm(mpg ~ hp, data = d)
summary(fit)

## Call:
##   lm(formula = mpg ~ hp, data = d)
## 
## Residuals:
##   Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
##   hp          -0.06823    0.01012  -6.742 1.79e-07 ***
##   ---
##   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Step2:obtain predicted and residual calues获得预测值和残差值

d$predicted <- predict(fit)   # 保存预测值
d$residuals <- residuals(fit) # 保存残差值

# 使用dplyr包快速查看实际值、预测值和残差值
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#>                    mpg predicted  residuals
#> Mazda RX4         21.0  22.59375 -1.5937500
#> Mazda RX4 Wag     21.0  22.59375 -1.5937500
#> Datsun 710        22.8  23.75363 -0.9536307
#> Hornet 4 Drive    21.4  22.59375 -1.1937500
#> Hornet Sportabout 18.7  18.15891  0.5410881
#> Valiant           18.1  22.93489 -4.8348913

Step3:plot the actual and predicted calues可视化实际值和预测值

制图实际值散点图:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +  # 设置画布与结果变量y轴
  geom_point()  # 绘实际值散点

图片 2

绘图预测值散点图:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)  # Add the predicted values

图片 3

面的希冀很不便看起实际值和预测值关联,使用geom_segment()可以拿实际值点和他么的前瞻值进行对应:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted)) +
  geom_point() +
  geom_point(aes(y = predicted), shape = 1)

图片 4

基于ggplot2的有的参数进行最后之调,包括:

  • 主题,清除全部外观theme_bw();
  • 透过调整alpha透明度淡出连接线;
  • 使用geom_smooth()添加回归斜率。

    library(ggplot2)
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) + # Plot regression slope
    geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # alpha to fade lines
    geom_point() +
    geom_point(aes(y = predicted), shape = 1) +
    theme_bw() # Add theme for cleaner look

图片 5

Setp4:use residuals to adjust使用残差调整

末了,我们盼望由此残差的价值来高显示其值得大小。有良多种计可达标目的。下面的代码给有了化解方案:

# ALPHA
# 同过残差的绝对值来改变实际值得alpha值
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Alpha adjustments made here...
  geom_point(aes(alpha = abs(residuals))) +  # Alpha mapped to abs(residuals)
  guides(alpha = FALSE) +  # Alpha legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 6

 

足见见,残差值越聊,散点就更加透明。

 

# COLOR
# High residuals (in abolsute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = abs(residuals))) + # Color mapped to abs(residuals)
  scale_color_continuous(low = "black", high = "red") +  # Colors to use here
  guides(color = FALSE) +  # Color legend removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 7

# SIZE AND COLOR
# Same coloring as above, size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color AND size adjustments made here...
  geom_point(aes(color = abs(residuals), size = abs(residuals))) + # size also mapped
  scale_color_continuous(low = "black", high = "red") +
  guides(color = FALSE, size = FALSE) +  # Size legend also removed
  # <

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 8

# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +

  # > Color adjustments made here...
  geom_point(aes(color = residuals)) +  # Color mapped here
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
  guides(color = FALSE) +

  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 9

最终一个实例非常实用,因为颜料会拉确定数据的非线性。如例子中所出示,从图备受好发现hp较多之新民主主义革命的极值(实际值比预计值大)。而当hp的中价值部分也闹更多的蓝色,表明这些点处的实际值小于预测值。总的来说,表明两独变量间的关系是非线性的,在回归方程中补充加二软项或会见又好之建模。

多元回归

该实例中,我们利用马力(hp)、重量(wt)和动(disp)对各公里加仑数据(mpg)进行回归:

# 选择感兴趣的数据
library(dplyr)
d <- mtcars %>% select(mpg, hp, wt, disp)

# 利用多元回归拟合模型
fit <- lm(mpg ~ hp + wt+ disp, data = d)

# 获得预测值和残差值
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

head(d)
#>                    mpg  hp    wt disp predicted  residuals
#> Mazda RX4         21.0 110 2.620  160  23.57003 -2.5700299
#> Mazda RX4 Wag     21.0 110 2.875  160  22.60080 -1.6008028
#> Datsun 710        22.8  93 2.320  108  25.28868 -2.4886829
#> Hornet 4 Drive    21.4 110 3.215  258  21.21667  0.1833269
#> Hornet Sportabout 18.7 175 3.440  360  18.24072  0.4592780
#> Valiant           18.1 105 3.460  225  20.47216 -2.3721590

制图实际值和预测值得散点图,并消除背景与用那连续:

library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +  # Lines to connect points
  geom_point() +  # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +  # Points of predicted values
  theme_bw()

图片 10

同,我们得采用残差对图片进行优化,以便让了解:

ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 11

到目前为止,代码和前基本没有多可怜的转移,可是,由于是多元回归,预测值不再是线性整齐的排列了。

平软利用多单预测椅子绘图

制图一个独的变量是是的,但是,多元回归的目的是研讨多独变量。为了能够又采用多个预计因子,这里发出只技术,就是运tidyr包中之gather()来拿富有数据变成长数,多排成一排列,然后使用ggplot中之facet_*()分面来拿面板分开显示。

如果想如果询问再多点的学识可以点击下面的连年进行阅读:

Plotting background data for groups with
ggplot2

Plot some variables against many others with tidyr and
ggplot2

Quick plot of all
variables

OK,现在我们来兑现地方的要求:

library(tidyr)
d %>% 
  gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>%  # 数据整形,将hp wt disp变为一列
  ggplot(aes(x = x, y = mpg)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()

图片 12

兹我们好试试一下别样的数据集,比如,iris数据集,利用除了Sepal.Width变量的其它连续型变量来回归Sepal.Width变量:

head(iris)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

d <- iris %>% select(-Species)

# Fit the model
fit <- lm(Sepal.Width ~ ., data = iris)

# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

# Create plot
d %>% 
  gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = Sepal.Width)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +
  theme_bw()

图片 13

透过地方的觊觎可以比变量的实际值和预测值。值得注意的凡生颜色的点为实际值,灰度标记的尽管也预测值。不出所预期,相比于前方单一的回归,多元回归的预测值相对于实际值的变更减少了。

逻辑回归

今日扩张及逻辑回归,要求有些中坚的工作流(前面说了之步调),然后要取相应变量的预测值和残差值。下面的实例利用hp来预测V/S(vs),vs为二值元素,只有0和1:

# Step 1: Fit the data
d <- mtcars
head(d)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  predicted
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.69769525
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.69769525
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.88099419
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.69769525
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.02608156
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.76479499
##                     residuals
## Mazda RX4         -0.69769525
## Mazda RX4 Wag     -0.69769525
## Datsun 710         0.11900581
## Hornet 4 Drive     0.30230475
## Hornet Sportabout -0.02608156
## Valiant            0.23520501

fit <- glm(vs ~ hp, family = binomial(), data = d)

# Step 2: Obtain predicted and residuals
d$predicted <- predict(fit, type="response")
d$residuals <- residuals(fit, type = "response")

# Steps 3 and 4: plot the results
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 14

假如我们只是想只要记那些预测错误的种,可以应用dplyr中之filter()函数实现:

ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point() +

  # > This plots large red circle on misclassified points
  geom_point(data = d %>% filter(vs != round(predicted)),
             color = "red", size = 2) +

  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

图片 15

扩展:使用broom包

采取broom包可简化以上步骤,将步骤1同步子2统一。broom包的企图是得用统计分析对象转换成R中整的数据框结构。在本实例中,函数augment()的用意是以起草合好的回归模型转换成可用之预测值和残差值得数据框,但是,需要小心以下几点:

  • 预测值和残差值的命名变为.fitted和.resid
  • 咱尚会获取额外的其余的变量,这些变量需要去,可以以gather()函数和facet_*()函数实图片 16

    # install.packages(‘broom’)
    library(broom)
    library(dplyr)

    # Steps 1 and 2
    d <- lm(mpg ~ hp, data = mtcars) %>%
    augment()

    head(d)
    #> .rownames mpg hp .fitted .se.fit .resid .hat
    #> 1 Mazda RX4 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 2 Mazda RX4 Wag 21.0 110 22.59375 0.7772744 -1.5937500 0.04048627
    #> 3 Datsun 710 22.8 93 23.75363 0.8726286 -0.9536307 0.05102911
    #> 4 Hornet 4 Drive 21.4 110 22.59375 0.7772744 -1.1937500 0.04048627
    #> 5 Hornet Sportabout 18.7 175 18.15891 0.7405479 0.5410881 0.03675069
    #> 6 Valiant 18.1 105 22.93489 0.8026728 -4.8348913 0.04317538
    #> .sigma .cooksd .std.resid
    #> 1 3.917367 0.0037426122 -0.4211862
    #> 2 3.917367 0.0037426122 -0.4211862
    #> 3 3.924793 0.0017266396 -0.2534156
    #> 4 3.922478 0.0020997190 -0.3154767
    #> 5 3.927667 0.0003885555 0.1427178
    #> 6 3.820288 0.0369380489 -1.2795289

    # Steps 3 and 4
    ggplot(d, aes(x = hp, y = mpg)) +
    geom_smooth(method = “lm”, se = FALSE, color = “lightgrey”) +
    geom_segment(aes(xend = hp, yend = .fitted), alpha = .2) + # Note .fitted
    geom_point(aes(alpha = abs(.resid))) + # Note .resid
    guides(alpha = FALSE) +
    geom_point(aes(y = .fitted), shape = 1) + # Note .fitted
    theme_bw()

图片 17

正文链接:http://www.cnblogs.com/homewch/p/5806405.html

初稿链接:https://drsimonj.svbtle.com/visualising-residuals