R语言-数据整形之reshape2包

关键词：r语言 reshape、r语言reshape2、r语言中的reshape2包

R语言中reshape2函数族

前言

前几天放出来的那个R的展示中，有说到其实学R的过程更多的就是熟悉各种函数的过程（学习统计模型不在此列…我个人还是倾向于不要借助软件来学习理论知识，虽然可以直接看codes…笔和纸上的推导还是不可或缺的基本功），然后各种基础函数熟悉了之后很多被打包好的函数就是缩短代码长度的利器了。

excel里面有神奇的“数据透视表（pivot table）”，其实很多时候真的已经很神奇了….不过我还是喜欢R，喜欢R直接输出csv或者xlsx的简洁。揉数据呢（学名貌似叫数据整理），我也还是喜欢写出来代码的形式，而不是直接向excel那样面对结果。只是感觉更加不容易出错吧。

揉数据，顾名思义，就是在原有的数据格式基础上，变化出来其他的形式。比如，长长的时间序列变成宽一点的~当然这个可以简单的借助reshape()函数了。可惜我还是不死心，想找一个更好用的，于是就自然而然的看到了reshape2这个包。

这个包里面函数精华在melt()和*cast()。说实话melt()耗了我一段时间来理解，尤其是为什么需要先melt再cast…后来发现这个步骤简直是无敌啊，什么样的形状都变得更加容易揉了，大赞。

warm-up完毕，还是回到正题吧，怎么用reshape2揉数据呢？虽然reshape2支持array, list和data.frame，但是我一般还是习惯于用data.frame，所以还是说说这东西怎么揉吧。揉数据的第一步就是调用melt()函数，不用担心你的input是什么格式，这个函数array, list和data.frame通吃。然后，要告诉他哪些变量是（唯一）识别一个个体的，这句话是什么意思呢？我们先看melt()的参数：

     melt(data, id.vars, measure.vars,
        variable.name = "variable", ..., na.rm = FALSE,
        value.name = "value")

其中id.vars可以指定一系列变量，然后measure.vars就可以留空了，这样生成的新数据会保留id.vars的所有列，然后增加两个新列：variable和value，一个存储变量的名称一个存储变量值。这样就相当于面板数据的长格式了。直接拷一个作者给出的例子：

原数据：

    head(airquality)
      ozone solar.r wind temp month day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6
    
    dim(airquality)
    [1] 153   6

然后我们将month和day作为识别个体记录的变量，调用melt(airquality, id=c(“month”, “day”))：

    require(reshape2)
    head(melt(airquality, id=c("month", "day")))
      month day variable value
    1     5   1    ozone    41
    2     5   2    ozone    36
    3     5   3    ozone    12
    4     5   4    ozone    18
    5     5   5    ozone    NA
    6     5   6    ozone    28
    
    dim(melt(airquality, id=c("month", "day")))
    [1] 612   4

嗯，这样数据就变长了~然后，就可以随意的cast了…dcast()会给出宽格式的数据，比如我们想把day作为唯一的识别，那么：

    names(airquality) <- tolower(names(airquality))
    aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
    head(dcast(aqm, day ~ variable+month))
    
      day ozone_5 ozone_6 ozone_7 ozone_8 ozone_9 solar.r_5 solar.r_6 solar.r_7 solar.r_8 solar.r_9 wind_5 wind_6 wind_7
    1   1      41      NA     135      39      96       190       286       269        83       167    7.4    8.6    4.1
    2   2      36      NA      49       9      78       118       287       248        24       197    8.0    9.7    9.2
    3   3      12      NA      32      16      73       149       242       236        77       183   12.6   16.1    9.2
    4   4      18      NA      NA      78      91       313       186       101        NA       189   11.5    9.2   10.9
    5   5      NA      NA      64      35      47        NA       220       175        NA        95   14.3    8.6    4.6
    6   6      28      NA      40      66      32        NA       264       314        NA        92   14.9   14.3   10.9
      wind_8 wind_9 temp_5 temp_6 temp_7 temp_8 temp_9
    1    6.9    6.9     67     78     84     81     91
    2   13.8    5.1     72     74     85     81     92
    3    7.4    2.8     74     67     81     82     93
    4    6.9    4.6     62     84     84     86     93
    5    7.4    7.4     56     85     83     85     87
    6    4.6   15.5     66     79     83     87     84

或者对于每个月，求平均数：

     head(dcast(aqm, month ~ variable, mean, margins = c("month", "variable")))
    
      month    ozone  solar.r      wind     temp    (all)
    1     5 23.61538 181.2963 11.622581 65.54839 68.70696
    2     6 29.44444 190.1667 10.266667 79.10000 87.38384
    3     7 59.11538 216.4839  8.941935 83.90323 93.49748
    4     8 59.96154 171.8571  8.793548 83.96774 79.71207
    5     9 31.44828 167.4333 10.180000 76.90000 71.82689
    6 (all) 42.12931 185.9315  9.957516 77.88235 80.05722

当然还有更强大的acast()，配合.函数：

    library(plyr) # needed to access . function
    acast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
    
                 5        6        7        8        9
    ozone 23.61538 29.44444 59.11538 59.96154 31.44828

嗯，基本上数据就可以这么揉来揉去了…哈哈。怎么感觉有点像数据透视表捏？只是更加灵活，还可以自定义函数。

此外还有recast()可以一步到位，只是返回的是list；colsplit()可以分割变量名…函数不多，却精华的很啊。

Example_1

    # code_1
    require(reshape2)
    x = data.frame(subject = c("John", "Mary"), 
                   time = c(1,1),
                   age = c(33,NA),
                   weight = c(90, NA),
                   height = c(2,2))
    x
      subject time age weight height
    1    John    1  33     90      2
    2    Mary    1  NA     NA      2

    ------------------------------------------------------
    
    # code_2
    molten = melt(x, id = c("subject", "time"))
    molten
      subject time variable value
    1    John    1      age    33
    2    Mary    1      age    NA
    3    John    1   weight    90
    4    Mary    1   weight    NA
    5    John    1   height     2
    6    Mary    1   height     2
    
    ------------------------------------------------------      

    # code_3
    molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
    molten
      subject time variable value
    1    John    1      age    33
    3    John    1   weight    90
    5    John    1   height     2
    6    Mary    1   height     2

    ------------------------------------------------------
    
    # 语句
    dcast(molten, formula = time + subject ~ variable)
    dcast(molten, formula = subject + time  ~ variable)
    dcast(molten, formula = subject  ~ variable)
    dcast(molten, formula = ...  ~ variable)            
    
    # 结果
    > dcast(molten, formula = time + subject ~ variable)
      time subject age weight height
    1    1    John  33     90      2
    2    1    Mary  NA     NA      2
    > dcast(molten, formula = subject + time  ~ variable)
      subject time age weight height
    1    John    1  33     90      2
    2    Mary    1  NA     NA      2
    > dcast(molten, formula = subject  ~ variable)
      subject age weight height
    1    John  33     90      2
    2    Mary  NA     NA      2
    > dcast(molten, formula = ...  ~ variable)
      subject time age weight height
    1    John    1  33     90      2
    2    Mary    1  NA     NA      2
    
    ------------------------------------------------------
    
    # 语句 
    acast(molten, formula = subject  ~ time ~ variable)     
    
    # 结果
    > acast(molten, formula = subject  ~ time ~ variable)
    , , age
    
          1
    John 33
    Mary NA
    
    , , weight
    
          1
    John 90
    Mary NA
    
    , , height
    
         1
    John 2
    Mary 2
    
    ------------------------------------------------------

    # Melt French Fries dataset
    data(french_fries)
    head(french_fries)
    ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
    head(ffm)
    
    # Aggregate examples - all 3 yield the same result
    dcast(ffm, treatment ~ .)
    dcast(ffm, treatment ~ ., function(x) length(x))
    dcast(ffm, treatment ~ ., length) 
    
    # Passing further arguments through ...
    dcast(ffm, treatment ~ ., sum)
    dcast(ffm, treatment ~ ., sum, trim = 0.1)

    ------------------------------------------------------

Example_2

Converting data between wide and long format

Problem

You want to do convert data from a wide format to a long format.

Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.

Solution

There are two sets of methods that are explained below:

gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.
melt() and dcast() from the reshape2 package.

There are a number of other methods which aren’t covered here, since they are not as easy to use:

The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.
stack() and unstack()

Sample data

These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.

    olddata_wide <- read.table(header=TRUE, text='
     subject sex control cond1 cond2
           1   M     7.9  12.3  10.7
           2   F     6.3  10.6  11.1
           3   F     9.5  13.1  13.8
           4   M    11.5  13.4  12.9
    ')
    # Make sure the subject column is a factor
    olddata_wide$subject <- factor(olddata_wide$subject)

    olddata_long <- read.table(header=TRUE, text='
     subject sex condition measurement
           1   M   control         7.9
           1   M     cond1        12.3
           1   M     cond2        10.7
           2   F   control         6.3
           2   F     cond1        10.6
           2   F     cond2        11.1
           3   F   control         9.5
           3   F     cond1        13.1
           3   F     cond2        13.8
           4   M   control        11.5
           4   M     cond1        13.4
           4   M     cond2        12.9
    ')
    # Make sure the subject column is a factor
    olddata_long$subject <- factor(olddata_long$subject)

tidyr

From wide to long

Use `gather`:

    olddata_wide
    #>   subject sex control cond1 cond2
    #> 1       1   M     7.9  12.3  10.7
    #> 2       2   F     6.3  10.6  11.1
    #> 3       3   F     9.5  13.1  13.8
    #> 4       4   M    11.5  13.4  12.9
    
    library(tidyr)
    
    # The arguments to gather():
    # - data: Data object
    # - key: Name of new key column (made from names of data columns)
    # - value: Name of new value column
    # - ...: Names of source columns that contain values
    data_long <- gather(olddata_wide, condition, measurement, control:cond2)
    data_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 2        2   F   control         6.3
    #> 3        3   F   control         9.5
    #> 4        4   M   control        11.5
    #> 5        1   M     cond1        12.3
    #> 6        2   F     cond1        10.6
    #> 7        3   F     cond1        13.1
    #> 8        4   M     cond1        13.4
    #> 9        1   M     cond2        10.7
    #> 10       2   F     cond2        11.1
    #> 11       3   F     cond2        13.8
    #> 12       4   M     cond2        12.9

In this example, the source columns that are gathered are specified with control:cond2. This means to use all the columns, positionally, between control and cond2. Another way of doing it is to name the columns individually, as in:

    gather(olddata_wide, condition, measurement, control, cond1, cond2)

If you need to use gather() programmatically, you may need to use variables containing column names. To do this, you should use the gather_()function instead, which takes strings instead of bare (unquoted) column names.

    keycol <- "condition"
    valuecol <- "measurement"
    gathercols <- c("control", "cond1", "cond2")
    
    gather_(olddata_wide, keycol, valuecol, gathercols)

Optional: Rename the factor levels of the variable column, and sort.

    # Rename factor names from "cond1" and "cond2" to "first" and "second"
    levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
    levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
    
    # Sort by subject first, then by condition
    data_long <- data_long[order(data_long$subject, data_long$condition), ]
    data_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 5        1   M     first        12.3
    #> 9        1   M    second        10.7
    #> 2        2   F   control         6.3
    #> 6        2   F     first        10.6
    #> 10       2   F    second        11.1
    #> 3        3   F   control         9.5
    #> 7        3   F     first        13.1
    #> 11       3   F    second        13.8
    #> 4        4   M   control        11.5
    #> 8        4   M     first        13.4
    #> 12       4   M    second        12.9

From long to wide

Use `spread`:

    olddata_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 2        1   M     cond1        12.3
    #> 3        1   M     cond2        10.7
    #> 4        2   F   control         6.3
    #> 5        2   F     cond1        10.6
    #> 6        2   F     cond2        11.1
    #> 7        3   F   control         9.5
    #> 8        3   F     cond1        13.1
    #> 9        3   F     cond2        13.8
    #> 10       4   M   control        11.5
    #> 11       4   M     cond1        13.4
    #> 12       4   M     cond2        12.9
    
    library(tidyr)
    
    # The arguments to spread():
    # - data: Data object
    # - key: Name of column containing the new column names
    # - value: Name of column containing values
    data_wide <- spread(olddata_long, condition, measurement)
    data_wide
    #>   subject sex cond1 cond2 control
    #> 1       1   M  12.3  10.7     7.9
    #> 2       2   F  10.6  11.1     6.3
    #> 3       3   F  13.1  13.8     9.5
    #> 4       4   M  13.4  12.9    11.5

Optional: A few things to make the data look nicer.

    # Rename cond1 to first, and cond2 to second
    names(data_wide)[names(data_wide)=="cond1"] <- "first"
    names(data_wide)[names(data_wide)=="cond2"] <- "second"
    
    # Reorder the columns
    data_wide <- data_wide[, c(1,2,5,3,4)]
    data_wide
    #>   subject sex control first second
    #> 1       1   M     7.9  12.3   10.7
    #> 2       2   F     6.3  10.6   11.1
    #> 3       3   F     9.5  13.1   13.8
    #> 4       4   M    11.5  13.4   12.9

The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.

reshape2

From wide to long

Use `melt`:

    olddata_wide
    #>   subject sex control cond1 cond2
    #> 1       1   M     7.9  12.3  10.7
    #> 2       2   F     6.3  10.6  11.1
    #> 3       3   F     9.5  13.1  13.8
    #> 4       4   M    11.5  13.4  12.9
    
    library(reshape2)
    
    # Specify id.vars: the variables to keep but not split apart on
    melt(olddata_wide, id.vars=c("subject", "sex"))
    #>    subject sex variable value
    #> 1        1   M  control   7.9
    #> 2        2   F  control   6.3
    #> 3        3   F  control   9.5
    #> 4        4   M  control  11.5
    #> 5        1   M    cond1  12.3
    #> 6        2   F    cond1  10.6
    #> 7        3   F    cond1  13.1
    #> 8        4   M    cond1  13.4
    #> 9        1   M    cond2  10.7
    #> 10       2   F    cond2  11.1
    #> 11       3   F    cond2  13.8
    #> 12       4   M    cond2  12.9

There are options for melt that can make the output a little easier to work with:

    data_long <- melt(olddata_wide,
            # ID variables - all the variables to keep but not split apart on
        id.vars=c("subject", "sex"),
            # The source columns
        measure.vars=c("control", "cond1", "cond2" ),
            # Name of the destination column that will identify the original
            # column that the measurement came from
        variable.name="condition",
        value.name="measurement"
    )
    data_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 2        2   F   control         6.3
    #> 3        3   F   control         9.5
    #> 4        4   M   control        11.5
    #> 5        1   M     cond1        12.3
    #> 6        2   F     cond1        10.6
    #> 7        3   F     cond1        13.1
    #> 8        4   M     cond1        13.4
    #> 9        1   M     cond2        10.7
    #> 10       2   F     cond2        11.1
    #> 11       3   F     cond2        13.8
    #> 12       4   M     cond2        12.9

If you leave out the measure.vars, melt will automatically use all the other variables as the id.vars. The reverse is true if you leave out id.vars.

If you don’t specify variable.name, it will name that column “variable”, and if you leave out value.name, it will name that column “measurement”.

Optional: Rename the factor levels of the variable column.

    # Rename factor names from "cond1" and "cond2" to "first" and "second"
    levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
    levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
    
    # Sort by subject first, then by condition
    data_long <- data_long[ order(data_long$subject, data_long$condition), ]
    data_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 5        1   M     first        12.3
    #> 9        1   M    second        10.7
    #> 2        2   F   control         6.3
    #> 6        2   F     first        10.6
    #> 10       2   F    second        11.1
    #> 3        3   F   control         9.5
    #> 7        3   F     first        13.1
    #> 11       3   F    second        13.8
    #> 4        4   M   control        11.5
    #> 8        4   M     first        13.4
    #> 12       4   M    second        12.9

From long to wide

The following code uses dcast to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast instead.

    olddata_long
    #>    subject sex condition measurement
    #> 1        1   M   control         7.9
    #> 2        1   M     cond1        12.3
    #> 3        1   M     cond2        10.7
    #> 4        2   F   control         6.3
    #> 5        2   F     cond1        10.6
    #> 6        2   F     cond2        11.1
    #> 7        3   F   control         9.5
    #> 8        3   F     cond1        13.1
    #> 9        3   F     cond2        13.8
    #> 10       4   M   control        11.5
    #> 11       4   M     cond1        13.4
    #> 12       4   M     cond2        12.9
    
    # From the source:
    # "subject" and "sex" are columns we want to keep the same
    # "condition" is the column that contains the names of the new column to put things in
    # "measurement" holds the measurements
    library(reshape2)
    
    data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement")
    data_wide
    #>   subject sex cond1 cond2 control
    #> 1       1   M  12.3  10.7     7.9
    #> 2       2   F  10.6  11.1     6.3
    #> 3       3   F  13.1  13.8     9.5
    #> 4       4   M  13.4  12.9    11.5

Optional: A few things to make the data look nicer.

    # Rename cond1 to first, and cond2 to second
    names(data_wide)[names(data_wide)=="cond1"] <- "first"
    names(data_wide)[names(data_wide)=="cond2"] <- "second"
    
    # Reorder the columns
    data_wide <- data_wide[, c(1,2,5,3,4)]
    data_wide
    #>   subject sex control first second
    #> 1       1   M     7.9  12.3   10.7
    #> 2       2   F     6.3  10.6   11.1
    #> 3       3   F     9.5  13.1   13.8
    #> 4       4   M    11.5  13.4   12.9

The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.

转载请注明：数据分析 » R语言-数据整形之reshape2包_r语言 reshape

R语言-数据整形之reshape2包

R语言中reshape2函数族

前言

Example_1

Example_2

Converting data between wide and long format

Problem

Solution

Sample data

tidyr

From wide to long

Use gather:

From long to wide

Use spread:

reshape2

From wide to long

Use melt:

From long to wide

Use `gather`:

Use `spread`:

Use `melt`: