Load a txt file into R and replace certain values based on another data frame

2022-01-30 14:47:05 · Tags: r, dplyr, stringr, readlines

I have a large txt file with a specific format structure. My goal is to load the text into R with readLines, and I want to replace each record's weight value with a new value based on my df data frame. I don't want to change the .txt data structure format or parse the .txt file. The final output should have exactly the same structure as the original .txt (writeLines()). How can I read it and update the values? Thanks

Here is my reference data frame

df <- tibble::tribble(
        ~House_id,  ~id, ~new_weight,
  18105265, "Mab",        4567,
  18117631, "Maa",        3367,
  18121405, "Mab",        4500,
  71811763, "Maa",        2455,
  71811763, "Mab",        2872
  ) 

Here is a small subset of my .txt file

H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0

Here is the desired output for the first standalone record, house_id = 18105265: the line Mab_3416311514210525745_W923650.80 is updated with the new value from df to Mab_3416311514210525745_W4567:

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959

### Edit: added id to the lookup to distinguish non-unique House_id values.

Here is an approach where I read the data, join it to the updated weights in df, and then use the new weights to create updated values on the lines starting with "M".

library(tidyverse)
read_fwf("txt_sample.txt" ,  col_positions = fwf_empty("txt_sample.txt")) %>% # edit suggested by DanG
# if the row starts with H, extract 8 digit house number and
# use that to join to the table with new weights
mutate(House_id = if_else(str_starts(X1, "H"), as.numeric(str_sub(X1, 2,9)), NA_real_),
       id = if_else(str_starts(X1, "M"), str_sub(X1, 1,3), NA_character_)) %>%
fill(House_id) %>%
left_join(df, by = c("House_id", "id")) %>%
fill(new_weight) %>%
# make new string using updated weight (or keep existing string)
mutate(X1_new = coalesce(
  if_else(str_starts(X1, "M"),
          paste0(word(X1, end = 2, sep = "_"), "_W", new_weight),
          NA_character_),
  X1)) %>%
pull(X1_new) %>% 
writeLines()

Output

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

### You can try the base R code below

writeLines(
  do.call(
    paste0,
    lapply(
      unlist(
        strsplit(
          readChar("test.txt", file.info("test.txt")$size),
          "(?<=\\d)\n(?=H)",
          perl = TRUE
        )
      ),
      function(x) {
        with(
          df,
          Reduce(
            function(x, ps) sub(ps[[1]], ps[[2]], x),
            asplit(rbind(
              unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
              paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
            ), 2),
            init = x
          )
        )
      }
    )
  )
)

This gives

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

Breaking down the code

We first split the long string into smaller chunks using

      unlist(
        strsplit(
          readChar("test.txt", file.info("test.txt")$size),
          "(?<=\\d)\n(?=H)",
          perl = TRUE
        )
      )
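As a minimal standalone illustration of this split (with a made-up toy string, not the real file): a newline is consumed as a split point only where a digit precedes it and an "H" follows it.

```r
# Toy string with two "records"; only the newline between "B_2" and
# "H123" sits between a digit and an "H", so only that one is a split point
txt <- "A_1\nB_2\nH123\nC_3\n"
chunks <- unlist(strsplit(txt, "(?<=\\d)\n(?=H)", perl = TRUE))
chunks
# [1] "A_1\nB_2"    "H123\nC_3\n"
```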

For the substrings within each chunk, we find the matching House_id + id and replace the weight part, e.g. Wxxxxxx, with the corresponding new_weight value

        with(
          df,
          Reduce(
            function(x, ps) sub(ps[[1]], ps[[2]], x),
            asplit(
              rbind(
              unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
              paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
            ), 2),
            init = x
          )
        )

Note that the last chunk has two different matching ids; we use Reduce to replace the weights iteratively.
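A minimal sketch of that Reduce pattern with toy values (not the real data): each element of pairs is a c(pattern, replacement) vector, and sub() is applied once per pair, threading each result into the next call.

```r
x <- "Maa_1_W100\nMab_2_W200\n"
pairs <- list(c("W100", "W2455"), c("W200", "W2872"))
# Reduce feeds the output of each sub() call into the next one,
# so both weights get replaced within the same chunk
res <- Reduce(function(x, ps) sub(ps[[1]], ps[[2]], x), pairs, init = x)
res
# [1] "Maa_1_W2455\nMab_2_W2872\n"
```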

### You have to loop through the different lines you obtain after readLines of your text document. You can parse the House_id from the lines starting with H, using hpatt = 'H[0-9]+_0' as regex, and then apply the stringr package to process the lines:

for (i in 1:length(lines)){
  line = lines[[i]]
  # detect if line looks like 'H[number]_0'
  if (stringr::str_detect(line, hpatt)){
    # if it does, extract the 'house_id' from the line
    h_id = stringr::str_extract(line, pattern = 'H[0-9]+') %>% 
      stringr::str_replace('H', '')
  }

In the second part, you can replace the original weight with one from your tibble (which I call replacetibble here). I use the regex mpatt = '^[a-zA-Z]+_[0-9]+_W[0-9\\.]+$' to find strings that look like [characters-only name]_[number]_W[number with decimal]:

  if (stringr::str_detect(line, mpatt)){
    # split string to get 'id'; use a name that does not shadow the
    # 'id' column, otherwise filter(id == id) would always be TRUE
    line_id = stringr::str_split(line, '_')[[1]][[1]]
    # look up the new weight; pull() returns a plain value, not a tibble
    wt = replacetibble %>% 
      filter(house_id == h_id & id == line_id) %>% 
      pull(weight)
    # replace number in line, split the original line by the 'W'
    # this will of course break if your id contains a W - please
    # adapt logic according to your naming rules
    replaceline = stringr::str_split(line, 'W')[[1]]
    replaceline[length(replaceline)] = wt
    # put the line back together with a 'W' character
    lines[[i]] = paste0(replaceline, collapse = 'W')
  }
}

stringr (its cheat sheet is worth bookmarking) is generally quite powerful for working with strings.

I will leave the loading and saving part to you.
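For completeness, a minimal round-trip sketch (using a temporary file standing in for your actual document) showing that readLines()/writeLines() preserve the line structure:

```r
# A temporary file stands in for the real text document
tmp <- tempfile(fileext = ".txt")
writeLines(c("H18105265_0", "R1_0"), tmp)
lines <- readLines(tmp)
# ... modify 'lines' with the loop above ...
# writing the vector back reproduces the file line for line
writeLines(lines, tmp)
identical(readLines(tmp), c("H18105265_0", "R1_0"))
# [1] TRUE
```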

### I tried to put each step into a new object to better follow what is happening. Do not hesitate to ask if any of the regexes is unclear to you.

Individual IDs are assumed to start only with "Ma(any character)_"; this can easily be extended, so one house ID can contain any number of individuals.

library(tidyverse)
df <- tibble::tribble(
  ~House_id,  ~id, ~new_weight,
  18105265, "Mab",        4567,
  18117631, "Maa",        3367,
  18121405, "Mab",        4500,
  71811763, "Maa",        2455,
  71811763, "Mab",        2872
)
# read the data
dat <- readLines("test.txt")
# convert to tibble
dat2 <- tibble::tibble(X = dat)
# keep relevant info, i.e. house IDs and individual IDs
dat3 <- dat2 %>% 
  rowid_to_column() %>% 
  filter(grepl(pattern = "H[0-9]+_0", X) | 
           grepl(pattern = "^Ma._[0-9]+", X))
dat3
#> # A tibble: 9 × 2
#>   rowid X                                 
#>   <int> <chr>                             
#> 1     1 H18105265_0                       
#> 2     3 Mab_3416311514210525745_W923650.80
#> 3     8 H18117631_0                       
#> 4    10 Maa_1240111711220682016_W123650.80
#> 5    13 H18121405_0                       
#> 6    15 Mab_2467211713110643835_W923650.80
#> 7    19 H71811763_0                       
#> 8    21 Maa_5325411210120486554_W923650.80
#> 9    22 Mab_5325411210110485554_W723650.80
# determine which individuals belong to which house
dat4 <- dat3 %>% 
  mutate(house1 = grepl(pattern = "H[0-9]+_0", X)) %>% 
  mutate(house2 = cumsum(house1))
dat4
#> # A tibble: 9 × 4
#>   rowid X                                  house1 house2
#>   <int> <chr>                              <lgl>   <int>
#> 1     1 H18105265_0                        TRUE        1
#> 2     3 Mab_3416311514210525745_W923650.80 FALSE       1
#> 3     8 H18117631_0                        TRUE        2
#> 4    10 Maa_1240111711220682016_W123650.80 FALSE       2
#> 5    13 H18121405_0                        TRUE        3
#> 6    15 Mab_2467211713110643835_W923650.80 FALSE       3
#> 7    19 H71811763_0                        TRUE        4
#> 8    21 Maa_5325411210120486554_W923650.80 FALSE       4
#> 9    22 Mab_5325411210110485554_W723650.80 FALSE       4
dat4b <- dat4 %>% 
  filter(grepl(pattern = "H[0-9]+_0", X)) %>% 
  select(house_id = X, house2)
dat4b
#> # A tibble: 4 × 2
#>   house_id    house2
#>   <chr>        <int>
#> 1 H18105265_0      1
#> 2 H18117631_0      2
#> 3 H18121405_0      3
#> 4 H71811763_0      4
# combine house and individual ids next to each other
dat5 <- dat4 %>% 
  left_join(dat4b,
            by = "house2") %>% 
  mutate(prefix = gsub(pattern = "_.+", replacement = "", x = X),
         house_id = as.numeric(gsub("^H|_0", "", house_id))) %>% 
  select(rowid, house_id, prefix, X) %>% 
  filter(grepl(pattern = "^Ma._[0-9]+", X)) 
dat5
#> # A tibble: 5 × 4
#>   rowid house_id prefix X                                 
#>   <int>    <dbl> <chr>  <chr>                             
#> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80
#> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80
#> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80
#> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80
#> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80
# add the new information about individual ids
dat6 <- left_join(dat5, df,
                  by = c("house_id" = "House_id",
                         "prefix" = "id"))
dat6
#> # A tibble: 5 × 5
#>   rowid house_id prefix X                                  new_weight
#>   <int>    <dbl> <chr>  <chr>                                   <dbl>
#> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80       4567
#> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80       3367
#> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80       4500
#> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80       2455
#> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80       2872
# generate the new ids
dat7 <- dat6 %>% 
  mutate(Y = gsub(pattern = "(?=W).+", replacement = "", x = X, perl = T),
         X_new = paste0(Y, "W", new_weight)) %>% 
  select(rowid, X_new)
dat7
#> # A tibble: 5 × 2
#>   rowid X_new                        
#>   <int> <chr>                        
#> 1     3 Mab_3416311514210525745_W4567
#> 2    10 Maa_1240111711220682016_W3367
#> 3    15 Mab_2467211713110643835_W4500
#> 4    21 Maa_5325411210120486554_W2455
#> 5    22 Mab_5325411210110485554_W2872
# replace the old ids by the new ones
dat[dat7$rowid] <- dat7$X_new
dat
#>  [1] "H18105265_0"                                                                           
#>  [2] "R1_0"                                                                                  
#>  [3] "Mab_3416311514210525745_W4567"                                                         
#>  [4] "T1_0"                                                                                  
#>  [5] "T2_0"                                                                                  
#>  [6] "T3_0"                                                                                  
#>  [7] "V64_0_2_010_ab171900171959"                                                            
#>  [8] "H18117631_0"                                                                           
#>  [9] "R1_0"                                                                                  
#> [10] "Maa_1240111711220682016_W3367"                                                         
#> [11] "T1_0"                                                                                  
#> [12] "V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059"
#> [13] "H18121405_0"                                                                           
#> [14] "R1_0"                                                                                  
#> [15] "Mab_2467211713110643835_W4500"                                                         
#> [16] "T1_0"                                                                                  
#> [17] "T2_0"                                                                                  
#> [18] "V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359"          
#> [19] "H71811763_0"                                                                           
#> [20] "R1_0"                                                                                  
#> [21] "Maa_5325411210120486554_W2455"                                                         
#> [22] "Mab_5325411210110485554_W2872"                                                         
#> [23] "T1_0"                                                                                  
#> [24] "T2_0"                                                                                  
#> [25] "T3_0"                                                                                  
#> [26] "T4_0"
# write back the updated data
# writeLines(...)

### Here is a dplyr solution that uses a single left_join() but otherwise relies exclusively on vectorized operations, which are far more efficient than looping through a large dataset.

While the code might look lengthy, that is simply a formatting choice: for the sake of clarity, I use the format

foo(
  arg_1 = bar,
  arg_2 = baz,
  # ...
  arg_n = qux
) 

rather than the one-liner foo(bar, baz, qux). Also for the sake of clarity, I will elaborate on the line

    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],

in the Details section.

Solution

Given a file like your subset.txt here

H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0

and a reference dataset like your df here

df <- tibble::tribble(
  ~House_id,   ~id, ~new_weight,
   18105265, "Mab",        4567,
   18117631, "Maa",        3367,
   18121405, "Mab",        4500,
   71811763, "Maa",        2455,
   71811763, "Mab",        2872
)

the following solution

# For manipulating data.
library(dplyr)
# ...
# Code to generate your reference 'df'.
# ...
# Specify the filepath.
text_filepath <- "subset.txt"
# Define the textual pattern for each data item we want, where the relevant
# values are divided into their own capture groups.
regex_house_id <- "(H)(\\d+)(_)(\\d)"
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# Read the textual data (into a dataframe).
data.frame(data = readLines(text_filepath)) %>%
  # Transform the textual data.
  mutate(
    # Target (TRUE) the identifying row (house ID) for each (contiguous) group.
    target = grepl(
      # Use the textual pattern for house IDs.
      pattern = regex_house_id,
      x = data
    ),
    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],
    # Extract the underlying numeric ID from the house ID.
    House_id = gsub(
      pattern = regex_house_id,
      # The numeric ID is in the 2nd capture group.
      replacement = "\\2",
      x = House_id
    ),
    # Treat the numeric ID as a number.
    House_id = as.numeric(House_id),
    # Target (TRUE) the weighted labels.
    target = grepl(
      # Use the textual pattern for weighted labels.
      pattern = regex_weighted_label,
      x = data
    ),
    # Extract the ID from (only) the weighted labels.
    id = if_else(
      target,
      gsub(
        pattern = regex_weighted_label,
        # The ID is in the 1st capture group.
        replacement = "\\1",
        x = data
      ),
      # For any data that is NOT a weighted label, give it a blank (NA) ID.
      as.character(NA)
    ),
    # Extract from (only) the weighted labels everything else but the weight.
    rest = if_else(
      target,
      gsub(
        pattern = regex_weighted_label,
        # Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
        # (1st) and the weight (5th).
        replacement = "\\2\\3\\4",
        x = data
      ),
      # For any data that is NOT a weighted label, make it blank (NA) for
      # everything else.
      as.character(NA)
    )
  ) %>%
  # Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
  # nonmatches.
  left_join(df, by = c("House_id", "id")) %>%
  # Replace (only) the weighted labels, with their updated values.
  mutate(
    data = if_else(
      target,
      # Generate the updated value by splicing together the original components
      # with the new weight.
      paste0(id, rest, new_weight),
      # For data that is NOT a weighted label, leave it unchanged.
      data
    )
  ) %>%
  # Extract the column of updated values.
  .$data %>%
  # Overwrite the original text with the updated values.
  writeLines(con = text_filepath)

will transform the textual data and update the original file.

Results

The original file (here subset.txt) will now contain the updated information:

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

Details

Regular Expressions

The textual manipulation relies only on the basic functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each textual pattern, regex_house_id and regex_weighted_label, into its components, distinguished via capture groups in the regex:

#      The "H" prefix.      The "_" separator.
#                  | |      | |
regex_house_id <- "(H)(\\d+)(_)(\\d)"
#                     |    |   |   |
#  The digits following "H".   The "0" suffix (or any digit).
#                                The digits after the 'id'.
#   The 'id': "M" then 2 small letters.   |    |    The weight (possibly a decimal).
#                          |          |   |    |    |              |
regex_weighted_label <-   "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
#                                      | |      |  |
#                       The "_" separator.      The "_" separator and "W" prefix before weight.

We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in a vector my_strings match the format of a weighted label (like "Mab_3416311514210525745_W923650.80").

We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", x = my_labels) to extract the weight (the 5th capture group) from a vector my_labels of labels in that format.
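A small runnable check of both calls, using a label copied from the file above:

```r
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
my_strings <- c("Mab_3416311514210525745_W923650.80", "T1_0")
# grepl() flags which strings are weighted labels
grepl(pattern = regex_weighted_label, x = my_strings)
# [1]  TRUE FALSE
# gsub() keeps only the 5th capture group, i.e. the weight
gsub(pattern = regex_weighted_label, replacement = "\\5", x = my_strings[1])
# [1] "923650.80"
```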

Mapping

In the first mutate() statement, the line

    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],

might appear cryptic. However, it is just a classic arithmetic trick (also used by @mnist in their solution) for indexing contiguous values by group.

The code cumsum(target) scans the target column (as it exists at that point in the workflow), which holds logical values (TRUE FALSE FALSE ...) indicating whether a line of text is a house ID (like "H18105265_0"). Whenever it reaches a TRUE (numeric value 1), it increments its running total, whereas a FALSE (numeric value 0) leaves the total unchanged.

So the data column of text

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
  "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...

gives us this logical target column

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...

whose values (TRUE and FALSE) are coerced to numerics (1 and 0)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    0     0     0     0     0     0     1    0     0     0     0     0     1    0     ...

which generate the cumsum() here:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...  

Notice that we have now mapped each row to its "group number". That is cumsum(target).

Now for row_number()[target]! In effect, row_number() simply "indexes" each position (row)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1             2      ...                 8             9      ...           13         ...

in the data column (or any other column):

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
  "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...

Subscripting these indices by target

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  TRUE           FALSE ...                  TRUE          FALSE ...           TRUE       ...

selects only those positions that hold house IDs:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1                                         8                                 13         ...

If we take this result of row_number()[target]

# House ID: 1st 2nd 3rd ...
# Position:
            1   8   13  ... 

and subscript it by cumsum(target)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...

we map each row to the position (in data) of its house ID:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     8    8     8     8     8     8     13   13    ...

This is the result of row_number()[target][cumsum(target)].

Finally, when we subscript data by these (repeated) positions of the house IDs, we get the House_id column

# |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
  "H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...

where every value in data is mapped to the house ID of the group it belongs to.

Thanks to this House_id column

House_id = data[row_number()[target][cumsum(target)]]

alongside the data column, we can map (left_join()) the ids in df to their corresponding textual data.
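The whole trick can be reproduced in a few lines of base R; dplyr's row_number() reduces to seq_along() here (toy vectors, not the real file):

```r
data   <- c("H18105265_0", "R1_0", "Mab_x", "H18117631_0", "Maa_y")
target <- grepl("^H[0-9]+_0$", data)  # TRUE FALSE FALSE TRUE FALSE
# positions of the house-ID rows: 1 and 4
pos <- seq_along(data)[target]
# cumsum(target) gives the group number of each row (1 1 1 2 2),
# so pos[cumsum(target)] repeats each house-ID position over its group
House_id <- data[pos[cumsum(target)]]
House_id
# [1] "H18105265_0" "H18105265_0" "H18105265_0" "H18117631_0" "H18117631_0"
```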
