
How do I convert a tab-separated values (TSV) file to a comma-separated values (CSV) file in BASH?

2021-10-27 17:03:43  Views: 242  Source: Internet

Tags: separated csv sed awk file tsv CSV



I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this, but am worried it will make some mistakes:

sed 's/\t/,/g' file.tsv > file.csv
  • Quotes needn't be added.

How can I convert a TSV to a CSV?

Solution

Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.


To summarize the options (interestingly, they all perform about the same):

tr:

devnull's solution (provided in a comment on the question) is the simplest:

tr '\t' ',' < file.tsv > file.csv

sed:

The OP's own sed solution is perfectly fine, given that the input contains no quoted strings (with potentially embedded \t chars.):

sed 's/\t/,/g' file.tsv > file.csv

The only caveat is that on some platforms (e.g., macOS, which ships BSD sed) the escape sequence \t is not supported in sed scripts, so a literal tab character must be spliced into the command string using ANSI C quoting ($'\t'), a feature of bash, ksh, and zsh:

sed 's/'$'\t''/,/g' file.tsv > file.csv
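
Where ANSI C quoting is unavailable (a plain POSIX sh, for instance), one possible alternative is to capture the tab in a variable with printf. This is only a sketch: the names tab and tsv_to_csv_sed are illustrative and not part of the original answer, and the same "no commas or quotes in the input" assumption applies.

```shell
# Capture a literal tab portably; command substitution only strips
# trailing newlines, so the tab character survives.
tab=$(printf '\t')

# Hypothetical helper: replace every tab with a comma.
# Assumes no field contains a comma or a double quote.
tsv_to_csv_sed() {
  sed "s/${tab}/,/g" "$@"
}

printf 'a\tb\tc\n' | tsv_to_csv_sed   # prints: a,b,c
```

Usage with files would then be: tsv_to_csv_sed file.tsv > file.csv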

awk:

The caveat with awk is that FS, the input field separator, must be set to \t explicitly. With the default FS, awk splits on any run of whitespace, so it would strip leading and trailing tabs, collapse interior runs of multiple tabs into a single , and also split fields that contain spaces:

awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv

Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS - the output field separator; this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
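
To make the FS caveat concrete, here is a small comparison on a row whose second field is empty. The helper names csv_default_fs and csv_tab_fs are made up for this illustration.

```shell
# Hypothetical helpers: same program, with and without an explicit FS.
csv_default_fs() { awk 'BEGIN { OFS="," } { $1=$1; print }' "$@"; }
csv_tab_fs()     { awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }' "$@"; }

# The sample row has an empty second field (two consecutive tabs).
printf 'a\t\tc\n' | csv_default_fs   # empty field is lost: a,c
printf 'a\t\tc\n' | csv_tab_fs       # empty field is kept: a,,c
```

With the default separator the row silently loses a column, which is why setting FS="\t" matters for TSV input.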


Robust awk solution:

As A. Rabus points out, the above solutions do not handle unquoted input fields that themselves contain , characters correctly - you'll end up with extra CSV fields.
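
A quick way to see the problem, sketched with made-up sample data: the row below has two TSV fields, but after a naive tab-to-comma replacement a simple comma-based reader counts three. The helper name count_csv_fields is hypothetical.

```shell
# A row with two TSV fields; the first one contains a comma.
row=$(printf 'Doe, Jane\t42')

# Naive conversion: every tab becomes a comma.
converted=$(printf '%s\n' "$row" | tr '\t' ',')
printf '%s\n' "$converted"            # prints: Doe, Jane,42

# Count fields the way a reader that splits on every comma would.
count_csv_fields() { awk -F, '{ print NF }'; }
printf '%s\n' "$converted" | count_csv_fields   # prints: 3
```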

The following awk solution fixes this, by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).

If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. Thanks, Wyatt Israel.

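awk 'BEGIN { FS="\t"; OFS="," } {
  rebuilt=0                                  # set once a field assignment rebuilds the record
  for(i=1; i<=NF; ++i) {
    # Quote fields that contain , or " and are not already double-quoted.
    if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
      gsub("\"", "\"\"", $i)                 # escape embedded " by doubling
      $i = "\"" $i "\""                      # enclose the field in double quotes
      rebuilt=1
    }
  }
  if (!rebuilt) { $1=$1 }                    # otherwise force a rebuild with OFS
  print
}' file.tsv > file.csv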
  • $i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes

  • gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them

  • $i = "\"" $i "\"" updates the field by enclosing its (escaped) value in double quotes

  • As stated before, updating any field causes awk to rebuild the line from the fields using the OFS value (, in this case), which is what performs the actual TSV -> CSV conversion. The flag rebuilt records whether any field was modified; if none was, $1=$1 forces the rebuild, so every input record is rebuilt at least once.
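
As a sanity check of the approach described in the list above, the same logic can be wrapped in a function and run on a few sample rows. The function name tsv_to_csv and the sample data are assumptions made for this sketch; the awk program follows the /[,"]/ test from the explanation.

```shell
# Hypothetical wrapper around the quoting logic described above.
tsv_to_csv() {
  awk 'BEGIN { FS="\t"; OFS="," } {
    rebuilt=0
    for (i=1; i<=NF; ++i) {
      # Field contains , or " and is not already enclosed in double quotes.
      if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)     # double any embedded quotes
        $i = "\"" $i "\""          # wrap the field in quotes
        rebuilt=1
      }
    }
    if (!rebuilt) { $1=$1 }        # rebuild the record with OFS anyway
    print
  }' "$@"
}

printf 'id\tname\tnote\n'            | tsv_to_csv   # id,name,note
printf '1\tDoe, Jane\tsaid "hi"\n'   | tsv_to_csv   # 1,"Doe, Jane","said ""hi"""
printf '"x,y"\tz\n'                  | tsv_to_csv   # "x,y",z
```

Fields that are already wrapped in double quotes are passed through untouched, so input that mixes quoted and unquoted fields is not quoted twice.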

Source: https://www.cnblogs.com/exmyth/p/15471565.html
