shell-script – Creating a frequency table of unique names retrieved from multiple .csv files

2019-08-15 03:55:46 Views: 168 Source: Internet

Tags: python text-processing shell-script data csv-simple

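I have 32 CSV files containing information extracted from a database. I need to build a frequency table in TSV/CSV format in which each row is named after a file and each column is named after a unique name found anywhere across the files. The table then needs to be filled with the frequency count of each name in each file. The main problem is that not all files contain the same extracted names.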

.csv input:

$cat file_1

name_of_sequence,C cc,'other_information'
name_of_sequence,C cc,'other_information'
name_of_sequence,C cc,'other_information'
name_of_sequence,D dd,'other_information'
...

$cat file_2

name_of_sequence,B bb,'other_information'
name_of_sequence,C cc,'other_information'
name_of_sequence,C cc,'other_information'
name_of_sequence,C cc,'other_information'
...

$cat file_3

name_of_sequence,A aa,'other_information'
name_of_sequence,A aa,'other_information'
name_of_sequence,A aa,'other_information'
name_of_sequence,A aa,'other_information'
...

Desired `.csv/.tsv` output:

taxa,A aa,B bb,C cc,D dd    
File_1,0,0,3,1    
File_2,0,1,3,0    
File_3,4,0,0,0

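Using bash, I know how to cut the second column, sort and uniq the names, and then get the count of each name in each file. What I don't know is how to build a table that shows all the names and their counts and sets "0" when a name is absent from a file. I usually use Bash to sort data, but a Python script would work just as well.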
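
The missing step described above (take the union of all names, then fill in 0 where a file lacks a name) can be sketched on in-memory data. This is only an illustration: the `samples` dictionary is a hypothetical stand-in holding the second-column values from the sample input, and `collections.Counter` is used here as one convenient way to get a 0 for absent names.

```python
from collections import Counter

# Hypothetical stand-in: second-column values per file, copied from the
# sample input shown in the question.
samples = {
    "file_1": ["C cc", "C cc", "C cc", "D dd"],
    "file_2": ["B bb", "C cc", "C cc", "C cc"],
    "file_3": ["A aa", "A aa", "A aa", "A aa"],
}

# One Counter per file: roughly what `sort | uniq -c` gives for one file.
counts = {fname: Counter(values) for fname, values in samples.items()}

# The column headers are the union of every name seen in any file.
all_names = sorted(set().union(*counts.values()))

# A Counter returns 0 for a key it has never seen, which fills the gaps.
table = [["taxa"] + all_names]
for fname in sorted(counts):
    table.append([fname] + [counts[fname][name] for name in all_names])

for row in table:
    print(",".join(str(cell) for cell in row))
```

With the sample values this prints the header row followed by one row per file, with 0 wherever a name does not occur in that file.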

Solution:

The following works with Python 2 and 3. Save it as xyz.py and run it with
python xyz.py file_1 file_2 file_3:

import sys
import csv

names = set()  # to keep track of all sequence names

files = {}  # map of file_name to dict of sequence_names mapped to counts
# counting
for file_name in sys.argv[1:]:
    # lookup the file_name create a new dict if not in the files dict
    b = files.setdefault(file_name, {})    
    with open(file_name) as fp:
        for line in fp:
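            x = line.strip().split(',')  # split the line on commas
            if len(x) < 2:
                continue  # skip blank or incomplete lines
            names.add(x[1])  # might be a new sequence name
            # Retrieve the counter for this sequence name, creating it if it
            # is not there yet. A plain integer would not work here: "i += 1"
            # only rebinds a local name, so the result would have to be
            # assigned back to b[x[1]]. The one-element list "[0]" is mutable,
            # so it can be incremented in place.
            b.setdefault(x[1], [0])[0] += 1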

# output
names = sorted(list(names))  # sort the unique sequence names for the columns
grid = []
# create top line
top_line = ['taxa']
grid.append(top_line)
for name in names:
    top_line.append(name)
# append each files values to the grid
for file_name in sys.argv[1:]:
    data = files[file_name]
    line = [file_name]
    grid.append(line)
    for name in names:
        line.append(data.get(name, [0])[0])  # 0 if sequence name not in file
# dump the grid to CSV
with open('out.csv', 'w') as fp:
    writer = csv.writer(fp)
    writer.writerows(grid)

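Using a one-element list, [0], as the counter makes updating the value easier than using an integer directly. If the input files are more complex, it is better to read them with Python's CSV library.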
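
As a rough sketch of that last suggestion, the counting step could read rows through `csv.reader`, which also copes with quoted fields that contain commas. This is not the answer's original code: the helper name `count_names`, the sample rows, and the use of `collections.Counter` in place of the `[0]` list trick are all assumptions made for illustration, and the in-memory `io.StringIO` input assumes Python 3.

```python
import csv
import io
from collections import Counter


def count_names(rows, column=1):
    """Count the values found in one column of already-parsed CSV rows."""
    counter = Counter()
    for row in rows:
        if len(row) > column:  # skip blank or short lines
            counter[row[column]] += 1
    return counter


# Hypothetical in-memory stand-in for one input file. The second row has a
# quoted name containing a comma, which a plain split(',') would break apart.
sample = io.StringIO(
    "seq1,C cc,'info'\n"
    'seq2,"D dd, variant",\'info\'\n'
    "seq3,C cc,'info'\n"
    "\n"
)

counts = count_names(csv.reader(sample))
print(counts["C cc"], counts["D dd, variant"], counts["A aa"])
```

For real files, the same helper could be called as `count_names(csv.reader(fp))` inside the `with open(file_name) as fp:` block. Because a `Counter` returns 0 for names it has not seen, the output loop would no longer need a default value.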

Source: https://codeday.me/bug/20190815/1660491.html
