首页 > 其他分享> 文章详细

snakemake教程-03额外特征

2021-07-04 22:31:03 阅读：533 来源： 互联网

标签：03 教程 -- sample reads snakemake input bam

文章目录

- 额外的特征

参考： https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html

额外的特征

接下来，我们将介绍更多上面的案例没有包括的特征。更多细节和更多特征，可以参考 Writing Workflows, Frequently Asked Questions或者使用命令行帮助(snakemake --help)。

1. Benchmarking（基准测试）

使用benchmark指令，Snakemake可以度量一项工作的运行时间。我们可以在bwa_map中激活benchmarking:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

可以多次重复基准测试，了解测量量的可变性。这可以通过另一个注释benchmark文件来实现，例如：repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)。Snakemake会被告知执行三次。重复度量可以作为tab分割的基准文件的后续行。

运行代码为：

##删掉保护的bam文件
rm -rf /share/inspurStorage/home1/jialh/tools/snakemake/snakemake-tutorial/sorted_reads

##重新运行整个流程
snakemake --forceall --cores 4

2. Modularization（模块化）

为了重用构建的模块或者简化结构化的大型流程，有时候需要合理地split a workflow into modules。对此，Snakemake提供了include指令来将另一个Snakefile包含进当前的Snakefile，例如：

include: "path/to/other.snakefile"

另外，Snakemake还允许定义子流程（define sub-workflows，详情可参考https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#snakefiles-sub-workflows）。基本示例如下：

subworkflow otherworkflow:
    workdir:
        "../path/to/otherworkflow"
    snakefile:
        "../path/to/otherworkflow/Snakefile"
    configfile:
        "path/to/custom_configfile.yaml"

rule a:
    input:
        otherworkflow("test.txt")
    output: ...
    shell:  ...

练习

将读段比对相关的rules作为分开的Snakefile, 然后用include指令，使它们能够在我们的示例流程中工作。

mapping_snakefile文件中写入：

##snakemake --cores 4 -s Snakefile
##snakemake --cores 4 mapped_reads/B.bam
##snakemake --cores 4 mapped_reads/C.bam
def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

主snakefile为：

include: "/home1/jialh/tools/snakemake/snakemake-tutorial/mapping_Snakefile" ##注意需要使用引号
#SAMPLES = ["A", "B"]
configfile: "config.yaml"
rule all:
    input:
        "plots/quals.svg"

##snakemake --cores 4 sorted_reads/B.bam
##snakemake --cores 4 sorted_reads/{A,B,C}.bam
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

##snakemake --cores 4 sorted_reads/{A,B}.bam.bai
##snakemake --dag sorted_reads/{A,B}.bam | dot -Tsvg > dag_test.svg
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

##snakemake --cores 4  calls/all.vcf
print(config["pmr"])
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        pmr=config["pmr"]
    log:
        "logs/bcftools_call/all_vcf.log"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.pmr} - > {output}"

##snakemake --dag calls/all.vcf | dot -Tsvg > dag_vcf.svg

##使用自己写的脚本。
##snakemake --cores 4  plots/quals.svg
rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"

3.Automatic deployment of software dependencies（自动部署软件依赖项）

为了完全重复数据分析，能够执行每个步骤和指定所有使用的参数是不够的。使用的软件工具和库也需要写清楚。在本教程中，你已经看到了Conda如何为整个流程指定独立的软件环境。使用Snakefile，你能够更进一步，指定每个rule的Conda环境。这种方式，你可以使用冲突的软件版本（例如，结合Python2和Python3）。

在我们的案例中，除了外部环境，我们可以指定每条rule的环境，例如：

rule samtools_index:
  input:
      "sorted_reads/{sample}.bam"
  output:
      "sorted_reads/{sample}.bam.bai"
  conda:
      "envs/samtools.yaml"
  shell:
      "samtools index {input}"

其中，envs/samtools.yaml可以被定义为：

channels:
  - default
  - bioconda
  - conda-forge
dependencies:
  - samtools =1.9

Snakemake可以被执行为：

snakemake --use-conda --cores 4 --forceall

如果自动创建需要的环境和然后在执行工作前激活它们。最好是指定该环境需要的packages的最低和最高版本。上述运行结果如下：

会尝试下载和激活所需的环境。

4. Tool wrappers（工具包装）

为了简化流行工具的使用，Snakemake提供了包装资源库（ Snakemake wrapper repository）。A wrapper是一段短的脚本（通常包装了命令行应用，使之在Snakemake内部能够直接处理。对此，Snakemake提供了wrapper指令来代替shell, script或者run。例如bwa_map可以被写成如下形式：

rule bwa_mem:
  input:
      ref="data/genome.fa",
      sample=lambda wildcards: config["samples"][wildcards.sample]
  output:
      temp("mapped_reads/{sample}.bam")
  log:
      "logs/bwa_mem/{sample}.log"
  params:
      "-R '@RG\tID:{sample}\tSM:{sample}'"
  threads: 8
  wrapper:
      "0.15.3/bio/bwa/mem"

wrapper指令需要（部分）URL直接指向存储库中的包装器。这些可以在对应的数据库（database）中看到。在调用时，Snakemake将自动下载请求的包装器版本。进一步，结合--use-conda，所需的软件将自动在执行之前下载。

[注意：按照实际操作看，–use-conda会导致程序执行很慢，而且可能导致奇怪的错误，慎用。]

5. Cluster execution

默认情形下，Snakemake在本地机器上执行工作。然而，其也可以在分布式环境下执行工作，例如：compute cluster or batch system。如果节点共享公共的文件系统，Snakemake支持三种可选的执行模式。

在集群环境中，计算工作通常通过类似qsub的命令提交给shell脚本。Snakemake提供了一种generic mode在clusters上执行。通过调用Snakemake：

snakemake --cluster qsub --jobs 100

此部分暂时用不到，略过。

6. Using –cluster-status

可用于检测，是否一个cluster job被完全执行成功了。

7.Constraining wildcards

Snakemake使用正则表达式来匹配输入文件和输出文件，进而决定jobs之间的依赖关系。有时候，限制一个通配符的值非常有用。这可以通过增加描述允许通配符的值的集合的正则表达式来实现。例如，通配符sample在输出文件"sorted_reads/{sample}.bam"可以限制其只允许字母数字式样的sample name为"sorted_reads/{sample,[A-Za-z0-9]+}.bam"。每个规则都可以定义限制，或者在全局使用wildcard_constrains关键字，正如Wildcards所展示的。这种机制能够解决两种类型的混淆。

其可以避免歧义规则，例如两个或者多个规则被用于产生相同的输出文件。其他处理混淆规则的方法会在 Handling Ambiguous Rules中描述。
其有助于基于匹配指导正则表达式，使得通配符被分配到文件名的正确部分。考虑输出文件{sample}.{group}.txt，假设目标文件是A.1.normal.txt。不清楚到底是dataset="A.1"和group="normal"，还是dataset="A"和group="1.normal"是正确的赋值。此处，通过{sample,[A-Z]+}.{group}限制数据集通配符可以解决此问题。

当解决混淆规则时，最好的办法是首先尝试通过合适的文件结构解决混淆，例如，通过将不同步骤的输出文件划分到不同的目录。

标签：03,教程,--,sample,reads,snakemake,input,bam
来源： https://blog.csdn.net/qq_36333576/article/details/118468768

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9