Hard-masking和Soft-masking结果相互转化
在串联重复序列注释 这篇笔记里记录了如何用TRF软件进行TR预测,这款软件可以使用-m参数得到屏蔽后的序列,当时没写如何把Hard masking结果转换成Soft masking,这里就补个档。这两种屏蔽方式的结果文件是可以相互转换的。
1. Hardmasking转Softmasking 因为每个人的基因组文件可能各不相同,有的序列大小写混杂,有的60个核苷酸或者80个核苷酸换一行,为了方便起见,首先将基因组文件以及hard masking之后的序列文件转换成以下fasta格式:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 def transform (input_file, output_file ): sequence = [] with open (input_file, 'r' ) as input_f: for line in input_f: if line.startswith('>' ): if sequence: sequence = "" .join(sequence).replace('\n' , '' ).upper() with open (output_file, 'a' ) as new: new.write(sequence + '\n' + line) sequence = [] else : with open (output_file, 'a' ) as new: new.write(line) else : sequence.append(line) sequence = "" .join(sequence).replace('\n' , '' ) with open (output_file, 'a' ) as new: new.write(sequence) transform('genome.fa' , 'standard_genome.fa' )
NCI官网有fasta标准格式要求,这个要求还是相当宽泛的,但是不方便我们做转换,所以还是有必要统一一下。
FASTA Format for Nucleotide Sequences (nih.gov)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 def hard2soft (input_file, output_file, reference_file ): reference_sequences = {} with open (reference_file, 'r' ) as ref: sequence_name = '' sequence = '' for line in ref: line = line.strip() if line.startswith('>' ): if sequence_name: reference_sequences[sequence_name] = sequence sequence_name = line[1 :] sequence = '' else : sequence += line if sequence_name: reference_sequences[sequence_name] = sequence with open (input_file, 'r' ) as input_f, open (output_file, 'w' ) as output_f: sequence_name = '' for line in input_f: line = line.strip() if line.startswith('>' ): sequence_name = line[1 :] output_f.write(line + '\n' ) else : sequence = line if sequence_name in reference_sequences: reference_sequence = reference_sequences[sequence_name] replaced_sequence = '' for i in range (len (sequence)): if sequence[i] == 'N' : if reference_sequence[i] == 'N' : replaced_sequence += 'N' else : replaced_sequence += reference_sequence[i].lower() else : replaced_sequence += sequence[i] output_f.write(replaced_sequence + '\n' ) else : output_f.write(sequence + '\n' ) hard2soft('hardmask.fa' , 'softmask.fa' , 'standard_genome.fa' )
需要注意,如果基因组中原本就有填补gap产生的N,就不需要转换,直接用N表示。
2. Softmasking转Hardmasking Softmasking转Hardmasking的情况就要简单多了,直接搜序列中的小写字母再用N替代即可:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 def soft2hard (input_file, output_file ): with open (input_file, 'r' ) as input_f, open (output_file, 'w' ) as output_f: for line in input_f: line = line.strip() if line.startswith('>' ): output_f.write(line + '\n' ) else : sequence = line replaced_sequence = '' for i in range (len (sequence)): if sequence[i].islower(): replaced_sequence += 'N' else : replaced_sequence += sequence[i] output_f.write(replaced_sequence + '\n' ) soft2hard('softmask.txt' , 'hardmask.txt' )
不管用哪种方式屏蔽重复序列,都是为了下游分析服务的,没必要死磕用什么软件才是最优解……能做出符合自己预期的结果就行O(∩_∩)O
顺便一提,UCSC Genome Browser Group提供的生物分析套件和工具的源码,其中也有用TRF获得softmasking结果以及后续分析等一系列的流程,感兴趣可以看github仓库上的perl源码:
kent/src/hg/utils/automation/doSimpleRepeat.pl at 307976d1f4c1ecbc73a55dea1f6348c19c1336b8 · ucscGenomeBrowser/kent (github.com)