【0からのNGSデータ解析】bamファイルから特定位置のリード数を調べる

せい

こんにちは。せいです。未経験ながら、会社でNGSデータを使って遺伝子解析をしています。

注目したいSNPは決めたけど、そのSNPがNGSでどのくらいリードが読まれているのか知りたい。

今回の投稿では、bam-readcountというツールを使って、SNPでどの塩基が何リード読まれているかを調べる方法をご紹介します。

特定の染色体および染色体領域のデータに分割する
- samtoolsのインストールの方法
- samtoolsを使って染色体と領域でデータを分割する
特定のSNPのリードカウントを調べる
まとめ
- 共有:
- いいね:

特定の染色体および染色体領域のデータに分割する

リードカウントを調べる前に、まずはどの染色体（領域）に関して調べるのかを決めましょう。

全染色体をいっぺんに調べようとすると、データ量が重すぎるため時間がかかります。最悪、スペックによってはPCがフリーズする可能性があります。

samtoolsを使用することによって、全染色体から特定の染色体（領域）抽出して解析することができます。

samtoolsのインストールの方法

特定の染色体のみを切り出すために、まずはsamtoolsをインストールしましょう。

インストールの方法は、こちらから確認することができます。

samtoolsを使って染色体と領域でデータを分割する

それではインストールしたsamtoolsを使って、bamファイルを特定の染色体と染色体領域に分割していきます。

まずは、特定の染色体のみの情報を取得する方法です。

samtools sort input.bam -o sort.bam
samtools index sort.bam
samtools view -b sort.bam chr1 > output_chr1.bam

samtools sort input.bam -o sort.bam

samtools index sort.bam

samtools view -b sort.bam chr1 > output_chr1.bam

１行目でソート、２行目でインデックスを付加して、3行目で一番染色体をbamファイルから切り出しています。

染色体の特定の領域だけを切り出したい場合は、３行目を

samtools view -b sort.bam chr1:100-1000 > output_chr1.bam

1	samtools view -b sort.bam chr1:100-1000 > output_chr1.bam

とすると、一番染色体の100から1000bpの位置だけを切り出すことができます。

では、次の章で、これら切り出した特定の染色体や領域でのリードカウントを調べていきます。

特定のSNPのリードカウントを調べる

特定の染色体を抽出してきたら、今回の目的であるリード数をカウントしていきます。様々な方法でbamファイルから各染色体ポジションのリード数をカウントすることができますが、今回はbam-readcountというツールを使います。

bam-readcountのインストールの方法

それでは、実際にbamファイルでリード数を調べることができるbam-readcountをインストールしていきましょう。

インストールに必要なcmakeはMacには入っていないので、予めインストールしておきます。

brew install cmake

1	brew install cmake

bam-readcountのgithubからインストールをしていきます。

git clone https://github.com/genome/bam-readcount.git
cd bam-readcount
mkdir build 
cd build
cmake ..
make
cd bin

git clone https://github.com/genome/bam-readcount.git

cd bam-readcount

mkdir build

cd build

cmake ..

make

cd bin

インストールがされたかを確認します。

bam-readcount

1	bam-readcount

以下が表示されればインストール完了です。

Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam

Available options:
  -h [ --help ]                         produce this message
  -v [ --version ]                      output the version number
  -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
                                        for counting.
  -b [ --min-base-quality ] arg (=0)    minimum base quality at a position to
                                        use the read for counting.
  -d [ --max-count ] arg (=10000000)    max depth to avoid excessive memory
                                        usage.
  -l [ --site-list ] arg                file containing a list of regions to
                                        report readcounts within.
  -f [ --reference-fasta ] arg          reference sequence in the fasta format.
  -D [ --print-individual-mapq ] arg    report the mapping qualities as a comma
                                        separated list.
  -p [ --per-library ]                  report results by library.
  -w [ --max-warnings ] arg             maximum number of warnings of each type
                                        to emit. -1 gives an unlimited number.
  -i [ --insertion-centric ]            generate indel centric readcounts.
                                        Reads containing insertions will not be
                                        included in per-base counts

Usage: bam-readcount [OPTIONS] <bam_file> [region]

Generate metrics for bam_file at single nucleotide positions.

Example: bam-readcount -f ref.fa some.bam

Available options:

-h [ --help ] produce this message

-v [ --version ] output the version number

-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used

for counting.

-b [ --min-base-quality ] arg (=0) minimum base quality at a position to

use the read for counting.

-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory

usage.

-l [ --site-list ] arg file containing a list of regions to

report readcounts within.

-f [ --reference-fasta ] arg reference sequence in the fasta format.

-D [ --print-individual-mapq ] arg report the mapping qualities as a comma

separated list.

-p [ --per-library ] report results by library.

-w [ --max-warnings ] arg maximum number of warnings of each type

to emit. -1 gives an unlimited number.

-i [ --insertion-centric ] generate indel centric readcounts.

Reads containing insertions will not be

included in per-base counts

最後にパスを通します。

./bam-readcount
vi ~/.profile
export PATH="パス名/bam-readcount/build/bin/:$PATH"

./bam-readcount

vi ~/.profile

export PATH="パス名/bam-readcount/build/bin/:$PATH"

これでbamファイルからリード数を調べることができます。

次の章で実際にリード数を調べていきます。

bam-readcountを使ってbamファイルのリードカウントを調べる

ここまでで、全染色体配列から特定の染色体の情報だけを抽出し、リードカウントを調べるためツールbam-readcountをインストールができました。

実際にリード数を調べるためには以下のものを準備します。

・調べたい染色体のリファレンスとなるゲノム配列(.fa)

・特定の染色体のbamファイル

では早速、先ほど作った特定の染色体のリード数を調べていきましょう。下記のコードを入力します。

bam-readcount -f リファレンス.fa -w 0 input.bam > output.txt

1	bam-readcount -f リファレンス.fa -w 0 input.bam > output.txt

今回は、データの処理がしやすいようにテキストファイルとして結果を出力してます。

bam-readcountで出力されたデータの見方

左から順番に

染色体番号

染色体の位置

リファレンスゲノムの塩基

NGSで読まれたリード数

以降は、それぞれの塩基（A,C,G,T）に対する情報が並んで行きます。

base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end

と、「：」で区切られています。それぞれの意味は以下の通りです。

base → the base that all reads following in this field contain at the reported position i.e. C
count → the number of reads containing the base
avg_mapping_quality → the mean mapping quality of reads containing the base
avg_basequality → the mean base quality for these reads
avg_se_mapping_quality → mean single ended mapping quality
num_plus_strand → number of reads on the plus/forward strand
num_minus_strand → number of reads on the minus/reverse strand
avg_pos_as_fraction → average position on the read as a fraction (calculated with respect to the length after clipping). This value is normalized to the center of the read (bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0)
avg_num_mismatches_as_fraction → average number of mismatches on these reads per base
avg_sum_mismatch_qualities → average sum of the base qualities of mismatches in the reads
num_q2_containing_reads → number of reads with q2 runs at the 3’ end
avg_distance_to_q2_start_in_q2_reads → average distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_length → average clipped read length of reads
avg_distance_to_effective_3p_end → average distance to the 3’ prime end of the read (as fraction of unclipped read length)