fastq相关

时间 2019-12-07

标签 fastq 相关繁體版

原文原文链接

Sequence data formats

1. Common sequence data formats including GenBank, FASTA, FASTQ formats. GenBank and FASTA format often represent curated sequencing information. FASTQ often represent experimentally obtained data.

(1) GenBank file formathtml

GenBank is part of the International Nucleotide Sequence Database Collaboration , which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on GenBank format can be found herelinux

When do we use the GenBank format?ios

GenBank format can represent variety of information while keeping this information human-readable. It is not suitable for data-analysis.git

(2) FASTA formatgithub

在生物信息学中，FASTA格式是一种用于记录核酸序列或肽序列的文本格式，其中的核酸或氨基酸均以单个字母编码呈现。该格式同时还容许在序列以前定义名称和编写注释。这一格式最初由FASTA软件包定义，但现今已经是生物信息学领域的一项标准。
FASTA简明的格式下降了序列操纵和分析的难度，令序列可被文本处理工具和诸如Python、Ruby和Perl等脚本语言处理。
FASTA is a DNA sequence format for specifying or representing DNA sequences. It does not contain sequence quality information.
Reference: Wikipedia FASTA格式web

(3) FASTQ file formatshell

FASTQ is extended FASTA file format with sequencing quality score (phred score).
Please refer to the following references:数据库

fasta与fastq格式文件解读
Wikipedia FASTQ格式 (Simplified Chinese) or FASTQ format (English)
FASTQ文件中，一个序列一般由四行组成：
第一行以@开头，以后为序列的标识符以及描述信息（与FASTA格式的描述行相似）
第二行为序列信息
第三行以+开头，以后能够再次加上序列的标识及描述信息（可选）
第四行为质量得分信息，与第二行的序列相对应，长度必须与第二行相同
The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Further reading:
Differences between FASTA, FASTQ and SAM formatsexpress

2. Databases that contain gene sequencing data

NCBI GEO: can search datasets (sequencing data from a series of participants)
NCBI SRA: can search sequencing data from individual participant
ArrayExpress: Experiments are submitted directly to ArrayExpress or are imported from the NCBI Gene Expression Omnibus database. For high-throughput sequencing based experiments the raw data is brokered to the European Nucleotide Archive, while the experiment descriptions and processed data are archived in ArrayExpress.
European Nucleotide Archive: Learn more about how to use ENA by reading ENA: Guidelines and Tips.

I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is super fast. It's best to keep Aspera connect software up-to-date.ubuntu

Install Aspera connect on Ubuntu Linux

mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# construct soft link
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version

If you have older version, you need to uninstall before you install newer version of Aspera. Actually, you need to delete related files in the following folder:

# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
yes|rm -rf ~/.aspera/connect

3. How to download SRA files from NCBI SRA database?

According to SRA group, they recommand Prefetch program provided in SRAtoolkit. More detail can be found in Download Guide.

1. Download SRA files by using prefetch

I don't recommand install SRAtoolkit by using sudo apt-get install sratoolkit because the version might be older. I personally prefer to install the latest softwares.
SRA files will be deposited in the default file folder ~/ncbi/public/sra.

# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# You
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >>
~/.bashrc 
source ~/.bashrc

Prefetch can use several different way to download SAR files, the default one is Aspera, if you want prefetch to use only Aspera to download, you can use the following code.

mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# manually generate SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >>GSE48240.txt;done
# Using efetch to generate SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' |grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt

Alternatively, you can use curl, wget or ftp to download from generated download links, but will be as slow as snail.

2. Convert SRA files to FASTQ files on the fly

This is a better way if you don't have too much space to save the SRA files. fastq-dump will covert SRA files to fastq files on the fly.

cat GSE48240.txt | xargs -n 1 echo fastq-dump --split-files $1

other

R中修改个别变量名（reshape包）使用names（）函数

names(leadership)

names(leadership)[2] <- “testDate”

names(leadership)[6:10] <-c(“item1”, “item2”, “item3”, “item4”, “item5”)

How do I remove part of a string?
https://stackoverflow.com/questions/9704213/r-remove-part-of-string
gsub
sub_str

# install bioawk
apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin

# Download and unzip the file on the fly. 
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa

# Look at the file
cat chr22.fa | head -4

# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N  | wc -l

# Count how many bases are in Chr22?
cat chr22.fa | bioawk -c fastx '{ print length($seq) }'

-------------------------------------------------------------------------------------------------------

各基因组的对应关系

首先是NCBI对应UCSC，对应ENSEMBL数据库：

GRCh36 (hg18): ENSEMBL release_52.

GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.

GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.

能够看到ENSEMBL的版本特别复杂！！！很容易搞混！

可是UCSC的版本就简单了，就hg18,19,38, 经常使用的是hg19，可是我推荐你们都转为hg38

看起来NCBI也是很简单，就GRCh36,37,38，可是里面水也很深！

Feb 13 2014 00:00    Directory April_14_2003
Apr 06 2006 00:00    Directory BUILD.33
Apr 06 2006 00:00    Directory BUILD.34.1
Apr 06 2006 00:00    Directory BUILD.34.2
Apr 06 2006 00:00    Directory BUILD.34.3
Apr 06 2006 00:00    Directory BUILD.35.1
Aug 03 2009 00:00    Directory BUILD.36.1
Aug 03 2009 00:00    Directory BUILD.36.2
Sep 04 2012 00:00    Directory BUILD.36.3
Jun 30 2011 00:00    Directory BUILD.37.1
Sep 07 2011 00:00    Directory BUILD.37.2
Dec 12 2012 00:00    Directory BUILD.37.3

能够看到，有37.1, 37.2， 37.3 等等，不过这种版本通常指的是注释在更新，基因组序列通常不会更新！！！

反正你记住hg19基因组大小是3G，压缩后八九百兆便可！！！

若是要下载GTF注释文件，基因组版本尤其重要！！！

对NCBI：ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ##最新版（hg38）

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## 其它版本

对于ensembl：

ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

变幻中间的release就能够拿到全部版本信息：ftp://ftp.ensembl.org/pub/

对于UCSC，那就有点麻烦了：

须要选择一系列参数：

http://genome.ucsc.edu/cgi-bin/hgTables

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser

3. Click 'get output'.

如今重点来了，搞清楚版本关系了，就要下载呀！

UCSC里面下载很是方便，只须要根据基因组简称来拼接url便可：

http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz

或者用shell脚本指定下载的染色体号：

for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;

## 这里也能够用NCBI的：ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/chr前缀
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -fr chr*.fasta

---------------------------------------------------------------------------------------------------------

Tool: fastq-dump

Usage:

fastq-dump [options] <path/file> [<path/file> ...]

fastq-dump [options] <accession>

Frequently Used Options:

General:
-h	\|	--help	Displays ALL options, general usage, and version information.
-V	\|	--version	Display the version of the program.
Data formatting:
		--split-files	Dump each read into separate file. Files will receive suffix corresponding to read number.
		--split-spot	Split spots into individual reads.
		--fasta <[line width]>	FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I	\|	--readids	Append read id after spot id as 'accession.spot.readid' on defline.
-F	\|	--origfmt	Defline contains only original sequence name.
-C	\|	--dumpcs <[cskey]>	Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B	\|	--dumpbase	Formats sequence using base space (default for other than SOLiD).
-Q	\|	--offset <integer>	Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
-N	\|	--minSpotId <rowid>	Minimum spot id to be dumped. Use with "X" to dump a range.
-X	\|	--maxSpotId <rowid>	Maximum spot id to be dumped. Use with "N" to dump a range.
-M	\|	--minReadLen <len>	Filter by sequence length >= <len>
		--skip-technical	Dump only biological reads.
		--aligned	Dump only aligned sequences. Aligned datasets only; see sra-stat.
		--unaligned	Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
-O	\|	--outdir <path>	Output directory, default is current working directory ('.').
-Z	\|	--stdout	Output to stdout, all split data become joined into single stream.
		--gzip	Compress output using gzip.
		--bzip2	Compress output using bzip2.

Use examples:

fastq-dump -X 5 -Z SRR390728

Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.

fastq-dump -I --split-files SRR390728

Produces two fastq files (--split-files) containing ".1" and ".2" read suffices (-I) for paired-end data.

fastq-dump --split-files --fasta 60 SRR390728

Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).

fastq-dump --split-files --aligned -Q 64 SRR390728

Produces two fastq files (--split-files) that contain only aligned reads (--aligned; Note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64) Please see the documentation on vdb-dump if you wish to produce fasta/qual data.Possible errors and their solution:

fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table

This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.

fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra

The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here

--------------------------------------------------------------------------------------------------------

下载流程：

1：wget -i ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000001/SRR000001/SRR000001.sra

从NCBI官网下载sra数据文件

2：

使用fastq-dump工具将sra转换成双端fastq