(1) GenBank file format
GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on the GenBank format can be found here
When do we use the GenBank format?
The GenBank format can represent a variety of information while keeping it human-readable. It is not suitable for data analysis.
(2) FASTA format
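In bioinformatics, the FASTA format is a text-based format for recording nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows a name and comments to precede each sequence. It was originally defined by the FASTA software package but is now a standard in bioinformatics.
The simplicity of the FASTA format makes sequences easy to manipulate and analyze with text-processing tools and scripting languages such as Python, Ruby, and Perl.
FASTA stores the sequence itself, nucleotide or protein, as plain text. It does not contain sequence quality information.
Reference: Wikipedia, FASTA format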
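As a small illustration of how easily FASTA can be handled with text tools, here is a minimal sketch that prints the name and length of each record. The function name fasta_lengths and the two demo records are made up for this example; it assumes header lines start with ">" and that a sequence may wrap over several lines.

```shell
# Print "name<TAB>length" for every record in a FASTA file (or stdin).
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous record, then reset.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # ID is the first word after ">"
            len = 0
            next
        }
        { len += length($0) }      # sequence lines may be wrapped
        END { if (name != "") print name "\t" len }
    ' "$@"
}

# Hypothetical two-record input, one nucleotide and one peptide
printf '>seq1 demo record\nACGTACGT\nACG\n>seq2\nMKV\n' | fasta_lengths
```

Called with a file name instead of a pipe (for example fasta_lengths chr22.fa), it reads the file directly.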
(3) FASTQ file format
FASTQ extends the FASTA format with a per-base sequencing quality score (Phred score), encoded as one ASCII character per base.
Please refer to the following references:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Further reading:
Differences between FASTA, FASTQ and SAM formats
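The character range listed above is how quality scores are written in a FASTQ file: each character stands for one Phred score. A minimal sketch of the decoding, assuming the common Phred+33 encoding (score = ASCII code minus 33) and unwrapped four-line records; phred33_scores is a hypothetical helper name:

```shell
# Decode the quality line (every 4th line) of a FASTQ stream into numbers.
phred33_scores() {
    awk -v offset=33 '
        BEGIN {
            # Build a character -> ASCII code lookup for printable characters.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 0 {
            out = ""
            for (j = 1; j <= length($0); j++) {
                q = ord[substr($0, j, 1)] - offset
                out = out (j > 1 ? " " : "") q
            }
            print out
        }
    ' "$@"
}

# Hypothetical single read with qualities "!", "5", "?", "I"
printf '@read1\nACGT\n+\n!5?I\n' | phred33_scores
```

For the demo read this prints 0 20 30 40. Older Illumina data that uses an offset of 64 would need offset=64 instead.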
I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is very fast. It's best to keep the Aspera Connect software up to date.
Install Aspera connect on Ubuntu Linux
mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# construct soft link
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version
If you have an older version installed, uninstall it before installing the newer version of Aspera. In practice, this means deleting the related files in the following locations:
# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
yes|rm -rf ~/.aspera/connect
According to the SRA group, the recommended tool is the prefetch program provided in the SRA Toolkit. More details can be found in the Download Guide.
1. Download SRA files by using prefetch
I don't recommend installing the SRA Toolkit with sudo apt-get install sra-toolkit, because the packaged version might be outdated. I personally prefer to install the latest release.
SRA files are saved to the default folder ~/ncbi/public/sra.
# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit's bin directory to PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
prefetch can download SRA files in several ways; the default is Aspera. If you want prefetch to use only Aspera, you can use the following commands.
mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# manually generate SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >>GSE48240.txt;done
# Or use esearch/efetch to generate the SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' |grep SRR >> GSE48240.txt
# Download every accession in the list through Aspera
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
Alternatively, you can use curl, wget, or ftp to download from the generated download links, but these will be much slower.
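Because the accession list is appended to by both the manual loop and the esearch query, it can end up with duplicate or stray lines. A minimal sketch for tidying it before downloading, assuming run accessions carry the SRR, ERR, or DRR prefix; clean_accessions is a hypothetical helper name, not part of the SRA Toolkit:

```shell
# Keep only lines that look like run accessions and drop duplicates.
clean_accessions() {
    grep -E '^[SED]RR[0-9]+$' "$@" | sort -u
}

# Hypothetical list with a header line, a blank line, and a duplicate
printf 'Run\nSRR922221\nSRR922222\n\nSRR922221\n' | clean_accessions
```

With a real list, something like clean_accessions GSE48240.txt > GSE48240.clean.txt would produce a file that can be passed to prefetch in place of the original.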
2. Convert SRA files to FASTQ files on the fly
This is the better approach if you don't have much disk space for the SRA files: fastq-dump converts SRA accessions to FASTQ files on the fly.
# Run fastq-dump once per accession listed in GSE48240.txt
xargs -n 1 fastq-dump --split-files < GSE48240.txt
names(leadership)
names(leadership)[2] <- "testDate"
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
# install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin
# Download and unzip the file on the fly.
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa
# Look at the file
cat chr22.fa | head -4
# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N | wc -l
# Count how many bases are in chr22
cat chr22.fa | bioawk -c fastx '{ print length($seq) }'
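If bioawk is not installed, the same two counts can be approximated with plain awk. This is a minimal sketch, not a replacement for bioawk: n_content is a hypothetical helper name, it skips header lines, and unlike the grep above it counts both "N" and lowercase "n".

```shell
# Print "total_bases<TAB>n_count" for a FASTA file (or stdin).
n_content() {
    awk '
        !/^>/ {
            total += length($0)          # all sequence characters on this line
            n += gsub(/[Nn]/, "")        # gsub returns the number of matches
        }
        END { print (total + 0) "\t" (n + 0) }
    ' "$@"
}

# Hypothetical tiny input: 6 bases, 3 of them N/n
printf '>chrDemo\nACNN\nnT\n' | n_content
```

On a real file this would be run as n_content chr22.fa.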
-------------------------------------------------------------------------------------------------------
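Correspondence between genome builds
First, how the NCBI builds map to UCSC names and Ensembl releases:
NCBI36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.
As you can see, Ensembl release numbering is especially complicated and easy to mix up.
UCSC versions are much simpler: just hg18, hg19, and hg38. hg19 is the most commonly used, but I recommend moving to hg38.
NCBI also looks simple, with just builds 36, 37, and 38, but there is more going on underneath: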
Feb 13 2014 00:00 Directory April_14_2003
Apr 06 2006 00:00 Directory BUILD.33
Apr 06 2006 00:00 Directory BUILD.34.1
Apr 06 2006 00:00 Directory BUILD.34.2
Apr 06 2006 00:00 Directory BUILD.34.3
Apr 06 2006 00:00 Directory BUILD.35.1
Aug 03 2009 00:00 Directory BUILD.36.1
Aug 03 2009 00:00 Directory BUILD.36.2
Sep 04 2012 00:00 Directory BUILD.36.3
Jun 30 2011 00:00 Directory BUILD.37.1
Sep 07 2011 00:00 Directory BUILD.37.2
Dec 12 2012 00:00 Directory BUILD.37.3
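As you can see, there are 37.1, 37.2, 37.3, and so on. These point releases usually mean the annotation was updated; the genome sequence itself usually does not change.
Just remember that the hg19 genome is about 3 GB, or roughly 800 to 900 MB compressed.
If you are downloading a GTF annotation file, the genome version is especially important.
For NCBI: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ## latest version (hg38)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## other versions
For Ensembl: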
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Change the release number in the path to get any other version: ftp://ftp.ensembl.org/pub/
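Swapping the release number can be scripted. The sketch below simply fills the release and assembly name into the pattern of the one URL shown above. Whether every release follows exactly this file naming is an assumption, so check the directory listing for the release you need. ensembl_gtf_url is a hypothetical helper name.

```shell
# Build an Ensembl human GTF URL from a release number and assembly name.
# The assembly must match the release (see the build/release list above).
ensembl_gtf_url() {
    release=$1
    assembly=$2
    printf 'ftp://ftp.ensembl.org/pub/release-%s/gtf/homo_sapiens/Homo_sapiens.%s.%s.gtf.gz\n' \
        "$release" "$assembly" "$release"
}

ensembl_gtf_url 75 GRCh37
```

The printed URL can then be handed to wget, for example wget "$(ensembl_gtf_url 75 GRCh37)".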
For UCSC, it is a bit more involved:
You need to select a series of parameters:
http://genome.ucsc.edu/cgi-bin/hgTables
1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser
3. Click 'get output'.
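Now for the key point: with the version relationships sorted out, it is time to download.
Downloading from UCSC is very convenient; you only need to build the URL from the genome's short name: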
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz
Or use a shell script to download specific chromosomes:
for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz; ## NCBI can also be used here: ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/ (chr prefix)
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
## remove the per-chromosome files once they are merged
rm -fr chr*.fa
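The URL pattern used in the loop can be wrapped in a small function so the list of downloads can be checked before anything is fetched. A minimal sketch, with ucsc_chrom_urls as a hypothetical helper name and the URL layout taken from the loop above:

```shell
# Print one UCSC per-chromosome FASTA URL for each chromosome name given.
# Usage: ucsc_chrom_urls <genome> <chrom> [<chrom> ...]
ucsc_chrom_urls() {
    genome=$1
    shift
    for chrom in "$@"; do
        printf 'http://hgdownload.cse.ucsc.edu/goldenPath/%s/chromosomes/chr%s.fa.gz\n' \
            "$genome" "$chrom"
    done
}

# Preview the URLs for three chromosomes of hg19
ucsc_chrom_urls hg19 21 22 X
```

Once the list looks right, it can be piped into wget, which reads URLs from standard input with -i -, for example ucsc_chrom_urls hg19 $(seq 1 22) X Y M | wget -i -.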
---------------------------------------------------------------------------------------------------------
Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
  -h, --help                    Displays ALL options, general usage, and version information.
  -V, --version                 Display the version of the program.
Data formatting:
      --split-files             Dump each read into separate file. Files will receive suffix corresponding to read number.
      --split-spot              Split spots into individual reads.
      --fasta <[line width]>    FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
  -I, --readids                 Append read id after spot id as 'accession.spot.readid' on defline.
  -F, --origfmt                 Defline contains only original sequence name.
  -C, --dumpcs <[cskey]>        Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
  -B, --dumpbase                Formats sequence using base space (default for other than SOLiD).
  -Q, --offset <integer>        Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
  -N, --minSpotId <rowid>       Minimum spot id to be dumped. Use with "X" to dump a range.
  -X, --maxSpotId <rowid>       Maximum spot id to be dumped. Use with "N" to dump a range.
  -M, --minReadLen <len>        Filter by sequence length >= <len>
      --skip-technical          Dump only biological reads.
      --aligned                 Dump only aligned sequences. Aligned datasets only; see sra-stat.
      --unaligned               Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
  -O, --outdir <path>           Output directory, default is current working directory ('.').
  -Z, --stdout                  Output to stdout, all split data become joined into single stream.
      --gzip                    Compress output using gzip.
      --bzip2                   Compress output using bzip2.
Use examples:
fastq-dump -X 5 -Z SRR390728
Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.
fastq-dump -I --split-files SRR390728
Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.
fastq-dump --split-files --fasta 60 SRR390728
Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).
fastq-dump --split-files --aligned -Q 64 SRR390728
Produces two fastq files (--split-files) that contain only aligned reads (--aligned; Note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64).
Please see the documentation on vdb-dump if you wish to produce fasta/qual data.
Possible errors and their solutions:
fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table
This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.
fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra
The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here
--------------------------------------------------------------------------------------------------------
Download workflow:
1: wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000001/SRR000001/SRR000001.sra
Download the SRA data file from NCBI.
2:
Use the fastq-dump tool to convert the SRA file into paired-end FASTQ files.