Important Considerations |
May 4, 2020 |
Databases are
downloaded as gzip-compressed tar archive files (i.e. .tar.gz
files). When uncompressed, the combined files in any given
database will require anywhere from 1.1 to 14 times the size of
the .tar.gz archive files. For most databases, the ratio of
uncompressed to compressed size is around 2.5. The table below gives
a snapshot of sizes and decompression ratios for NCBI databases
downloaded May 4, 2020.
blastdbkit.py: LOCAL BLAST DATABASE REPORT |
Source: | localhost | ftp.ncbi.nih.gov |
Database Directory: | /home/birch/GenBank | /blast/db |
DB name | uncompressed size (Mb) | compressed size (Mb)* | decompression ratio |
nt | 149731 | 64046 | 2.34 |
refseq_rna | 43752 | 15815 | 2.77 |
human_genome | 1566 | 1192 | 1.35 |
mouse_genome | 1322 | 1025 | 1.33 |
ref_euk_rep_genomes | 199793 | 186603 | 1.07 |
ref_prok_rep_genomes | 12651 | 12368 | 1.03 |
ref_viroids_rep_genomes | 0 | 30 | 1.00 |
ref_viruses_rep_genomes | 82 | 108 | 1.05 |
patnt | 17847 | 6709 | 2.67 |
pdbnt | 14 | 31 | 14.00 |
16S_ribosomal_RNA | 14 | 36 | 2.33 |
18S_fungal_sequences | 1 | 30 | 1.00 |
28S_fungal_sequences | 3 | 31 | 3.00 |
ITS_RefSeq_Fungi | 5 | 32 | 2.50 |
ITS_eukaryote_sequences | 37 | 39 | 4.11 |
LSU_eukaryote_rRNA | 5 | 32 | 2.50 |
LSU_prokaryote_rRNA | 2 | 31 | 2.00 |
SSU_eukaryote_rRNA | 6 | 32 | 3.00 |
Betacoronavirus | 29 | 38 | 3.63 |
nr | 404350 | 98231 | 4.12 |
refseq_protein | 170800 | 51766 | 3.30 |
swissprot | 695 | 183 | 4.54 |
pdbaa | 267 | 63 | 8.09 |
landmark | 374 | 160 | 2.88 |
env_nt | 79383 | 40140 | 1.98 |
taxdb | 161 | 30 | 5.37 |
TOTAL: | 1082890 | ||
*The compressed files for each database include a copy of the taxid database (taxdb), which at this writing is about 30 Mb. Thus, although the total size of the compressed files for ITS_eukaryote_sequences is 39 Mb, 30 Mb of that is a copy of taxdb. Each time the database files are de-archived, any existing copy of taxdb is overwritten. Consequently, there will only be one copy of taxdb after each install or update, even though many copies may have been downloaded. Decompression ratios in the table are calculated after subtracting the compressed size of taxdb (e.g. 30 Mb) from the compressed size. |
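The calculation described in the footnote can be sketched in a few lines of Python. This is an illustration only, not code taken from blastdbkit.py: the names `TAXDB_MB`, `decompression_ratio` and `estimate_uncompressed_mb` are hypothetical, and the 30 Mb taxdb size and the typical ratio of 2.5 are the May 4, 2020 figures quoted above.

```python
TAXDB_MB = 30  # compressed size of taxdb quoted in the footnote (May 4, 2020)


def decompression_ratio(uncompressed_mb, compressed_mb, taxdb_mb=TAXDB_MB):
    """Uncompressed size divided by compressed size, ignoring the bundled taxdb copy."""
    payload_mb = compressed_mb - taxdb_mb
    if payload_mb <= 0:
        raise ValueError("compressed size must be larger than the bundled taxdb copy")
    return uncompressed_mb / payload_mb


def estimate_uncompressed_mb(compressed_mb, ratio=2.5, taxdb_mb=TAXDB_MB):
    """Rough disk space needed after unpacking, assuming a typical ratio of 2.5."""
    return (compressed_mb - taxdb_mb) * ratio


# Rows taken from the table above: (uncompressed Mb, compressed Mb)
print(round(decompression_ratio(149731, 64046), 2))  # nt -> 2.34
print(round(decompression_ratio(37, 39), 2))         # ITS_eukaryote_sequences -> 4.11
```

Rows whose compressed size is no larger than the taxdb copy (taxdb itself, for example) fall outside this sketch and raise `ValueError`.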
Memory and Cores*
Unless you have enough cores and RAM, it
may be pointless to install local copies of BLAST databases.
Standalone BLAST+ uses a great deal of memory, and may be unreasonably slow on machines without enough cores (CPUs). For most databases, you probably want a minimum of 8 cores and 16 Gb of RAM. The good news is that it is relatively cheap to upgrade a computer to a configuration that will give you faster turnaround times than sending your jobs to NCBI.
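One way to check a machine against these minimums is sketched below in Python. It is only a rough guide under stated assumptions: the function names are hypothetical, the thresholds are the 8 cores and 16 Gb suggested above, and the RAM lookup relies on POSIX `os.sysconf`, which is not available on every platform (the function returns `None` where it is not).

```python
import os

MIN_CORES = 8    # minimum suggested in the text
MIN_RAM_GB = 16  # minimum suggested in the text


def meets_minimum(cores, ram_gb, min_cores=MIN_CORES, min_ram_gb=MIN_RAM_GB):
    """True if both the core count and the RAM reach the suggested minimums."""
    return cores >= min_cores and ram_gb >= min_ram_gb


def detect_cores():
    """Number of logical cores reported by the operating system (at least 1)."""
    return os.cpu_count() or 1


def detect_ram_gb():
    """Physical RAM in gigabytes via POSIX sysconf, or None where unsupported."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


cores = detect_cores()
ram_gb = detect_ram_gb()
if ram_gb is None:
    print(f"{cores} cores; RAM could not be determined on this platform")
else:
    print(f"{cores} cores, {ram_gb:.1f} Gb RAM; meets minimum: {meets_minimum(cores, ram_gb)}")
```

The operating system usually reports slightly less RAM than the nominal installed amount, so a machine sold with 16 Gb may fall just under the threshold here.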
A future
version of this document will include some statistics on
search times on computer systems with various configurations
of CPU and RAM.
*The
distinction between cores and CPUs is as follows:
A
Central Processing Unit performs operations on data in RAM.
Originally, the CPU had one processor. Today, the vast majority
of CPUs manufactured, even for low-end PCs, have 2 or
more cores, each of which can process information independently.
The terms CPU and core, while not synonymous, are often used
interchangeably. Strictly speaking, it is most correct to use
the term core to refer to the number of processing units.
A hard-wired
(Ethernet) connection is best. In most institutional
settings, each computer will have a separate IP address on a
switch.
Wi-Fi may be
too slow or not have adequate bandwidth. Downloads will almost
certainly take longer on Wi-Fi. As well, since Wi-Fi is a shared
resource, large downloads may affect Wi-Fi performance for
others nearby.
There are
several FTP sites open to the general public for downloading
copies of the NCBI BLAST databases. Since these "mirrors" are
kept in sync with NCBI, the primary considerations should be
download speed and, as a courtesy to others, minimizing the
amount of network load that your downloads generate. As
well, heavy network traffic can often result in dropped
connections during a download.
For all these
reasons, it is usually best to download files from the FTP site
geographically closest to your location.
FTP site | Directory for BLAST file downloads | Location |
ftp.ncbi.nih.gov | /blast/db | Bethesda, Maryland, USA |
ftp.ebi.ac.uk | pub/blast/db | EBI, UK |
ftp.hgc.jp | pub/mirror/ncbi/blast/db | Tokyo, Japan |
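Because connections can drop partway through a large download, it can help to resume rather than restart. The Python sketch below uses the standard `ftplib` module and the mirrors listed in the table above. It is a minimal illustration, not how blastdbkit.py performs downloads: the function name `download_with_resume`, the retry count and the `ftp_factory` parameter are hypothetical, and it assumes the server accepts anonymous login and supports restarting a transfer at a byte offset.

```python
import ftplib
import os

# Mirrors and directories from the table above.
MIRRORS = {
    "ftp.ncbi.nih.gov": "/blast/db",
    "ftp.ebi.ac.uk": "pub/blast/db",
    "ftp.hgc.jp": "pub/mirror/ncbi/blast/db",
}


def download_with_resume(host, filename, local_path, max_attempts=3,
                         ftp_factory=ftplib.FTP):
    """Fetch one file by anonymous FTP, resuming after a dropped connection."""
    remote_dir = MIRRORS[host]
    for attempt in range(1, max_attempts + 1):
        # Resume from however many bytes are already on disk.
        offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        try:
            ftp = ftp_factory(host, timeout=60)
            try:
                ftp.login()  # anonymous login
                ftp.cwd(remote_dir)
                with open(local_path, "ab") as out:
                    ftp.retrbinary("RETR " + filename, out.write,
                                   rest=offset or None)
            finally:
                ftp.close()
            return local_path
        except ftplib.all_errors:
            # Give up only after the last attempt; otherwise retry from the new offset.
            if attempt == max_attempts:
                raise


# Example call (performs a real download, so it is left commented out):
# download_with_resume("ftp.ebi.ac.uk", "taxdb.tar.gz", "taxdb.tar.gz")
```

The sketch does not check whether the local file is already complete, and it does not verify the downloaded archive, so a real script would need both.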