############################################################################ UCSC procedure to mirror genbank/refseq assemblies and generate assembly hubs This README.txt file is from: http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/genbank/README.txt ############################################################################ # mirror genbank/refseq hierarchy There is a cron job running once a day M-F that runs an rsync command to fetch a select set of files from genbank/refseq. The script here is: cronUpdate.sh with one argument specifying either the refseq or genbank assembly hierarchy, e.g.: cronUpdate.sh refseq The accumulating hierarchy at UCSC is in: export outside="/hive/data/outside/ncbi/genomes/refseq" Or, if genbank assemblies: export outside="/hive/data/outside/ncbi/genomes/genbank" The rsync command has a number of include statements to pick up a select set of files: cd "${outside}" rsync -avPL --include "*/" --include="chr2acc" --include="*chr2scaf" \ --include "*placement.txt" --include "*definitions.txt" \ --include "*_rm.out.gz" --include "*agp.gz" --include "*regions.txt" \ --include "*_genomic.fna.gz" --include "*gff.gz" \ --include "*_assembly_report.txt" --exclude "*" \ rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/${D}/ ./${D}/ \ || /bin/true The "${D}" directory name is one of the directories at ftp.ncbi.nlm.nih.gov/genomes/{refseq|genbank}/${D}/ Slightly different directory listings for refseq vs. genbank. For refseq assemblies: for D in archaea bacteria fungi invertebrate mitochondrion plant plasmid \ plastid protozoa vertebrate_mammalian vertebrate_other viral For genbank assemblies: for D in archaea bacteria fungi invertebrate other plant protozoa \ vertebrate_mammalian vertebrate_other The cron script also maintains file listings and logs in: ${outside}/fileListings/yyyy/mm/ ${outside}/logs/yyyy/mm/ to track changes over time and provide a record of activity ############################################################################ # processing NCBI files into assembly hub files Processing takes place in the two hierarchies: /hive/data/inside/ncbi/genomes/refseq/ /hive/data/inside/ncbi/genomes/genbank/ The scripts from this source directory: kent/src/hg/utils/automation/genbank/ are used to process the NCBI files into assembly hub files. This processing is done with a cluster computer set up since there are a large number of steps to perform. The primary driving script is: ncbiToUcsc.sh It takes two arguments: 1. genbank|refseq - to specify which hierarchy to work on 2. /*_genomic.fna.gz - fasta sequence file in the 'outside' path ############################################################################