Background
The mission of the US Department of Energy Joint Genome Institute (DOE JGI) is to advance genomics in support of DOE missions related to clean energy generation and environmental characterization and cleanup. Key mission areas are bioenergy, the carbon cycle and biogeochemistry.
Challenges
DOE JGI is at the forefront of large-scale, sequence-based science and is responsible for the QA/QC of sequencing data as it is generated by distributed production facilities. With many assemblers in use and many previous projects driven by publication, its sequencing and assembly work was open to subjectivity and bias. The institute needed pipelines that address the lack of standardization in the continual sequencing and assembly of large volumes of genome data, to ensure that the data produced and released to the community is of high quality and meets specific standards. Its existing processes were wasteful and inefficient: assembly software was difficult and time-consuming to set up and run, which made it impossible to objectively compare the quality and performance of the assemblers.
Solution
With Docker containers, DOE JGI was able to improve on its approach to the assembler project by benchmarking assemblers so that they can be objectively evaluated, and by crowdsourcing assemblers from the bioinformatics community. Using Linux containers running under Docker, the genome assemblers and their associated pipelines were built as Docker images and then hosted on Docker Hub. The benchmarking pipeline can now pull an image and run it against an array of reference data sets. The resulting assembly is then evaluated against the reference sequence using QUAST, a quality assessment tool for genome assemblies, and the assembly metrics and results are posted on the site. By simplifying the technology and automating processes, researchers gained more time for science and improved its quality. Docker containerization provided a standardized pipeline with consistent APIs, and even a catalog, leading to objective comparison of the tools and their results. Researchers can now have a data-driven conversation and easily share assemblers, results and data with each other, leading to better science.
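The pull, run and evaluate flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DOE JGI's actual pipeline: the image name, the data set paths, and the container's command-line interface (a reads file and an output path as arguments) are all hypothetical. Only `docker pull`, `docker run --rm -v` and QUAST's `quast.py <contigs> -r <reference> -o <dir>` are real command forms. The sketch builds the commands without executing them, so it can be inspected safely.

```python
import shlex
from pathlib import PurePosixPath


def pull_command(image):
    """Command that fetches an assembler image from a registry such as Docker Hub."""
    return ["docker", "pull", image]


def run_command(image, reads, out_dir):
    """Command that runs the assembler container on one set of reads.

    Assumption: the (hypothetical) container takes the reads file and the
    output contigs path as its two arguments. Host paths must be absolute
    for Docker bind mounts.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{reads.parent}:/input:ro",   # reads mounted read-only
        "-v", f"{out_dir}:/output",          # assembly is written here
        image,
        f"/input/{reads.name}",
        "/output/contigs.fa",
    ]


def quast_command(contigs, reference, report_dir):
    """Command that scores an assembly against its reference with QUAST."""
    return ["quast.py", str(contigs), "-r", str(reference), "-o", str(report_dir)]


def plan(image, datasets, work_dir):
    """Ordered list of commands for benchmarking one image on every data set."""
    commands = [pull_command(image)]
    for name, (reads, reference) in datasets.items():
        out_dir = work_dir / name
        commands.append(run_command(image, reads, out_dir))
        commands.append(
            quast_command(out_dir / "contigs.fa", reference, out_dir / "quast")
        )
    return commands


# Hypothetical image and data sets, for illustration only.
IMAGE = "example/assembler"
DATASETS = {
    "set_a": (PurePosixPath("/data/set_a/reads.fq.gz"),
              PurePosixPath("/data/set_a/reference.fa")),
    "set_b": (PurePosixPath("/data/set_b/reads.fq.gz"),
              PurePosixPath("/data/set_b/reference.fa")),
}

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in plan(IMAGE, DATASETS, PurePosixPath("/bench")):
        print(shlex.join(command))
```

To actually execute the plan, each command list could be passed to `subprocess.run(command, check=True)`. Because every assembler is reached through the same container invocation, swapping in a different image leaves the rest of the plan unchanged, which is what makes the comparison between assemblers like-for-like.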