Case Study Big Data and Data Storage: astronomical amounts of data at the Max Planck Institute

A seemingly endless vastness of data, managed by the Max Planck Institute for Radio Astronomy with OpenArchive by GRAU DATA: a case study on using an open source program for big data from pulsar research.

When a research facility such as the Max Planck Institute looks for new insights into deep space, it first of all finds one thing: astronomical amounts of data. Therein lie the answers to big questions.

This is how the research group for fundamental physics in radio astronomy at the Max Planck Institute explores cosmic radio emission. To do so, the group analyzes pulsars in order to study the magnetic forces of the galaxy. Its observations and data enable nothing less than tests of the general theory of relativity and of alternative theories of gravity.

The data for this is collected at the Effelsberg radio telescope, which generates about 100 gigabytes of data in only 30 minutes of measurements. Every month, around 18 terabytes of measured data are stored for calculation and analysis. Interpreting this data takes notably longer. By law, the institute has to archive this data for 10 years, which is not even the blink of an eye from the universe's perspective. Data on outer space does not lose its relevance for centuries and could enable a breakthrough in future research. For this reason, it has to be stored securely and for a very long time, with unhindered access possible at all times. New algorithms are developed all the time and are applied to “old” data sets as well.

The problem is that storage space is not unlimited. Archiving all data from radio telescopes on hard drives, meaning online storage, would exceed the institute's budget by far. Additionally, most of the data is not constantly in use and lies untouched on the storage units most of the time, using up resources.
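The figures above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, assuming decimal units and a constant monthly volume over the whole retention period; the constant and function names are purely illustrative and not part of any institute tooling.

```python
# Figures quoted in the text. Assumption: the monthly volume stays
# constant over the full retention period.
GB_PER_SESSION = 100      # data generated per measurement session
SESSION_MINUTES = 30      # length of one session
TB_PER_MONTH = 18         # measured data stored per month
RETENTION_YEARS = 10      # legally required archiving period


def raw_rate_gb_per_hour(gb=GB_PER_SESSION, minutes=SESSION_MINUTES):
    """Data rate of the telescope while it is measuring, in GB per hour."""
    return gb * 60 / minutes


def retained_volume_tb(tb_per_month=TB_PER_MONTH, years=RETENTION_YEARS):
    """Volume accumulated over the retention period, in TB."""
    return tb_per_month * 12 * years


if __name__ == "__main__":
    print(f"Rate while measuring: {raw_rate_gb_per_hour():.0f} GB/h")
    print(f"Volume after {RETENTION_YEARS} years: {retained_volume_tb()} TB")
```

Under these assumptions, the legally required ten years alone add up to 2,160 terabytes, or a little over two petabytes.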

This poses the challenge of storing enormous amounts of data for the long term, securely and at the same time cost-efficiently. The solution: a Hierarchical Storage Management (HSM) concept based on GRAU DATA OpenArchive, an open source HSM and archiving software that can manage several petabytes of data very efficiently, combined with LTO tapes as the medium for long-term archiving.
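The general idea behind hierarchical storage management can be illustrated with a small sketch: data that has not been accessed for a while is selected for migration from expensive disk to cheaper tape. The code below is a generic illustration only. The `FileInfo` type, the `select_for_migration` function and the 30-day threshold are hypothetical and do not describe how OpenArchive actually decides what to migrate.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Hypothetical description of a file on the disk tier."""
    path: str
    size_bytes: int
    days_since_access: int


def select_for_migration(files, min_idle_days=30):
    """Return files idle for at least min_idle_days, largest first.

    Assumption: idle time is the only criterion; a real HSM policy
    may also consider fill levels, file types or explicit rules.
    """
    candidates = [f for f in files if f.days_since_access >= min_idle_days]
    return sorted(candidates, key=lambda f: f.size_bytes, reverse=True)


if __name__ == "__main__":
    disk_tier = [
        FileInfo("obs_a.dat", 100_000_000_000, 2),
        FileInfo("obs_b.dat", 80_000_000_000, 45),
        FileInfo("obs_c.dat", 120_000_000_000, 400),
    ]
    for f in select_for_migration(disk_tier):
        print(f.path)
```

In this example, only the two files that have been idle for more than 30 days would be moved to tape, while the recently used file stays on disk.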

The countdown for outer space data storage with the HSM and archiving software OpenArchive starts in August 2011. As a first step, the software is ported to the Debian GNU/Linux operating system, as requested by the Max Planck Institute. Tests are completed successfully as early as October, and the complete solution goes into service in November.

“The software is running perfectly reliably and also does not leave any administrative requirements unfulfilled,” says Jan Behrend, IT expert at the Max Planck Institute, about the successful project.

First of all, the astronomical measurement data from the Effelsberg radio telescope is buffered on 120 TB of online disk storage in an 8 Gbit FC SAN. On the server side, powerful Fujitsu Primergy RX 300 S6 systems use GRAU DATA OpenArchive to move the data redundantly onto the Spectralogic LTO 5 tape libraries in Effelsberg and Bonn. In 2012, the archiving software manages around 350 tapes per library, each tape containing 1.5 terabytes (TiB) of data, and the data sets are growing quickly. Altogether, the amount of data reaches 525 terabytes by May 2012. At present, however, the overall system is expandable to up to 3.5 petabytes.
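The tape figures can be checked with simple arithmetic. The sketch below assumes that the per-tape figure of 1.5 is used as quoted and that one petabyte equals 1,000 terabytes; the names are illustrative.

```python
import math

TAPES_PER_LIBRARY = 350   # tapes managed per library in 2012
TB_PER_TAPE = 1.5         # data per tape, as quoted in the text


def library_capacity_tb(tapes=TAPES_PER_LIBRARY, tb_per_tape=TB_PER_TAPE):
    """Total data held by one library, in TB."""
    return tapes * tb_per_tape


def tapes_needed(total_tb, tb_per_tape=TB_PER_TAPE):
    """Number of tapes needed to hold total_tb, rounded up to whole tapes."""
    return math.ceil(total_tb / tb_per_tape)


if __name__ == "__main__":
    print(library_capacity_tb())   # 350 tapes at 1.5 TB each
    print(tapes_needed(3500))      # tapes for 3.5 PB, assuming 1 PB = 1000 TB
```

350 tapes at 1.5 terabytes each give 525 terabytes, which matches the total quoted for May 2012. This would be consistent with each library holding a full copy of the data, although that reading is an assumption. Holding one copy of 3.5 petabytes would take 2,334 tapes of this size.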

“As opposed to classic standard archiving systems in companies, the tape technology in our department of the Max Planck Institute is often used as expanded online storage, accessed by researchers on a regular basis,” says Jan Behrend, explaining the storage structure.

“Tape libraries in combination with OpenArchive are fast enough in a 1 Gbit/s network to provide the enormous amounts of data for our research groups. At the same time, this storage system offers us enormous cost advantages compared to classic online storage disks.”

The hardware-independent GRAU DATA OpenArchive, in combination with Fujitsu servers, is able to migrate large amounts of data onto tape libraries very quickly. The initial data rate in this HSM system is one gigabit per second. When the tape drives are working at full capacity, the read and write speed reaches up to 130 MiB per second per device, which is equivalent to roughly 500 GiB per hour.
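The unit conversions behind these rates can be sketched as follows. The sketch assumes binary units (1 GiB = 1024 MiB) for the drive rate and a decimal gigabit (10^9 bits) for the network rate; the function names are illustrative.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def mib_per_s_to_gib_per_h(rate_mib_s):
    """Convert a sustained rate in MiB/s to GiB per hour."""
    return rate_mib_s * 3600 / 1024


def gbit_per_s_to_mib_per_s(rate_gbit_s):
    """Convert a line rate in Gbit/s (10^9 bit/s) to MiB/s, ignoring protocol overhead."""
    return rate_gbit_s * 1e9 / 8 / MIB


if __name__ == "__main__":
    print(f"{mib_per_s_to_gib_per_h(130):.0f} GiB/h per drive")
    print(f"{gbit_per_s_to_mib_per_s(1):.1f} MiB/s on a 1 Gbit/s link")
```

Under these assumptions, 130 MiB per second corresponds to about 457 GiB per hour per drive, so the figure in the text is a generous rounding. A 1 Gbit/s link carries at most about 119 MiB per second before protocol overhead, slightly below the peak rate of a single drive.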

OpenArchive offers the IT team at the research institute easy and intuitive administration. The software continuously monitors fill levels and transfer rates. If manual intervention becomes necessary, the administrator immediately receives a message. The daily backup of metadata to the remote location also runs fully automatically.

Because of the trouble-free operation of the HSM and archiving software, the Max Planck Institute decided to use the multi-client capability of this solution and to incorporate two additional research groups into the overall system. The multi-client capability made it possible to create separate partitions, which ensure the separation of data and the separate use of drives and tapes.

A decisive reason for choosing the GRAU DATA archiving software was its availability as an open source program, alongside its broad range of functions and its port to the Debian Linux operating system. The Max Planck Institute, like most research institutes all over the world, relies on Linux as its operating system platform. With OpenArchive, GRAU DATA offers the only Linux-based professional archiving software in the world that is available on an open source basis.

“Besides possible savings on license costs, our considerations also take into account that we as a research institute often design and write our own applications. These are easier to tie into the open code of an archiving software in a consistent open source environment.” (Jan Behrend)
