1687
(Invited) Best Practices for Building Robust and Usable Materials Datastores

Wednesday, 31 May 2017: 16:15
Grand Salon C - Section 13 (Hilton New Orleans Riverside)
D. Gunter (LBNL)
For many years, and at an accelerated pace since the Materials Genome Initiative in 2011, there has been increasing awareness across the materials science community of the importance of the confluence of computational tools, experimental tools, and digital data. Algorithms and supercomputers have become capable of computing characteristics that can predict real-world materials performance; networks and storage can interface with high-performance computing to sift through petabytes of information from experimental instruments. One of the central challenges of this confluence of computer science and materials science knowledge, often called "materials informatics", is building datastores that can ingest, organize, curate, archive, disseminate, and help share and analyze materials data. Due to the nearly universal use of computing, this is arguably the most cross-cutting activity in materials science research today. Yet this activity occurs largely in the shadows. There is very little visibility into existing practices and available infrastructure from established projects such as the Materials Project or the Center for Hierarchical Materials Design, and little guidance for the materials science community, or its sub-communities, on how to navigate the bewildering array of database products and connect these to websites or analysis environments. Despite the existence of many independent efforts, there has been little discussion of how data can be exchanged or validated between them, or of how scientists' lives could be made easier by common access patterns. The Energy Materials Network, as a cross-cutting effort, represents a significant opportunity to address some of these core social and technological challenges and bring the materials informatics efforts "into the light" of the more general scientific software community.
This talk will focus on four aspects, with the overall goal of provoking thought and discussion: (1) how to approach the scoping and design of databases, so you can solve the right problem for the right community, (2) a highly selective tour of some key technologies and design decisions in existing materials databases, (3) considerations and best practices for the overall data lifecycle, and (4) thoughts on how the open-source development model may apply to materials data hubs. These points will draw on the speaker's long experience creating software for scientific projects in fields including high-energy physics, bioinformatics, process engineering, and materials science.