2021 Summer boot camp

July 12-15, 2021

Overview

The CAES Summer Boot Camp in Data Science is a virtual crash course, held July 12-15, that educates researchers and students and enables rapid implementation of data science tools in their research. The boot camp invites data science experts from within and outside of CAES institutions, as well as from industry, to give tutorials and presentations ranging from data science basics to applications. Topics include the data mining process, scientific visualization, machine learning, and industrial applications. The boot camp also hosts a research discussion panel where participants can share experiences and establish collaborations.

The boot camp is open to students, faculty, and researchers interested in using data science tools in their research. No prior knowledge of the tools to be presented is needed.

Event Materials

Agenda

Videos and PowerPoint slides will be posted after the event.

Registration Instructions

Go here to register. Closer to the workshop dates, organizers will send meeting links for online participation.

To provide the best learning environment for participants, by registering you agree to attend the sessions you sign up for. If you cannot attend, please email [email protected] so we can open the seat for someone else.

Agenda

The following workshops will occur remotely on the listed dates. Please register only for the workshops you plan to attend, and register for as much or as little of the program as your schedule allows.

Date	Time (MST)	Instructor(s)	Institution(s)	Tutorial
July 12	9am-noon	Randall Reese	Idaho National Laboratory	RShiny and PyFlask: Building Interactive Web Apps for Data
July 12	1pm-5pm	Leslie Kerby	Idaho State University	NumPy, Pandas, and Scikit-Learn: Prediction with Decision Trees
July 13	9am-noon	Steve Cutchin	Boise State University	Visualization with ParaView
July 13	1pm-2pm	Roba Binyahib	University of Oregon	TBD
July 13	2pm-5pm	Eric Brugger	Lawrence Livermore National Laboratory	TBD
July 14	9am-10:15am	Benjamin Afflerbach	University of Wisconsin-Madison	An Introduction to Machine Learning for Materials Science: A Basic Workflow for Predicting Materials Properties
July 14	10:15am-noon	Ryan Jacobs	University of Wisconsin-Madison	The Materials Simulation Toolkit for Machine Learning (MAST-ML): Automating Development and Evaluation of Machine Learning Models for Materials Property Prediction
July 14	1pm-2pm	Mahmood Mamivand	Boise State University	The Informatics Skunkworks: A Program for Undergraduate Research at the Interface of Data Science and Materials Science and Engineering
July 14	2pm-5pm	Paul Bodily	Idaho State University	Crossing Darwin and Computer Science: The Staying Power of Evolutionary Algorithms
July 15	9am-noon	Sara Ewing, Matty Jones, Conrad Kennington	Kount	A Day in the Life of a Kount Data Scientist
July 15	1pm-3pm	Local Industry Expert(s)	TBD	Industry Applications of Artificial Intelligence / Machine Learning
July 15	3pm-5pm	Various	Various	Data Science Research Discussion Panel: Tools Applications, Networking, and Collaboration

Workshop Descriptions and Materials

RShiny and PyFlask: Building Interactive Web Apps for Data. Building data dashboards accessible via the Internet is an excellent way to let users interact with data using tools they are already familiar with. This tutorial will teach participants how to build their own web-based data dashboards using RShiny (in R) and Flask (in Python). We will begin by building a simple example with each tool, then move to advanced applications of these packages.

NumPy, Pandas, and Scikit-Learn: Prediction with Decision Trees. NumPy and pandas are the foundation of the Python data science stack: most Python machine learning libraries, including scikit-learn, use their objects and data structures. Come learn the basics of NumPy and pandas, and learn how to build and train decision trees for classification.
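As a rough illustration of the kind of workflow this tutorial describes (not the instructor's actual materials), the sketch below loads a small dataset into pandas objects, trains a scikit-learn decision tree classifier, and uses NumPy to compute held-out accuracy. The Iris dataset, the 75/25 split, and the `max_depth=3` setting are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the bundled Iris data (an illustrative choice) into pandas objects.
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = pd.Series(iris.target, name="species")

# Hold out 25% of the rows so the tree is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# A shallow tree keeps the model small; max_depth=3 is an assumption.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# predict() returns a NumPy array; compare it with the true labels.
predictions = tree.predict(X_test)
accuracy = float(np.mean(predictions == y_test.to_numpy()))
print(f"Held-out accuracy: {accuracy:.2f}")
```

The same pattern (a DataFrame of features, a Series of labels, then `fit` and `predict`) carries over to most other scikit-learn classifiers.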

An Introduction to Machine Learning for Materials Science: A Basic Workflow for Predicting Materials Properties. This tutorial will introduce core concepts of machine learning through the lens of a basic workflow to predict material bandgaps from material compositions. As we progress through this workflow, we will highlight key steps, challenges that can come up with materials data, and potential solutions to these challenges. The core workflow we will introduce includes Data Cleaning, Feature Generation, Feature Engineering, Establishing Model Assessment, Training a Default Model, Hyperparameter Optimization, and Making Predictions. By the end of the tutorial, I hope that you will have a better understanding of these core concepts and how they fit together. If you want to preview the materials ahead of time, you can find them on nanoHUB here: https://nanohub.org/tools/intromllab

The Materials Simulation Toolkit for Machine Learning (MAST-ML): Automating Development and Evaluation of Machine Learning Models for Materials Property Prediction. This tutorial introduces the Materials Simulation Toolkit for Machine Learning (MAST-ML), a Python package designed to broaden and accelerate the use of machine learning and data science methods for materials property prediction. Through hands-on activities, we will use MAST-ML to (1) import materials datasets from online databases and clean and examine our input data, (2) conduct feature engineering analysis, including generation, preprocessing, and selection of features, (3) construct, evaluate, and compare the performance of different model types and data splitting techniques, and (4) conduct a preliminary assessment of model error analysis and uncertainty quantification (UQ). MAST-ML Tool GitHub page: https://github.com/uw-cmg/MAST-ML

The Informatics Skunkworks: A Program for Undergraduate Research at the Interface of Data Science and Materials Science and Engineering. In this presentation, I will go over the new infrastructure and ecosystem that we are developing for the engagement and training of undergraduate students (UGs) in research at the interface of data science and materials science and engineering, with a focus on the use of applied machine learning (ML) in materials informatics. I will describe the resources that we have developed to lower barriers to starting research projects, including (a) a curriculum to train UGs in relevant data science and materials informatics, (b) software tools that augment existing ML packages to be UG accessible, and (c) authentic and appropriate-level research problems.

Crossing Darwin and Computer Science: The Staying Power of Evolutionary Algorithms. Beyond revolutionizing our views on life and the world in which we live, Darwin's theory of evolution has been the basis and ongoing inspiration for an entire branch of machine learning. Evolutionary algorithms are frequently used to tackle some of Computer Science's most nefarious challenges (the notorious NP-complete problems) in applications as varied as mirrors designed to funnel sunlight to a solar collector, antennae designed to pick up radio signals in space, walking methods for computer figures, and optimal design of aerodynamic bodies in complex flowfields. In this tutorial, Dr. Bodily will lay out the theory behind genetic algorithms, illustrate several applied examples of genetic algorithms from his research and other real-world applications, and involve participants in an interactive, live-coding demo to implement a genetic algorithm that can be repurposed for a variety of applications. Come prepared for a fun, engaging, rewarding learning experience!
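For readers new to the idea, here is a minimal sketch of a genetic algorithm. It is not Dr. Bodily's demo: the toy "OneMax" objective (maximize the number of 1 bits), tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values are illustrative assumptions.

```python
import random


def fitness(bits):
    """Toy objective (an assumption): count the 1 bits."""
    return sum(bits)


def tournament(population, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """One-point crossover: prefix of one parent, suffix of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=60, rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: carry the current best forward unchanged.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, rng, rate))
        population = children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

To repurpose the sketch for another problem, replace `fitness` and the bit-list encoding with ones suited to that problem; the selection, crossover, and mutation loop stays the same.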

Data Science Research Discussion Panel: Tools Applications, Networking, and Collaboration. Join us to hear our presenters share their experience with the data science tools they use in their research, their plans for future projects and grants, and how they recommend students continue growing their skills in these areas.

Presenter Bios

Randall Reese, Ph.D. Randall Reese is currently a data scientist at Idaho National Laboratory. He holds bachelor's and master's degrees in mathematics and a PhD in statistics, with an emphasis in computational statistics. Formerly from Missoula, Montana, he now resides in Idaho Falls, Idaho.

Leslie Kerby, Ph.D., M.B.A. Leslie Kerby is the director of Computational Engineering And Data Science (CEADS). Research interests are interdisciplinary and include computational science, data science, and nuclear science and engineering.

Steve Cutchin, Ph.D. Steve Cutchin is the director of Research Computing at Boise State and a faculty member in the Computer Science Department. Research interests include scientific data visualization, immersive environments, and serious games.

Benjamin Afflerbach. Benjamin Afflerbach is a graduate student in the Department of Materials Science and Engineering at the University of Wisconsin-Madison. His work has focused on machine learning predictions of metallic glass forming ability.

Ryan Jacobs, Ph.D. Ryan Jacobs is a Research Scientist with the Department of Materials Science and Engineering, University of Wisconsin-Madison. His work focuses on using atomistic modeling and machine learning to understand the structure and properties of materials at the atomic scale, with a particular focus on the discovery and engineering of novel material compounds.

Mahmood Mamivand, Ph.D. Mahmood Mamivand is an assistant professor in the Department of Mechanical and Biomedical Engineering at Boise State. Dr. Mamivand's research lies at the intersection of Computational Materials Science and Materials Informatics, with a particular focus on microstructure-mediated materials design.

Paul Bodily, Ph.D. Paul Bodily is an assistant professor in the Computer Science Department and head of the Computational Creativity and Intelligence Lab (CCIL) at Idaho State University. His research addresses the question of whether computers, beyond possessing artificial intelligence, can exhibit autonomous creativity. His primary research interest is the domain of lyrical music composition and the challenge of invoking long-term structure in sequence generation.

Sara Ewing, Matty Jones, Ph.D., and Conrad Kennington are data scientists at Kount. They will present A Day in the Life of a Kount Data Scientist, using Jupyter notebooks on Google Colab to share representative workflows from their team.

Organization Committee:

Lan Li (BSU)
Leslie Kerby (ISU)
Eric Jankowski (BSU)
Steven Cutchin (BSU)
Mahmood Mamivand (BSU)
Mendi Edgar (BSU)
Lawrence Spear (BSU)
Hillary K. Fishler (CAES)