2021 Summer boot camp

July 12-15, 2021

Overview

The CAES Summer Boot Camp in Data Science is a virtual crash course, held July 12-15, that educates researchers and students and enables rapid implementation of data science tools in their research. The boot camp invites data science experts from within and outside CAES institutions and industry to give tutorials and presentations ranging from data science basics to applications. Topics include the data mining process, scientific visualization, machine learning, and industrial applications. The boot camp also hosts a research discussion panel where participants can share experiences and establish collaborations.

The boot camp is open to students, faculty, and researchers interested in using data science tools in their research. No prior knowledge of the tools to be presented is needed.

Event Materials

Agenda

Videos and PowerPoint slides will be posted as presenters provide them.

Registration Instructions

Go here to register. Closer to the workshop dates, organizers will send meeting links for online participation.

To provide the best learning environment for participants, by registering you agree to attend the sessions you sign up for. If you cannot attend, please email researchcomputing@boisestate.edu so we can open the seat for someone else.

Agenda

The following workshops will occur remotely on the listed dates. Please register only for the workshops you plan to attend, and register for as much or as little of the program as your schedule allows.

DATE/TIME (MST) | INSTRUCTORS | TUTORIAL
July 12 9am-12pm | Randall Reese (Idaho National Laboratory) | RShiny and (Py)Dash: Building Interactive Web Apps for Data
July 12 1pm-5pm | Leslie Kerby (Idaho State University) | NumPy, Pandas, and Scikit-Learn: Prediction with Decision Trees
July 13 9am-12pm | Steve Cutchin (Boise State University) | Visualization with ParaView
July 13 1pm-2pm | Roba Binyahib (University of Oregon) | Visualization Design Patterns
July 13 2pm-5pm | Eric Brugger, Cyrus Harrison, and Matt Larsen (Lawrence Livermore National Laboratory) | Scientific Visualization with VisIt and Ascent
July 14 9am-10:15am | Benjamin Afflerbach (University of Wisconsin-Madison) | An Introduction to Machine Learning for Materials Science: A Basic Workflow for Predicting Materials Properties
July 14 10:15am-12pm | Ryan Jacobs (University of Wisconsin-Madison) | The Materials Simulation Toolkit for Machine Learning (MAST-ML): Automating Development and Evaluation of Machine Learning Models for Materials Property Prediction
July 14 1pm-2pm | Mahmood Mamivand (Boise State University) | The Informatics Skunkworks: A Program for Undergraduate Research at the Interface of Data Science and Materials Science and Engineering
July 14 2pm-5pm | Paul Bodily (Idaho State University) | Genetic/Evolutionary Algorithms
July 15 9am-12pm | Sarah Ewing, Matty Jones, Conrad Kennington (Kount) | A Day in the Life of a Kount Data Scientist
July 15 1pm-3pm | David Beery (Ephesoft) | Industry Applications of Artificial Intelligence / Machine Learning
July 15 3pm-5pm | Various | Data Science Research Discussion Panel: Tools Applications, Networking, and Collaboration

Workshop Descriptions and Materials

RShiny and (Py)Dash: Building Interactive Web Apps for Data. Building data dashboards accessible via the Internet is an excellent way to allow users to interface with data using tools they are already familiar with. This tutorial will teach participants how to build their own web-based data dashboards using RShiny (in R) and (Py)Dash (in Python). We will begin with building a simple example of each tool, then move to advanced applications of these packages. Presentation slides are available here (pdf).

NumPy, Pandas, and Scikit-Learn: Prediction with Decision Trees. NumPy and pandas are the foundation of the Python data science stack: most Python machine learning libraries, including scikit-learn, use their objects and data structures. Come learn the basics of NumPy and pandas, and learn how to build and train decision trees for classification.
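As a rough illustration of the kind of workflow this tutorial covers, the sketch below trains a small decision tree classifier with scikit-learn on data held in a pandas DataFrame. It is not taken from the tutorial materials: the dataset is synthetic, and the column names (`x1`, `x2`, `label`), the split sizes, and the `max_depth` setting are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data, invented for illustration: two numeric features
# and a binary label that depends only on the first feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.uniform(0, 1, 200),
    "x2": rng.uniform(0, 1, 200),
})
df["label"] = (df["x1"] > 0.5).astype(int)

# Hold out the last 50 rows for testing (an arbitrary split for the sketch).
features = ["x1", "x2"]
train, test = df.iloc[:150], df.iloc[150:]

# Fit a shallow tree; max_depth=3 is an assumed setting, not a recommendation.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(train[features], train["label"])

train_acc = clf.score(train[features], train["label"])
test_acc = clf.score(test[features], test["label"])
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
print("feature importances:", dict(zip(features, clf.feature_importances_)))
```

Because the label here is a simple threshold on `x1`, the tree should recover it with a single split; real datasets typically need more careful validation, such as cross-validation.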

An Introduction to Machine Learning for Materials Science: A Basic Workflow for Predicting Materials Properties. This tutorial will introduce core concepts of machine learning through the lens of a basic workflow to predict material bandgaps from material compositions. As we progress through this workflow we will highlight key steps, challenges that can come up with materials data, and potential solutions to these challenges. The core workflow we will introduce includes Data Cleaning, Feature Generation, Feature Engineering, Establishing Model Assessment, Training a Default Model, Hyperparameter Optimization, and Making Predictions. By the end of the tutorial I hope that you will have a better understanding of these core concepts, and how they can all fit together. If you want to preview the materials ahead of time you can find them on Nanohub here: https://nanohub.org/tools/intromllab

The Materials Simulation Toolkit for Machine Learning (MAST-ML): Automating Development and Evaluation of Machine Learning Models for Materials Property Prediction. This tutorial contains an introduction to the use of the Materials Simulation Toolkit for Machine Learning (MAST-ML), a python package designed to broaden and accelerate the use of machine learning and data science methods for materials property prediction. Through hands-on activities, we will use MAST-ML to (1) import materials datasets from online databases and clean and examine our input data, (2) conduct feature engineering analysis, including generation, preprocessing, and selection of features, (3) construct, evaluate and compare the performance of different model types and data splitting techniques, and (4) conduct a preliminary assessment of model error analysis and uncertainty quantification (UQ). MAST-ML Tool Github page: https://github.com/uw-cmg/MAST-ML

The Informatics Skunkworks: A Program for Undergraduate Research at the Interface of Data Science and Materials Science and Engineering. In this presentation, I will go over the new infrastructure and ecosystem that we are developing for the engagement and training of undergraduate students (UGs) in research at the interface of data science and materials science and engineering, with a focus on the use of applied machine learning (ML) in materials informatics. I will describe the resources that we have developed to lower barriers to starting research projects, including (a) curriculum to train UGs in relevant data science and materials informatics, (b) software tools that augment existing ML packages to be UG accessible, and (c) authentic and appropriate-level research problems.

Crossing Darwin and Computer Science: The Staying Power of Evolutionary Algorithms. Beyond revolutionizing our views on life and the world in which we live, Darwin's theory of evolution has been the basis and ongoing inspiration for an entire branch of machine learning. Evolutionary algorithms are frequently used to tackle some of Computer Science's most nefarious challenges (the notorious NP-complete problems) in applications as varied as mirrors designed to funnel sunlight to a solar collector, antennae designed to pick up radio signals in space, walking methods for computer figures, and optimal design of aerodynamic bodies in complex flowfields. In this tutorial, Dr. Bodily will lay out the theory behind genetic algorithms, illustrate several applied examples of genetic algorithms from his research and other real-world applications, and involve participants in an interactive, live-coding demo to implement a genetic algorithm that can be repurposed for a variety of applications. Come prepared for a fun, engaging, rewarding learning experience!
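For readers new to the idea, here is a minimal sketch of a genetic algorithm using only the Python standard library. It is not the code from the live demo: the toy objective (OneMax, which counts the 1 bits in a bitstring), the operators (tournament selection, single-point crossover, bit-flip mutation, one-individual elitism), and every parameter value are assumptions picked to keep the example short.

```python
import random

def fitness(genome):
    """OneMax: count the 1 bits (a toy objective chosen for illustration)."""
    return sum(genome)

def tournament(population, rng, k=3):
    """Select the fittest of k randomly sampled individuals."""
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    """Single-point crossover: prefix of parent a, suffix of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(genome, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]

def evolve(genome_len=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = [max(fitness(g) for g in population)]
    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged.
        elite = max(population, key=fitness)
        children = [
            mutate(crossover(tournament(population, rng),
                             tournament(population, rng), rng), rng, rate)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
        history.append(max(fitness(g) for g in population))
    return max(population, key=fitness), history

best, history = evolve()
print("best fitness per generation:", history)
print("best genome:", best)
```

Swapping in a different `fitness` function and genome encoding is the usual way to repurpose a loop like this for another problem; the selection, crossover, and mutation steps can often stay the same.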

Data Science Research Discussion Panel: Tools Applications, Networking, and Collaboration: Join us to hear our presenters share their experience with the data science tools they use in their research, their plans for future projects and grants, and how they recommend students continue growing their skills in these areas.

A Day in the Life of a Kount Data Scientist: Kount is the industry-leading provider of third-party digital payment fraud protection. In this workshop, the Kount Data Science team will use Jupyter Notebooks on Google Colab to show how fraudulent online e-commerce transactions can be identified and stopped before they happen.

Presenter Bios

Randall Reese, Ph.D.
Randall Reese is currently a data scientist at Idaho National Laboratory. He holds bachelor's and master's degrees in mathematics and a PhD in statistics, with an emphasis in computational statistics. Formerly from Missoula, Montana, he now resides in Idaho Falls, Idaho.

Leslie Kerby, Ph.D., M.B.A.
Leslie Kerby is the director of Computational Engineering And Data Science (CEADS). Research interests are interdisciplinary and include computational science, data science, and nuclear science and engineering.

Steve Cutchin, Ph.D.
Steve Cutchin is the director of Research Computing at Boise State and a faculty member in the Computer Science Department. His research interests include scientific data visualization, immersive environments, and serious games.

Benjamin Afflerbach
Benjamin Afflerbach is a graduate student in the Department of Materials Science and Engineering at the University of Wisconsin-Madison. His work has focused on machine learning predictions of metallic glass forming ability.

Ryan Jacobs, Ph.D.
Ryan Jacobs is a Research Scientist with the Department of Materials Science and Engineering, University of Wisconsin-Madison. His work focuses on using atomistic modeling and machine learning to understand the structure and properties of materials at the atomic scale, with a particular focus on the discovery and engineering of novel material compounds.

Mahmood Mamivand, Ph.D.
Mahmood Mamivand is an assistant professor at the Department of Mechanical and Biomedical Engineering at Boise State. Dr. Mamivand's research lies at the intersection of Computational Materials Science and Materials Informatics, with a particular focus on microstructure-mediated materials design.

Paul Bodily, Ph.D.
Paul Bodily is an assistant professor of Computer Science in the Computer Science Department and head of the Computational Creativity and Intelligence Lab (CCIL) at Idaho State University. His research addresses the question of whether or not computers, beyond possessing artificial intelligence, can exhibit autonomous creativity. His primary research interest focuses particularly on the domain of lyrical music composition and the challenge of invoking long-term structure in sequence generation.

Sarah Ewing
Sarah Ewing is a data scientist at Kount and has been a member of the team for four months. She is the host of the YouTube channel and podcast Sarah in Tech, organizer of the Boise Data Science Meet-Up, and a volunteer at the Idaho Technology Council. She is very passionate about data education. She is also a highly motivated data scientist with experience in the fraud detection, medical, nuclear, agriculture, psychological, and engineering fields. This has given her seven years of unique opportunities to implement algorithms to solve a variety of problems. In her free time, she is a mom to a three-year-old girl and a labradoodle.

Dr. Matty Jones
Matty Jones is the data science manager at Kount and has been a member of the team for two years. He has always been an advocate for efficiently solving problems using computers. He obtained his MPhys. and Ph.D. at Durham University in the United Kingdom in Theoretical Physics and Engineering respectively, using various computational techniques to explore the very large (star formation in the early universe) and the very small (electrons moving through plastic solar cells). During his Postdoc at Boise State in Materials Science he discovered how Machine Learning can massively simplify expensive computational calculations and simulations, while still making state-of-the-art predictions -- a key factor in real-time fraud prediction. In his free time, he plays guitar, video games, and sings.

Dr. Nate Monnig
Nate Monnig is the principal data scientist at Kount and has been a member of the team for three years. He earned a B.A. in Physics from Dartmouth College and a Ph.D. in Applied Mathematics from the University of Colorado Boulder. He has extensive research experience in computational methods for the analysis of large graphs and networks, as well as the development of advanced tracking algorithms for air and missile as well as cyber defense applications. Nate led the research effort to develop Kount's Omniscore and is continuing to actively develop innovative strategies to leverage Kount's Identity Trust Global Network to enhance Kount's payment fraud and digital account protection products. In his free time, Nate enjoys mountain biking, skiing, and spending time camping and river tripping with his wife and kids.

Dr. Divy Murli
Divy Murli is a data scientist at Kount and has been a member of the team for two years. He received his B.S. and Ph.D. degrees in physics from UCSB and Stanford, respectively. He became interested in data science and machine learning towards the end of his Ph.D., seeking to apply his quantitative and mathematical skillset to solve valuable problems in industry. Data science sits at the perfect intersection of statistics and computation, and poses lots of interesting problems for a mathematically minded person. In his free time, Divy enjoys road cycling, unicycling and cooking. Once pandemic restrictions fully lift, he's looking forward to travelling internationally again!

Cyrus Harrison
Cyrus Harrison is a Computer Scientist and Section Leader in Lawrence Livermore National Laboratory's Computing Directorate. He develops data management, analysis, and visualization tools that support HPC multi-physics simulations. He is the software architect of the VisIt open-source visualization tool and leads major aspects of the technical direction of the project.

Eric Brugger
Eric Brugger has over 30 years' experience developing and using scientific visualization and analysis software. He is the VisIt project leader and one of the original developers of the software. He received an R&D 100 award in 2005 as part of the development team of VisIt. He has extensive experience helping users visualize and understand their simulation data, as well as providing hands-on VisIt training in workshop settings.

David Beery
David Beery is the data science team lead at Ephesoft, a company focusing on document processing and automation. Before Ephesoft he worked in a number of data science and machine learning positions including computational photography, quantum biology and healthcare. He began working as a data scientist in natural language processing in 2013 but enjoys computer vision most.

Matt Larsen
Matt Larsen is a computer scientist at Lawrence Livermore National Laboratory. He received his Ph.D. in computer science from the University of Oregon in 2016. He is the primary developer for ECP-ALPINE's Ascent in situ library, as well as a key contributor to VTK-m and VisIt. Matt's research interests include rendering for visualization, performance modeling for visualization, and many-core architectures.

Organization Committee:

Lan Li (BSU)
Leslie Kerby (ISU)
Eric Jankowski (BSU)
Steven Cutchin (BSU)
Mahmood Mamivand (BSU)
Mendi Edgar (BSU)
Lawrence Spear (BSU)
Hillary K. Fishler (CAES)