Contents - Spring 2019

Computational science has become a third partner, together with theory and experimentation, in advancing scientific knowledge and practice, and an essential tool for product and process development and manufacturing in industry. Big data science adds the ‘fourth pillar’ to scientific advancements, providing the methods and algorithms to extract knowledge or insights from data.

The course is a journey into the foundations of Parallel Computing at the intersection of large-scale computational science and big data analytics. Many science communities are combining high performance computing and high-end data analysis platforms and methods in workflows that orchestrate large-scale simulations or incorporate them into the stages of large-scale analysis pipelines for data generated by simulations, experiments, or observations.

This is an applications course highlighting the use of modern computing platforms in solving computational and data science problems, enabling simulation, modeling and real-time analysis of complex natural and social phenomena at unprecedented scales. The class emphasizes making effective use of the diverse landscape of programming models, platforms, open-source tools, computing architectures and cloud services for high performance computing and high-end data analytics.

In the last decade we have seen many technology developments and shifts in provisioning paradigms that have changed the way we design, develop, share and execute applications.

As scientists and engineers, we have to dig deeper than buzzwords. Behind these rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular platform you are using. The course focuses on teaching those principles, where programming models and platforms fit in, how to make good use of them, and how to avoid their pitfalls. The course does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation covering those topics. Instead, we discuss the various principles and trade-offs that are fundamental in using modern computing platforms for computational and data science.


Summary

Computational Science and Data Science applications can be compute-intensive, also known as big compute, or data-intensive, also known as big data. Whereas compute-intensive applications bring the data to the compute, data-intensive applications do the opposite by bringing the compute to the data.

Compute-intensive applications consist of a large number of independent (HTC: High Throughput Computing) or parallel (HPC: High Performance Computing) tasks that run simultaneously, each addressing a particular part of the problem. Compute power is the main bottleneck, and the developer has to use one of the existing parallel programming models (mainly DRMAA for multi-task, OpenMP for shared-memory, or MPI for distributed-memory execution) to decompose the application into tasks and define their communication and synchronization.
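
As a rough illustration of decomposing a compute-intensive problem into independent tasks, the sketch below estimates pi by splitting a numerical integration into chunks and summing the partial results. It is not taken from the course material: the function names (partial_sum, estimate_pi) and parameters are hypothetical, and Python's standard thread pool stands in for the models named above (DRMAA, OpenMP, MPI). The point is the decompose, execute and combine structure rather than actual speedup, since Python threads do not run CPU-bound code in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """Midpoint-rule contribution of one chunk to the integral of 4/(1+x^2) on [0, 1]."""
    start, stop, n = bounds
    h = 1.0 / n
    return sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(start, stop)) * h


def estimate_pi(n=100_000, workers=4):
    """Estimate pi by splitting n integration points into independent tasks."""
    # Decompose: split the index range [0, n) into roughly one chunk per worker.
    step = (n + workers - 1) // workers
    chunks = [(s, min(s + step, n), n) for s in range(0, n, step)]
    # Execute the independent tasks; collecting the results is the only
    # synchronization point, because the tasks share no state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


if __name__ == "__main__":
    print(estimate_pi())
```

Because the tasks never communicate with each other, this corresponds to the independent (HTC) case; a parallel (HPC) version would additionally exchange data between tasks during the computation.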

Data-intensive applications, on the other hand, usually require the same computation to be applied to large volumes of data. The communication and storage of the data become the main bottleneck, and the developer has to use one of the existing data processing models (mainly MapReduce/Hadoop and Spark) to partition the data into multiple segments, process the segments in parallel with the same task, and then combine the intermediate results over multiple stages.
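
The partition, process and combine pattern described above can be sketched in a few lines of plain Python. This is a single-process illustration of the MapReduce idea under simplifying assumptions, not the Hadoop or Spark API: all function names (partition, map_segment, shuffle, reduce_groups, word_count) are hypothetical, and the segments are processed one after another where a real platform would distribute them across nodes.

```python
from collections import defaultdict


def partition(records, n_segments):
    """Split the input into n_segments roughly equal segments."""
    return [records[i::n_segments] for i in range(n_segments)]


def map_segment(segment):
    """Apply the same task to one segment: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for line in segment for word in line.split()]


def shuffle(mapped_segments):
    """Group the intermediate pairs from all segments by key."""
    groups = defaultdict(list)
    for pairs in mapped_segments:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Combine the intermediate values of each key into a final count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_segments=3):
    segments = partition(lines, n_segments)
    # Each segment is independent, so this loop is the part a data
    # processing platform would execute in parallel.
    mapped = [map_segment(segment) for segment in segments]
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big compute", "data to the compute"]))
```

The result does not depend on how the input is partitioned, which is what allows a platform to choose the number of segments to match the available resources.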

The course gives a practical overview of parallel programming models for compute- and data-intensive applications, and the existing platforms, open-source tools and cloud services to support their execution. The design of efficient parallel programs is approached as an algorithm-architecture problem, meaning that a solution is architecture-dependent. After the course, you will be in a good position to decide which kind of programming model and platform is appropriate for which purpose, and to understand how tools should be combined to form the foundation of the right application architecture to meet the scalability and performance requirements of the most demanding computational problems.


Outline

Introduction. Large-Scale Computational and Data Science

Why we need performance and parallel processing in Computational and Data Science applications, and an overview of the course.

A. Parallel Processing Fundamentals

The different application execution profiles and forms of parallel computing, modern computing architectures and provisioning models for scalable large-scale processing, and aspects to consider in the design, development and distribution of parallel codes.

A.1. Parallel Processing Architectures

A.2. Large-scale Processing on the Cloud

A.3. Practical Aspects of Cloud Computing

A.4. Application Parallelism

A.5. Designing Parallel Programs

B. Parallel Computing

Techniques for code optimization and accelerated computing, and efficient execution of a large number of loosely-coupled tasks. Parallel processing of compute-intensive applications on shared- and distributed-memory systems.

B.1. Foundations of Parallel Computing

B.2. Performance Optimization

B.3. Accelerated Computing

B.4. Shared-memory Parallel Processing

B.5. Distributed-memory Parallel Processing

C. Parallel Data Processing

Parallel processing of data-intensive applications.

C.1. Foundations of Data Processing

C.2. Batch Data Processing

C.3. Dataflow Processing

C.4. Stream Data Processing

D. Advanced Topics

Summary of the course and future directions.