Projects - Spring 2019
Extreme-scale data science at the convergence of big data and massively parallel computing is enabling simulation, modelling and real-time analysis of complex natural and social phenomena at unprecedented scales. The aim of the project is to gain practical experience with this interplay by applying parallel computation principles to solve a compute- and data-intensive problem.
Your final project is to solve a data-intensive or compute-intensive problem with parallel processing on the AWS cloud or on Harvard’s supercomputer, Odyssey (or both!). You will identify a compute and/or data science problem, analyse its compute scaling requirements, collect the data, design and implement parallel software, and demonstrate scaled performance of an end-to-end application.
Project Requirements: Your project should:
- Demonstrate the need for big compute and/or big data processing, and what can be achieved thanks to large-scale parallel processing.
- Solve a problem with a non-trivial computation graph and hierarchical parallelism.
- Be implemented on a distributed-memory architecture with either many-core or multi-core compute nodes, and evaluated on at least 8 compute nodes (note: each compute node on Odyssey is either a multi-core node with 32 or 64 cores, or a node with a many-core GPU with hundreds of cores).
- Use a hybrid parallel programming model, for example: MPI + OpenMP; MPI + OpenACC (or OpenCL); Spark or MapReduce + OpenACC (or OpenCL); or MPI + Spark or MapReduce.
- Be evaluated on large data sets or problem sizes to demonstrate both weak and strong scaling using appropriate metrics (throughput, efficiency, iso-efficiency...).
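As an illustration of the scaling metrics named in the last requirement, the following is a minimal sketch of how speed-up and efficiency could be derived from measured run times. The timings are hypothetical placeholders, not results from any real system, and the function names `strong_scaling` and `weak_scaling` are chosen here for illustration only.

```python
# Hypothetical wall-clock timings in seconds, keyed by number of compute nodes.
# Strong scaling: the total problem size is fixed as nodes are added.
strong_times = {1: 800.0, 2: 420.0, 4: 230.0, 8: 130.0}
# Weak scaling: the problem size grows in proportion to the node count.
weak_times = {1: 100.0, 2: 104.0, 4: 110.0, 8: 125.0}


def strong_scaling(times):
    """Return {p: (speedup, efficiency)} relative to the smallest run."""
    base_p = min(times)
    base_t = times[base_p]
    result = {}
    for p in sorted(times):
        speedup = base_t / times[p]
        # Efficiency is the speed-up divided by the increase in resources.
        efficiency = speedup / (p / base_p)
        result[p] = (speedup, efficiency)
    return result


def weak_scaling(times):
    """Return {p: efficiency}, where efficiency = T(base) / T(p)."""
    base_t = times[min(times)]
    return {p: base_t / times[p] for p in sorted(times)}


if __name__ == "__main__":
    for p, (s, e) in strong_scaling(strong_times).items():
        print(f"strong  p={p:2d}  speedup={s:5.2f}  efficiency={e:5.2f}")
    for p, e in weak_scaling(weak_times).items():
        print(f"weak    p={p:2d}  efficiency={e:5.2f}")
```

In a strong-scaling study the ideal efficiency stays at 1.0 as nodes are added; in a weak-scaling study the ideal run time stays constant, so the ratio of base time to measured time is the efficiency.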
Group Size: Students are required to form teams and to partition the work among the team members. The final project must be done in teams with 4 students each (exceptions by permission of the instructor). You can use the course forum to find prospective team members. You may also find and discuss project ideas on the forum. In general, we do not anticipate that the grades for each group member will be different. However, we reserve the right to assign different grades to each group member if it becomes apparent that one of them put in a vastly different amount of effort than the others.
Project Milestones: There are five milestones for your final project. It is critical to note that no extensions will be given for any of these milestones for any reason. Projects submitted after the final due date will not be graded.
- Team formation (3/29 - Tentative)
- In-class presentation of your project proposal (4/16 - Tentative)
- In-class presentation of your progress with the design of the project (4/25 - Tentative)
- The final project deliverables submission (5/15 - Tentative)
- Project presentation to teaching staff (5/16 - Tentative)
Project Proposal: Your group needs to present a project proposal (and submit the PDF of the presentation) to the teaching staff with the following sections:
- What is the problem you are trying to solve with this application?
- What is the need for big compute and/or big data processing and what can be achieved thanks to large-scale parallel processing?
- Describe your model and/or data in detail: where does it come from, what does it mean, etc.
- Which tools and infrastructure are you planning to use to build the application?
The aim of the session is to review each proposal; we may suggest modifications if necessary. Our main concern is the amount of effort a given project will require; either too much or too little is unacceptable. You will have 5, and ONLY 5, minutes to briefly summarize your proposal, followed by 5 minutes of discussion time. You have to prepare 2-3 slides for your proposal. We will enforce the 5-minute time limit. This presentation is a chance for you to get feedback.
Project Progress (Design): Your group needs to present a project progress report (and submit the PDF of the presentation) to the teaching staff, covering the main aspects of the design of the parallel application, with the following sections:
- Define the type of your application, the levels of parallelism exploited, the types of parallelism within the application, and the parallel execution model that will be used to build the parallel application
- If code exists, provide an analysis/profiling of the existing sequential/parallel code
- Describe the main overheads (communication, synchronization, load balancing and sequential sections) in the parallelization and techniques that will be applied to mitigate them
- Describe the numerical complexity of the algorithm
- Estimate the theoretical speed-up and scalability expected
The aim of the session is to briefly summarize your progress with a focus on the design part. You will have 5, and ONLY 5, minutes to briefly summarize your progress. You have to prepare 2-3 slides for your progress. We will enforce the 5-minute time limit. This presentation is a chance for you to get feedback from the teaching staff, and to come up with ways around roadblocks you encounter. It is also a chance for the staff to ensure that your project is on track, and that your project is still in the appropriate-amount-of-work range.
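For the theoretical speed-up estimate requested in the design presentation, one common starting point is Amdahl's law for a fixed problem size and Gustafson's law for a problem that grows with the processor count. The sketch below assumes you have already estimated the serial fraction of your code (for example, from profiling); the value 0.05 is a hypothetical placeholder, and neither law accounts for communication or load-imbalance overheads.

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: speed-up on p processors for a fixed problem size."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def gustafson_speedup(serial_fraction, p):
    """Gustafson's law: scaled speed-up when the problem grows with p."""
    return p - serial_fraction * (p - 1)


if __name__ == "__main__":
    f = 0.05  # hypothetical serial fraction, to be replaced by a measured value
    for p in (1, 2, 4, 8, 16, 32, 64):
        print(f"p={p:3d}  Amdahl={amdahl_speedup(f, p):6.2f}  "
              f"Gustafson={gustafson_speedup(f, p):6.2f}")
    # Amdahl's law bounds the speed-up by 1/f regardless of p.
    print(f"Amdahl upper bound: {1.0 / f:.1f}")
```

Comparing these estimates with measured speed-up later in the project helps separate the cost of the sequential sections from the cost of communication and synchronization.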
Final Project Deliverables: The final deliverables are:
- Web site as final report
- Software with evaluation data sets and test cases (in a GitHub repository)
- Presentation to the teaching staff
Project Web Site: An important piece of your final project is a public web site that describes all the great work you did for your project. The web site serves as the final project report, and needs to describe your complete project. You can use GitHub Pages, or the README file on the GitHub repository, so you can easily refer to the software at the GitHub repository. You should assume the reader has no prior knowledge of your project and has not read your proposal. It should address the following aspects:
- Description of problem and the need for HPC and/or Big Data
- Description of solution and comparison with existing work on the problem
- Description of your model and/or data in detail: where did it come from, how did you acquire it, what does it mean, etc.
- Technical description of the parallel application, programming models, platform and infrastructure
- Links to repository with source code, evaluation data sets and test cases
- Technical description of the software design, code baseline, dependencies, how to use the code, and system and environment needed to reproduce your tests
- Performance evaluation (speed-up, throughput, weak and strong scaling) and discussion about overheads and optimizations done
- Description of advanced features like models/platforms not explained in class, advanced functions of modules, techniques to mitigate overheads, challenging parallelization or implementation aspects...
- Final discussion about goals achieved, improvements suggested, lessons learnt, future work, interesting insights…
Your web page should include screenshots of your software that demonstrate how it functions. You should include a link to your source code.
Project Software: Your final project can be implemented using any API or programming language you would like. Make your own repository on GitHub with a link to your project web page. The software, together with evaluation data sets and test cases, should be available in the repository. Include a README that describes the code and application files, and how your program should be run. We will be grading these projects on a variety of platforms, so you must include detailed instructions on how to compile and run your code. If we cannot run your application from the instructions included with your submission, we will not be able to grade this portion of your project. Your performance results should be reproducible, so you should provide all the information about the system and the environment needed to reproduce your tests.
Project Presentations: You will have 10, and ONLY 10, minutes to briefly present your project, followed by 5 minutes of discussion time. You may prepare 4-5 slides for your summary, but we will enforce the 10-minute time limit. Focus the majority of your presentation on your main contributions rather than on technical details. What do you feel is the coolest part of your project? What insights did you gain? What is the single most important thing you would like to show the class? Upload the presentation to the GitHub repo and to Canvas.
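One way to support the reproducibility requirement is to record the execution environment alongside each set of results. The following is a minimal sketch using only the Python standard library; the function name `collect_environment` and the chosen fields are assumptions for illustration. A real project would likely also record compiler versions, MPI or Spark versions, loaded modules and instance types, which this sketch does not capture.

```python
import json
import os
import platform
import sys


def collect_environment():
    """Gather basic system facts that affect performance results."""
    return {
        "hostname": platform.node(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "python_version": sys.version.split()[0],
    }


if __name__ == "__main__":
    # Print as JSON so the record can be committed next to the results.
    print(json.dumps(collect_environment(), indent=2))
```

Running the script on each compute node type used in the evaluation and committing the output to the repository gives readers a baseline for comparing their own runs.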
Project Grading: The final project grades are dependent on the following criteria:
- Attempted difficulty: Some projects are harder than others. For example, a project based on one of the homework assignments is probably easier than a completely new application.
- Did you meet your major goals? The most important grading criterion is functionality: A working program will always garner the majority of available points; no credit will be given for non-working programs. A modest solution that works will be graded much more favorably than an ambitious "solution" that core dumps!
Projects will be graded on the depth of work undertaken and on communication (web site, presentation):
- 10%: Project proposal (in-class presentation)
- 20%: Project design (in-class presentation)
- 10%: Problem description and comparison with existing work
- 10%: Parallel application, programming model, platform and infrastructure
- 10%: Software source code, design and documentation
- 10%: Performance evaluation and discussion
- 20%: Advanced features
- 10%: Final discussion and deliverables quality