Friday, May 09, 2008

Teaching Parallelism

Uzi Vishkin wrote these ideas on how to teach a parallel computing course as a comment on my earlier parallelism post.

The basic claim is that:

  • It does not make sense for a new general-purpose parallel computing platform to succeed the established serial platform without a one-to-one match of EVERYTHING, including algorithms and data structures.
  • In particular, it does not make sense to teach parallel programming without teaching parallel algorithms and data structures. The gap between programming and algorithms must be bridged, so that the continuum from algorithms and data structures to programming resembles as much as possible the continuum in serial computing.
  • Since the PRAM theory is the only serious candidate developed in nearly 3 decades of research, PRAM algorithms have got to be taught.
I expect theorists to endorse this argument and use it to convince their colleagues that PRAM algorithms need to be taught. But, I have to be frank. I am concerned that some of us will do the following: teach a course on parallel algorithms as a purely theory course WITHOUT any connection to programming. This will miss the point as it ignores the need to relate algorithms to programming. The Q&A at the end of this text elaborate further on the programming issue.

As others have implied, you can find several fine sources for PRAM algorithms. For this reason, my comments below mostly focus on a way to address the parallel programming issue:

  1. In-class presentation.
    1. Read Section 2.1, entitled XMTC, in FPGA-Based Prototype of a PRAM-On-Chip Processor. It reviews a modest extension to the C programming language, called XMTC, that allows PRAM-like programming. XMTC essentially adds only two basic commands to C: Spawn and PS (for prefix-sum); a rough sketch appears right after this list.
    2. Devote a total of around 15-20 minutes, following slides 37-39 in these slides, to present XMTC. Slide 40 can guide a discussion.
  2. Supporting documentation. The students should then be referred to: the XMTC Manual and the XMTC tutorial.
  3. Programming assignments. Please see the assignments section of this course page.
  4. Running programming assignments. The UMD PRAM-On-Chip project is on track to publicly release, by the end of June 2008:
    1. a cycle accurate simulator of the PRAM-On-Chip machine, and
    2. a compiler from XMTC to that machine.
    This will allow your students to run XMTC code on an emulated 64-processor PRAM-On-Chip machine. As a reminder, a hardware prototype of such a machine (using FPGA technology) has been in use at UMD since January 2007. A compiler that translates XMTC to OpenMP will also be released, giving your students an alternative way to run their assignments.
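
To give a flavor of what such code looks like, here is a rough, hypothetical sketch of XMTC-style array compaction built only from the two constructs mentioned above (a Spawn block and the PS prefix-sum primitive). The exact syntax and declarations should be taken from the XMTC Manual and tutorial, not from this sketch.

    /* Hypothetical XMTC-style sketch: compact the nonzero entries of A into B.
       Assumed semantics (per the description above, not the manual): a spawn
       block creates one virtual thread per index, $ names the current thread,
       and ps(inc, base) atomically adds inc to the shared variable base and
       returns base's old value in inc. */
    int A[N], B[N];
    int base = 0;

    spawn(0, N - 1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);      /* inc now holds a unique slot index in B */
            B[inc] = A[$];
        }
    }
    /* After the spawn block, base equals the number of nonzero entries of A. */
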
Finally, please note that this type of programming cannot be too difficult. In Fall 2007 I gave a one-day parallel algorithms tutorial to a dozen high school students, and subsequently some of them managed to complete 8 programming assignments on their own. In fact, the above link to programming assignments gives these 8 assignments. The only help the high school students got was one office hour per week with an undergraduate teaching assistant. They did not get any school credit for their work; their participation was in the context of a computer club, after completing their regular school work (8 periods per day).

If you are looking for code examples, you are welcome to write to me.

Here are some Q&A:

Q: I never learned parallel programming formally, but I picked up some ideas in my free time from Java/MPI/OpenMP/etc. How do any of these relate to XMTC parallel programming?

A: XMTC parallel programming is simpler and different.

Q: The problem of algorithms being taught independently of programming is present within the exclusively serial world. What would you say to the many theorists who are resistant to the idea of having a heavy programming component in their courses?

A: IMHO the serial case is completely different. Most students have experienced/learned serial programming BEFORE taking the serial algorithms course. This is NOT the case for parallel programming. My experience is that students learn best if parallel programming is coupled with parallel algorithms. The main difference is that the parallel algorithms course is where parallel programming should be FIRST taught. The reason is that parallelism requires introduction of some first principles representing an "alien culture" to students. In contrast, serial computing is related to: (i) mathematical induction, (ii) the way our brain instructs our body (as a single processor), etc. There is nothing out there that prepares us for parallel computing.

Q: What text do you use for teaching parallel algorithms?

A: I have been using my class notes.

Warm thanks to Adam Smith and Aravind Srinivasan for their helpful comments on an earlier draft of this text.

13 comments:

  1. What about teaching parallel algorithms using the circuit model instead of the PRAM model?

  2. I'm excited to see someone suggesting that teaching theory with a programming component is a good idea, at least for parallel programming. Though I think that some of the same arguments (and other good ones) also suggest doing at least some programming in a "serial" theory class as well.

  3. Guy Blelloch's Parallel Algorithms class seems to have a nice mix of algorithms and programming.

  4. PRAM doesn't scale. Even serial machines today have memory access distributed into banks. Don't spend too much time on it.

    The backbone of most parallel algorithms today is parallel prefix. Aggregating and distributing information in time logarithmic in the number of processors is crucial (a minimal sketch of the pattern appears at the end of this comment).

    As for programming libraries, MPI is *the* standard.

    As for theory, cover the classes NC and P-complete. Very few texts today point out that there are problems that are serial in nature.

    I would probably assign Vipin Kumar's book and Pacheco's MPI book.
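
    As a minimal illustration of the parallel-prefix pattern mentioned above (an inclusive prefix sum computed in a logarithmic number of rounds), here is a sketch written as plain serial C; the inner loop of each round is the step that a parallel machine would execute concurrently, one processor per index.

        #include <stdlib.h>
        #include <string.h>

        /* Inclusive prefix sum in ceil(log2(n)) rounds.  On a parallel machine
           the inner loop of each round is a single parallel step; it is written
           serially here so the sketch compiles as ordinary C. */
        void prefix_sum(int *x, int n) {
            int *tmp = malloc(n * sizeof *tmp);
            for (int d = 1; d < n; d *= 2) {              /* O(log n) rounds       */
                for (int i = 0; i < n; i++)               /* conceptually parallel */
                    tmp[i] = (i >= d) ? x[i] + x[i - d] : x[i];
                memcpy(x, tmp, n * sizeof *x);
            }
            free(tmp);
        }
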

  5. There has unfortunately been a disconnect between the theory community and the community of people who actually engage in large-scale parallel computing applications. The primary reason for doing parallelism is for doing large-scale problems, and the architectures that have proved most successful in practice are the shared-nothing architectures. The reasons are primarily economic and physical - nobody has figured out how to build anything resembling a PRAM at large scale that will compete with shared-nothing architectures on price/performance or peak performance on real problems.

    Now that we are starting to see a lot of multi-core chips, there should be renewed interest in discussion of parallel algorithms in the PRAM model. This won't come close to the needs of the truly high performance computing world, but that doesn't mean it shouldn't be taught. It would probably be more useful to focus on aspects of parallel computing that downplay asymptotic analysis on PRAMs with large numbers of processors, because this is the area where the PRAM has been displaced by shared-nothing architectures.

    Perhaps there are opportunities for theoreticians to explain whether there are fundamental physical limitations on RAM architectures, or whether we just haven't invented the right hardware. I don't know the answer there, but there seems to be a disconnect between theory and practice on this subject.

  6. There are important reasons why PRAMs failed as a model of parallelism, and Uzi fails to address them. One only needs to read the long list of papers on alternatives to the PRAM to see why the PRAM was not implementable in practice.

    Twenty years later, we finally have parallelism in practice in the form of multicore computers, and they look nothing like the PRAMs of yore.

  7. First, many thanks to Lance Fortnow for reposting my second set of comments (dated April 28) following his April 16, 2008 blog.
    I think that comments 4-6 above missed my first set of comments (dated April 17), which provided the needed context. In particular, it noted the 64-processor PRAM-On-Chip hardware prototype built at UMD ( http://www.umiacs.umd.edu/users/vishkin/XMT/CompFrontiers08.pdf ). This prototype finally showed that a machine that can look to the programmer like a PRAM can be built, and, in fact, the basic XMT (explicit multi-threaded) architecture can be scaled to 1000 on-chip processors.
    In any case, I am responding below to comments 4-6.
    Comment: Even serial machines today have memory access distributed into banks.
    Answer: The architecture goal is to build a machine that can look to the programmer like a PRAM. (Namely, PRAM is not an architecture. It is just a computation model, much like the serial RAM used in many serial algorithms textbooks.)
    A 1984 paper by Mehlhorn-Vishkin already showed how, using hashing, one can implement the PRAM on memory that is distributed into banks (a toy sketch of the hashing idea appears at the end of this comment). In fact, the 64-processor PRAM-On-Chip machine built at UMD does exactly that (please see the bottom left of Figure 5 in http://www.umiacs.umd.edu/users/vishkin/XMT/CompFrontiers08.pdf ).
    Comment: As for programming libraries MPI is *the* standard.
    Answer: This is a terrible standard. Indeed, the NSF Blue-Ribbon Panel on Cyberinfrastructure wrote a few years ago: To many users programming existing parallel computers is “as intimidating and time consuming as programming in assembly language”. The full report is on the NSF web site.
    Comment: The primary reason for doing parallelism is for doing large-scale problems
    Answer: This was indeed true in the past. However, when your desktop has hundreds of thousands of processors, its focus will not be large weather-prediction programs or anything similar. Instead, this will be the default machine that everybody uses. Every CS major will have to know how to program this machine and reason about algorithms and data structures for it.
    Comment: There are important reasons why PRAMs failed as a model of parallelism, and Uzi fails to address them.
    Answer: There is only one fatal reason for the failure to build machines that looked to the programmer like a PRAM in the 1990s: the bandwidth between processors and memories was limited. The programmer was forced to program around these limitations, in contrast to the PRAM abstraction. (By the way, the nightmare of doing such programming is nicely reflected in the above quote from the NSF Cyberinfrastructure Panel.) However, this problem goes away on-chip. The UMD PRAM-On-Chip project provided concrete evidence: a 9mm by 5mm chip using IBM 90nm technology prototyped the interconnect network; see the Hot Interconnects 2007 paper ( http://www.umiacs.umd.edu/users/vishkin/XMT/hotI07-paper.pdf ) and a Design Automation 2008 paper ( http://www.umiacs.umd.edu/users/vishkin/XMT/DAC08-proc.pdf ).
    In any case, the strongest answer is the 64-processor hardware PRAM-On-Chip prototype that we actually built.
    Comment: One only needs to read the long list of papers on alternatives to the PRAM to see why the PRAM was not implementable in practice.
    Answers: 1. A PRAM-On-Chip machine has been in use at UMD since January 2007. 2. This long list of papers actually demonstrates that numerous researchers were very motivated to find a replacement for the PRAM. However, as is widely known, no other parallel algorithmic model came even close to the wealth of PRAM knowledge. In other words, today's reality is that a machine that can look to the programmer like a PRAM can be built, and no substitute for the PRAM is available.
    Comment: Twenty years later, we finally have parallelism in practice in the form of multicore computers, and they look nothing like the PRAMs
    Answer: Parallelism was available in practice twenty and ten years ago, as well. However, those machines were way too difficult to program, just like the multicores of today. The essence of my April 17 set of comments, which were based on my panel presentation at IPDPS a day earlier, was that processor vendors bet the future of the field on the same architecture approaches that failed before.
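
    To illustrate the hashing idea referred to above, here is a toy C sketch that spreads memory addresses across banks with a randomly chosen hash function. The constants and the specific hash family are made up for illustration; they are not the scheme of the Mehlhorn-Vishkin paper or of the UMD machine.

        #include <stdint.h>

        /* Toy illustration only: map each memory address to one of NUM_BANKS
           banks using randomly chosen constants a and b, so that typical access
           patterns spread roughly evenly across the banks.  Overflow and the
           exact hash family are glossed over in this sketch. */
        #define NUM_BANKS 64

        static const uint64_t P = (1ULL << 61) - 1;    /* a Mersenne prime        */
        static uint64_t a = 0x9E3779B97F4A7C15ULL;     /* drawn at random, a != 0 */
        static uint64_t b = 0xD1B54A32D192ED03ULL;     /* drawn at random         */

        unsigned bank_of(uint64_t addr) {
            return (unsigned)(((a * addr + b) % P) % NUM_BANKS);
        }
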

  8. There is only one fatal reason for the failure to build machines that looked to the programmer like a PRAM in the 1990s: the bandwidth between processors and memories was limited.

    There are other reasons. For one, people saw that it was too costly to program them, and hence it was cheaper to buy a superfast serial computer or, alternatively, to build an easier-to-program distributed solution (say, a la Google).

    However, those machines were way too difficult to program, just like the multicores of today.

    Multicores today can easily be used to good advantage with Java threads (granted, this won't work so well when we reach 16+ cores). Programming the PRAMs of yore was vastly more complicated, with the assumption of tightly coupled parallelism and a large number of processors.

  9. Upgrading to a faster serial computer worked great till 2003. But not anymore.
    PRAM algorithms seek to shorten single-task completion time. Java threads are more typically used for handling different tasks concurrently. I beg to differ: my experience is that tightly coupled parallelism is much easier for the programmer than loosely coupled parallelism.
    I agree that the classic PRAM approach is tied to having a large number of processors. I felt it was important to comment on this matter, as some theorists may identify PRAM algorithms with NC (the class of problems that have a polylogarithmic-time parallel algorithm using a polynomial number of processors). This indeed complicated matters, as it almost always required new algorithms that are completely different from the serial ones.
    In contrast, the PRAM approach that many, including me, advocate teaching (and is supported by the UMD XMT architecture) is based on algorithms whose total number of operations is not significantly larger than that of the best serial algorithm. Please see my class notes for more formal definitions and for how to abstract hardware concepts such as the number of processors, but let me just note here that having a large number of (PRAM) processors is not a crucial condition for an algorithm to be effective. Furthermore, even a known serial algorithm coupled with parallel data structures can often provide a pretty effective parallel algorithm. For example, consider breadth-first search (BFS) on graphs, or even depth-first search (DFS) on graphs, as discussed in exercises 35 and 36 in my class notes (a sketch of the level-by-level BFS structure appears at the end of this comment).
    Overall, the XMT architecture seeks competitive performance (with any same-generation computer) on WHATEVER amount of parallelism the application provides. In particular, XMT provides serial compatibility (competitive performance on serial code with the strongest serial machine). Finally, the most important point for this posting is that the XMT-PRAM approach views (and teaches) parallel algorithms (and programming) as a natural extension of serial algorithms (and programming). Namely, a serial algorithm is merely the special case where only one operation takes place at a time.
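
    To make the BFS example concrete, here is a plain-C sketch (not taken from the class notes) of level-synchronous BFS on a graph in adjacency-array (CSR) form; on a PRAM/XMT-style machine the loop over the current frontier would be a single parallel step, with a prefix-sum assigning each newly discovered vertex a unique slot in the next frontier.

        #include <stdlib.h>

        /* Level-synchronous BFS sketch.  level[v] = distance from src, or -1 if
           unreached.  The loop over the frontier is the conceptually parallel
           step; it is written serially here so the sketch compiles as plain C.
           off/adj: adj[off[v] .. off[v+1]-1] lists the neighbors of vertex v. */
        void bfs(int n, const int *off, const int *adj, int src, int *level) {
            int *frontier = malloc(n * sizeof *frontier);
            int *next     = malloc(n * sizeof *next);
            for (int v = 0; v < n; v++) level[v] = -1;

            level[src] = 0;
            frontier[0] = src;
            int fsize = 1, depth = 0;

            while (fsize > 0) {
                int nsize = 0;
                for (int i = 0; i < fsize; i++) {          /* conceptually parallel */
                    int v = frontier[i];
                    for (int e = off[v]; e < off[v + 1]; e++) {
                        int w = adj[e];
                        if (level[w] == -1) {              /* in parallel: claim w  */
                            level[w] = depth + 1;          /* atomically and take a */
                            next[nsize++] = w;             /* slot via prefix-sum   */
                        }
                    }
                }
                int *t = frontier; frontier = next; next = t;
                fsize = nsize;
                depth++;
            }
            free(frontier);
            free(next);
        }
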

  10. my experience is that tightly coupled parallelism is much easier for the programmer than loosely coupled parallelism.

    Depending on what you mean, we might agree or disagree on this. A GPU-style vector/scalar architecture is indeed easier to program than a multi-threaded model. A full MIMD, tightly-synched model is a lot harder to program.

    This indeed complicated matters, as it almost always required new algorithms that are completely different from the serial ones.

    Exactly. This substantially decreased the chances of adoption for the PRAM.

    Curiously, this is relevant to the discussion of STOC/FOCS chasing technical difficulty for its own sake: the difficult PRAM algorithms lie in the vicinity of machines with a polynomial number of processors, and hence that is what was studied.

  11. I really appreciate the interest, but I think that there is no substitute for reading the material reviewed in the original posting.
    In fact, the XMT-PRAM approach uses neither Vector nor MIMD. Instead, it builds on a known approach that lies somewhere between Vector and MIMD.

  12. In fact, the XMT-PRAM approach uses neither Vector nor MIMD. Instead, it builds on a known approach that lies somewhere between Vector and MIMD.

    It does. In fact, it is sufficiently different from the PRAM model of old that I'm a bit surprised by the emphasis on PRAMs in your postings here. I think the models with the highest chance to succeed would cut loose substantial portions of the old PRAM model, such as your proposed move from fully synchronous MIMD to a coarse-grained, thread-based Single-Program-Multiple-Data machine.

    This is as much a PRAM as it is a BSP machine. Why such outward emphasis on fully embracing the old PRAM model when in reality something quite improved and substantially better is being proposed?

    - Since the PRAM theory is the only serious candidate developed in nearly 3 decades of research, PRAM algorithms have got to be taught.


    To sum up: yes, parallelism has earned its way back into the curriculum, but this time around the focus should be on newer models of parallelism, such as XMT-PRAM, BSP, Cilk, LoPRAM, Paraleap, CUDA, etc. The old PRAM should serve as inspiration and as a source of techniques, but there is no need to teach fresh minds the old model, warts and all, in an introductory course.

  13. The characterization “coarse-grained” is not accurate, and it is the basis for the rest of your argument.
    May I please suggest that we schedule a time for a phone meeting in which I will try to address your questions? Please send me an e-mail to my last name at umd.edu to schedule a time.
