Parallel computation

In parallel computation, multiple program instructions are executed at the same time. This parallelism can range from coarse grained to fine grained. An example of coarse-grained parallelism is a ray tracer rendering a separate output pixel on each available CPU. An example of fine-grained parallelism is multiplying an n-dimensional vector by a scalar: the vector is cut into chunks, a chunk is distributed to each CPU, the multiplication is done on each chunk, and the result is reassembled.
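
As a concrete illustration of the fine-grained case, the following sketch (in C with POSIX threads) scales a vector by a constant, handing one contiguous chunk to each thread. The thread count, vector size, scalar value, and names such as scale_chunk are illustrative choices, not anything prescribed above.

  /* Fine-grained parallelism sketch: scale a vector by a constant,
   * one contiguous chunk per thread.  Sizes and counts are arbitrary. */
  #include <pthread.h>
  #include <stdio.h>

  #define N        1000000
  #define NTHREADS 4

  static double vec[N];

  struct chunk { int start; int end; double scalar; };

  static void *scale_chunk(void *arg)
  {
      struct chunk *c = arg;
      for (int i = c->start; i < c->end; i++)
          vec[i] *= c->scalar;
      return NULL;
  }

  int main(void)
  {
      pthread_t    threads[NTHREADS];
      struct chunk chunks[NTHREADS];

      for (int i = 0; i < N; i++)
          vec[i] = 1.0;

      /* Distribute: give each thread a contiguous slice of the vector. */
      for (int t = 0; t < NTHREADS; t++) {
          chunks[t].start  = t * (N / NTHREADS);
          chunks[t].end    = (t == NTHREADS - 1) ? N : (t + 1) * (N / NTHREADS);
          chunks[t].scalar = 2.5;
          pthread_create(&threads[t], NULL, scale_chunk, &chunks[t]);
      }

      /* Reassemble: wait for every chunk to finish before using the result. */
      for (int t = 0; t < NTHREADS; t++)
          pthread_join(threads[t], NULL);

      printf("vec[0] = %f\n", vec[0]);   /* prints 2.500000 */
      return 0;
  }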

These examples illustrate key limitations of parallel computing. First, there is the overhead of distributing and reassembling the computation. Second, there is the challenge of developing parallel algorithms. In the ideal case, such as in a ray tracer, elements of the problem (pixels to be rendered) do not depend on one another: computing one output pixel neither requires information from nor affects any of the other output pixels. These uncoupled problems are referred to as embarrassingly parallel. Many problems do not fit into this category; an example is the classic N-body problem.

Parallel algorithms and programs are significantly more difficult to write and debug. However, as multicore processors (many armed with SIMD instructions) become increasingly common, parallel computation is finding its way into more mass-consumer code [1] (e.g. MP3 encoders using SIMD, Photoshop multiprocessor support, etc.).

Note that it is generally possible to run parallel code on a single processor. Multiple processors are simulated through context switching. When running on a single processor, parallel algorithms typically have worse performance than their serial analogues, mainly because of the overhead of the distribution/communication logic.


Problem Domain

Algorithms

Hardware

Typically the CPU must offer some atomic operations so that parallel code can be written. There is also a famous algorithm, listed below, which takes advantage of memory interlock to achieve mutual exclusion.

  • Test and Set (TAS) (see the spinlock sketch after this list)
  • Compare and Swap (CAS), the basis of lock-free programming [2]
  • SIMD, MIMD instructions
    • SIMD: e.g. Intel SSE3 instructions, which operate on 128-bit registers that can hold 4 floats, 2 doubles, etc.
  • Memory Interlock
    • Dekker's Algorithm [3]
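
As an illustration of how such primitives are used, the sketch below builds a simple spinlock on top of test-and-set, here exposed through C11's atomic_flag. It is a minimal example of the hardware primitive, not a replacement for the higher-level mechanisms discussed under Software.

  /* Spinlock built on the test-and-set primitive (C11 atomic_flag). */
  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;

  static void spin_lock(void)
  {
      /* Test-and-set atomically sets the flag and returns its old value;
       * keep retrying while another thread already holds the lock. */
      while (atomic_flag_test_and_set(&lock))
          ;  /* busy-wait */
  }

  static void spin_unlock(void)
  {
      atomic_flag_clear(&lock);
  }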

Software

Low Level Primitives

  • Semaphores
    • Introduced by Dijkstra. [4]
    • Normally represented by an integer initialized to either 0 or 1
    • P operation - decreases the value of the semaphore by one; if the result is nonnegative, the process may continue; if the result is negative, the process is stopped and added to the semaphore's waiting list
    • V operation - increases the value of the semaphore by one; if the result is positive, there is no further effect; if the result is nonpositive, one of the processes on the semaphore's waiting list is removed from the list and may proceed with its execution
    • Supported in the POSIX standard (see the sketch after this list)
  • Monitors
  • Special purpose languages for parallel computation (Occam)
  • Pure functional languages
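
The following minimal C sketch shows the P and V operations through the POSIX semaphore interface (sem_wait corresponds to P, sem_post to V), with the semaphore initialized to 1 so that it guards a critical section. The worker function and shared counter are illustrative.

  /* P/V on a POSIX unnamed semaphore used as a mutex. */
  #include <semaphore.h>
  #include <pthread.h>
  #include <stdio.h>

  static sem_t sem;
  static int   shared_counter = 0;

  static void *worker(void *arg)
  {
      (void)arg;
      sem_wait(&sem);      /* P: decrement; block if the result would be negative */
      shared_counter++;    /* critical section */
      sem_post(&sem);      /* V: increment; wake one waiting process, if any */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;

      sem_init(&sem, 0, 1);   /* initial value 1: the semaphore acts as a mutex */
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %d\n", shared_counter);   /* always 2 */
      sem_destroy(&sem);
      return 0;
  }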

Compiler Support

  • Auto-vectorizing compilers
    1. The compiler attempts to analyze loops to see which ones can be executed in parallel (an example follows this list)
    2. Available in the Intel and GCCv4 compilers
  • Specialty languages for parallel programming
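
The kind of loop an auto-vectorizing compiler looks for is sketched below: every iteration is independent, so the body can be mapped onto SIMD instructions. The function name is illustrative, and the exact compiler flags vary by compiler and version (with GCC, something like -O3, which enables -ftree-vectorize, is typical).

  /* A loop with no dependency between iterations: a candidate for
   * automatic vectorization into SIMD instructions. */
  void saxpy(int n, float a, const float *x, float *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }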

Library Support

  • OpenMP
    1. Open standard
    2. Can be used to implement fine-grained parallelism
    3. Built-in support in many commercial compilers (Intel C/C++/FORTRAN, GCCv4, Sun CC)
    4. Uses meta-code to annotate which parts of a C or FORTRAN program can be parallelized
    5. In C this is accomplished through the use of preprocessor pragmas
    6. Pragmas indicate things like "execute this loop in parallel" or "this variable has no dependency between iterations" (see the sketch after this list)
  • Message Passing Interface (MPI)
    1. Well suited to cluster computing
    2. A major implementation is LAM/MPI; some vendors (e.g. Sun) also ship their own optimized versions
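
A minimal C sketch of the pragma-based annotation described above: the pragma asks for the loop to be executed in parallel, and the reduction clause declares that sum carries no dependency between iterations beyond the final accumulation. The array size is arbitrary, and compilation assumes an OpenMP-aware compiler (e.g. gcc -fopenmp).

  /* OpenMP: annotate a C loop as parallelizable with a pragma. */
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static double x[N];   /* static keeps the large array off the stack */
      double sum = 0.0;

      /* "Execute this loop in parallel"; iterations are split across threads. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++) {
          x[i] = i * 0.5;
          sum += x[i];
      }

      printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
      return 0;
  }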

Research

A major recent development has been the advancement of lock-free programming. This allows thread-safe code to be developed without the use of locking mechanisms such as semaphores [2]. While implementing correct lock-free algorithms is considered extremely difficult, they can show a significant performance advantage over their traditional counterparts. For example, the lock-free dynamic memory allocator implemented in [5] shows up to a 331% speedup over a traditional lock-based allocator when operating on 16 processors under maximum contention. This lock-free allocator also offers better scalability and latency than traditional allocators.
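
The core idiom behind such lock-free code is a retry loop around compare-and-swap instead of a lock. The C11 sketch below applies it to a shared counter; the function name is illustrative, and a real lock-free allocator such as the one in [5] is considerably more involved.

  /* Lock-free increment: retry a compare-and-swap until it succeeds. */
  #include <stdatomic.h>

  static _Atomic long counter = 0;

  long lockfree_increment(void)
  {
      long old_val, new_val;
      do {
          old_val = atomic_load(&counter);
          new_val = old_val + 1;
          /* CAS installs new_val only if counter still equals old_val;
           * otherwise another thread won the race and we retry. */
      } while (!atomic_compare_exchange_weak(&counter, &old_val, new_val));
      return new_val;
  }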

Related Topics

Citations

  1. Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, March 2005. <http://www.gotw.ca/publications/concurrency-ddj.htm>.
  2. Alexandrescu, Andrei. "Lock-Free Data Structures." C/C++ Users Journal, 1 September 2004. Retrieved 27 March 2007. <http://www.ddj.com/dept/cpp/184401865>.
  3. Dijkstra, E. W. Cooperating Sequential Processes (Technische Hogeschool, Eindhoven, 1965). Reprinted in: F. Genuys (ed.), Programming Languages, Academic Press, 1968, 43-112.
  4. Dijkstra, E. W. 1967. "The structure of the 'THE'-multiprogramming system." In Proceedings of the First ACM Symposium on Operating System Principles (SOSP '67), J. Gosden and B. Randell, Eds. ACM Press, New York, NY, 10.1-10.6. DOI= http://doi.acm.org/10.1145/800001.811672
  5. Michael, M. M. 2004. "Scalable lock-free dynamic memory allocation." SIGPLAN Notices 39, 6 (June 2004), 35-46. DOI= http://doi.acm.org/10.1145/996893.996848