Parallel Libraries to support High-Level Programming

Morten Nørgaard Larsen

Abstract

The development of computer architectures during the last ten years has forced programmers to move from writing sequential programs to writing parallel ones. The homogeneous multi-core architectures from major CPU producers such as Intel and AMD have led this trend, and the more exotic, though short-lived, heterogeneous CELL Broadband Engine (CELL-BE) architecture added to this shift. Furthermore, the use of clusters built from commodity hardware and of specialized supercomputers has greatly increased in both industry and academia. Finally, the growing use of graphics cards for general-purpose programming (GPGPU) means that programmers today must be able to write parallel programs that can utilize not only a small number of computational cores but perhaps hundreds or even thousands. However, most programmers will agree that doing so is not a simple task, and for many non-computer scientists, such as chemists and physicists writing programs to simulate their experiments, the task can easily become overwhelming.

During the last decades, considerable research effort has gone into creating tools that simplify writing parallel programs by raising the abstraction level, so that programmers can focus on implementing their algorithms rather than on the details of the underlying hardware. The outcome has ranged from the automatic parallelization of sequential programs to presenting a cluster of machines as if it were a single machine. In between lies a range of tools that help programmers handle communication, share data, run loops in parallel, run algorithms that mine huge amounts of data, and so on. Even though most of them perform well, almost all of them require programmers to learn a new programming language or at least force them to learn new methods or ways of writing code.

In the first part, this thesis focuses on simplifying the task of writing parallel programs, for programmers in general but especially for the large group of non-computer scientists. I start by presenting an extension, based on Communicating Sequential Processes (CSP), to a distributed shared memory system for the CELL Broadband Engine. This extension consists of a channel model and a thread library for the CELL’s specialized computational units, enabling them to run multiple CSP processes. Overall, the CSP model requires the programmer to think a bit differently, but the implemented algorithms perform very well, as shown by the initial tests presented.
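The abstract does not show the channel model's interface. As a rough illustration of CSP-style communication only, the following Python sketch implements an unbuffered (rendezvous) channel, assuming synchronous semantics in which a write blocks until a matching read. The `Channel` class and the `producer`/`consumer` processes are hypothetical and are not the thesis's CELL library.

```python
import threading


class Channel:
    """Unbuffered (rendezvous) channel in the spirit of CSP.

    write() blocks until a reader has taken the value, and read() blocks
    until a writer offers one. Hypothetical sketch, not the thesis's API.
    """

    def __init__(self):
        self._writer_slot = threading.Lock()    # serializes concurrent writers
        self._offered = threading.Semaphore(0)  # released when a value is offered
        self._taken = threading.Semaphore(0)    # released when the value is consumed
        self._value = None

    def write(self, value):
        with self._writer_slot:
            self._value = value
            self._offered.release()
            self._taken.acquire()               # rendezvous: wait for a reader

    def read(self):
        self._offered.acquire()
        value = self._value
        self._taken.release()
        return value


# Two CSP-style processes that communicate with each other only via the channel.
def producer(out, n):
    for i in range(n):
        out.write(i)
    out.write(None)                             # end-of-stream marker


def consumer(inp, results):
    while (v := inp.read()) is not None:
        results.append(v * v)


ch = Channel()
results = []
procs = [threading.Thread(target=producer, args=(ch, 5)),
         threading.Thread(target=consumer, args=(ch, results))]
for p in procs:
    p.start()
for p in procs:
    p.join()
print(results)  # [0, 1, 4, 9, 16]
```

Making the writer wait for the reader mirrors CSP's synchronous communication; a buffered channel would instead let the producer run ahead of the consumer.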

In the second part of this thesis, I change focus from the CELL-BE architecture to the more traditional x86 architecture and the Microsoft .NET framework. Normally, one would not think of the .NET framework when talking about scientific applications, but in the last couple of versions of .NET Microsoft has introduced a number of tools for writing parallel and high-performance code. The first section examines how programmers can run parts of a program, such as a loop, in parallel without directly programming the underlying hardware. The presented tool is able to run the loop body in parallel, including maintaining the consistency of any shared data the programmer accesses within it. Doing so involves implementing a distributed shared memory system along with the MESI protocol on top of the .NET framework. However, during implementation and testing it became clear that the lack of information about which shared data a method accesses greatly limits overall performance. Moreover, the overhead of building a distributed shared memory system along with a consistency model on top of .NET became too large.

Therefore, the work is repeated with another approach, which forces programmers to define what data a method will access when executed. Inspired by CSP, I define a set of rules dictating how programmers should write a method, including its input parameters, output values, and accesses to shared data. These rules provide more information, which in turn makes it possible to build a new tool that needs neither a distributed shared memory system nor a consistency model. Programmers can still simply invoke methods, and the tool transparently runs them in parallel on a platform consisting of workstations, servers, and cloud instances. Overall, this approach increases the effort required of programmers but greatly improves performance, as the initial tests show.
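For readers unfamiliar with the MESI protocol named in the abstract, the following Python sketch simulates its textbook state transitions (Modified, Exclusive, Shared, Invalid) for a single block held by several caches. The `SnoopingBus` class is a hypothetical illustration assuming a snooping bus; it tracks states only, whereas a distributed shared memory system like the one described would exchange messages over a network rather than snoop a bus.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class SnoopingBus:
    """Tracks the MESI state of one shared block in each of n caches.

    Hypothetical simulation of the textbook protocol; it records states only
    and omits the data transfers and write-backs a real system performs.
    """

    def __init__(self, n):
        self.states = [State.I] * n

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return                              # hit in M, E or S: no bus traffic
        holders = [j for j, s in enumerate(self.states) if s is not State.I]
        for j in holders:
            # A remote M copy would be written back first; M and E both degrade to S.
            self.states[j] = State.S
        self.states[cpu] = State.S if holders else State.E

    def write(self, cpu):
        # From E or M no other copies exist; from S or I all other copies are invalidated.
        for j in range(len(self.states)):
            if j != cpu:
                self.states[j] = State.I
        self.states[cpu] = State.M


bus = SnoopingBus(3)
bus.read(0)    # sole copy          -> E I I
bus.read(1)    # second reader      -> S S I
bus.write(2)   # remote write       -> I I M
bus.read(0)    # M holder degrades  -> S I S
print([s.name for s in bus.states])  # ['S', 'I', 'S']
```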
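The abstract does not specify the syntax of these rules. The following Python sketch is only a hypothetical illustration of why declared accesses can remove the need for a consistency model: the `declares` decorator and `run_parallel` function are invented for this example. Each call receives copies of the data it declared as input and returns the data it declared as output, so concurrent calls never share live state.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def declares(reads=(), writes=()):
    """Hypothetical decorator: the programmer states which shared keys a method reads and writes."""
    def wrap(fn):
        fn.reads, fn.writes = frozenset(reads), frozenset(writes)
        return fn
    return wrap


def run_parallel(calls, shared):
    """Run (fn, args) calls concurrently against a plain dict of shared data.

    Each call receives deep copies of only the keys it declared as reads and
    must return a dict with exactly the keys it declared as writes. Because no
    call touches the live store, no coherence protocol is needed; every call
    sees the store as it was before the batch, and outputs are merged at the end.
    """
    all_writes = [k for fn, _ in calls for k in fn.writes]
    if len(all_writes) != len(set(all_writes)):
        raise ValueError("two calls declare writes to the same key")
    outputs = []
    with ThreadPoolExecutor() as pool:
        # Inputs are copied in the calling thread; the store is not modified until all calls finish.
        pending = [(fn, pool.submit(fn, {k: copy.deepcopy(shared[k]) for k in fn.reads}, *args))
                   for fn, args in calls]
        for fn, fut in pending:
            out = fut.result()
            if set(out) != fn.writes:
                raise ValueError(f"{fn.__name__} returned undeclared outputs")
            outputs.append(out)
    for out in outputs:
        shared.update(out)
    return shared


@declares(reads=["xs"], writes=["total"])
def total(inp):
    return {"total": sum(inp["xs"])}


@declares(reads=["xs"], writes=["peak"])
def peak(inp):
    return {"peak": max(inp["xs"])}


store = {"xs": [3, 1, 4, 1, 5]}
run_parallel([(total, ()), (peak, ())], store)
print(store["total"], store["peak"])  # 14 5
```

Rejecting overlapping write sets is just one simple policy chosen for this sketch; a scheduler could instead order conflicting calls one after another.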

Original language	English

Publisher	The Niels Bohr Institute, Faculty of Science, University of Copenhagen
Number of pages	133
Publication status	Published - 2013

Access to Document

Morten Nørgaard Larsen (accepted author manuscript, 1.98 MB)

Cite this

@phdthesis{6c7dd3847ffd437c954eced45bd92bfb,

title = "Parallel Libraries to support High-Level Programming",

author = "Larsen, {Morten N{\o}rgaard}",

year = "2013",

language = "English",

publisher = "The Niels Bohr Institute, Faculty of Science, University of Copenhagen",

}

TY - BOOK

T1 - Parallel Libraries to support High-Level Programming

AU - Larsen, Morten Nørgaard

PY - 2013

Y1 - 2013

UR - https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01009124267&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

M3 - Ph.D. thesis

BT - Parallel Libraries to support High-Level Programming

PB - The Niels Bohr Institute, Faculty of Science, University of Copenhagen

ER -
