Wednesday 10 February 2016

Notes on Concurrency and Distribution: 101

While reading up on distributed systems, I couldn't help noticing the amount of overlap between the ideas of distribution, concurrency and parallelism. In a discussion with some friends, it almost seemed like they meant the same thing. In this post, I'm going to study the background necessary to understand the differences and lay a foundation for understanding the relationship between concurrency and distribution.

Distributed, Concurrent and Parallel

Here's one of my favorite statements to start with:

"the processors in a typical distributed system run concurrently in parallel" - Wikipedia

Interesting? Indeed.

Let's begin with distributed systems: a system consisting of software components located on networked computers. Each component has its own private memory, and the components work towards achieving the system's goal. In working towards a common goal, they need to communicate with each other for various reasons (such as reporting the status of a task, sending an output, or triggering an event), and this communication is facilitated by the network.

Concurrency is the property of a system in which multiple computational tasks are in progress during overlapping periods of time. They may even start and complete within that overlapping period.

On the other hand, a parallel system executes a set of computational tasks simultaneously, i.e., they run at exactly the same time on different execution points. An execution point (or thread of control, as Wikipedia puts it) could be a processor or even a computer as a whole.

Confusion mainly arises between the latter two. A few examples did the trick for me (see the sketch after this list):
  • parallel processes can't be executed on the same execution point, say a single processor, whereas two processes can execute concurrently on the same processor by context switching.
  • a concurrent computation MAY be executed in parallel.
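To make this concrete, here's a minimal Python sketch (my own illustration, not from any source above): the two threads run concurrently on one CPython interpreter, interleaved by context switching, while the two processes may run in parallel on different cores.

    import threading
    import multiprocessing

    def count_down(n):
        while n > 0:
            n -= 1

    if __name__ == "__main__":
        # Concurrent: both threads share one interpreter; CPython's GIL
        # interleaves them, so they overlap in time but never run
        # simultaneously for CPU-bound work.
        t1 = threading.Thread(target=count_down, args=(1000000,))
        t2 = threading.Thread(target=count_down, args=(1000000,))
        t1.start(); t2.start(); t1.join(); t2.join()

        # Parallel: each process gets its own interpreter and can run at
        # exactly the same time on a different execution point (core).
        p1 = multiprocessing.Process(target=count_down, args=(1000000,))
        p2 = multiprocessing.Process(target=count_down, args=(1000000,))
        p1.start(); p2.start(); p1.join(); p2.join()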
I bring in parallelism here just to get the basic understanding of the differences right. We will continue to focus on concurrency and distribution.

A problem with concurrent execution of computational tasks is keeping the outcome deterministic, i.e., if parts of the tasks run in a different order, they must still produce the same outcome. When the processes communicate with each other, the order in which they communicate cannot easily be established, since we don't have control over the order of execution (as in context switching, for example). This causes an explosion in the number of possible execution paths of the system, and we don't know which one will actually be taken - this introduces non-deterministic behavior into the system. We need a framework that allows us to think about and approach communication in a deterministic manner without having to handle too many constraints. What better place to introduce this than the programming language itself? More on this later.
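Here's a tiny sketch of that non-determinism (my example, with the interleaving point made explicit): two threads update a shared counter without synchronization, and the final value depends on the schedule chosen at run time.

    import threading

    counter = 0

    def increment(times):
        global counter
        for _ in range(times):
            tmp = counter   # read
            tmp += 1        # modify
            counter = tmp   # write - another thread may have interleaved
                            # between the read and this write

    if __name__ == "__main__":
        threads = [threading.Thread(target=increment, args=(100000,))
                   for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Expected 200000, but usually less - and a different value on
        # each run: the execution path taken is not under our control.
        print(counter)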

Communication

Communication, in general, can be implemented in different ways.

Communication between concurrent processes on the same machine can be achieved by writing to and reading from the same memory block - called shared memory communication. I find this approach troubling and difficult to handle: the programmer has to deal explicitly with locking, synchronization and making the program thread safe.
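For illustration, here's what that explicit effort looks like in Python (a minimal sketch using the standard library's threading.Lock): every access to the shared block must be guarded, and forgetting a single "with lock:" reintroduces the race shown earlier.

    import threading

    lock = threading.Lock()
    shared = {"counter": 0}   # the shared memory block

    def increment(times):
        for _ in range(times):
            with lock:                   # acquire before touching shared
                shared["counter"] += 1   # memory; released on exit

    if __name__ == "__main__":
        threads = [threading.Thread(target=increment, args=(100000,))
                   for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(shared["counter"])  # deterministically 200000 now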

Another method is message passing: the processes send messages to each other. For example, a message can be sent from one process to another by opening a socket.
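A minimal sketch of message passing over a socket (my illustration, using the standard library's socket.socketpair, which gives two connected endpoints without the setup of a full client and server):

    import socket
    import threading

    def worker(sock):
        msg = sock.recv(1024)           # block until a message arrives
        sock.sendall(b"done: " + msg)   # reply with a message of our own

    if __name__ == "__main__":
        parent, child = socket.socketpair()
        t = threading.Thread(target=worker, args=(child,))
        t.start()
        parent.sendall(b"task-1")       # send a task to the other side...
        print(parent.recv(1024))        # ...and read back b'done: task-1'
        t.join()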

In the case of distributed systems, shared memory doesn't apply, since all the components have their own private memory. That leaves us with passing messages between the components.
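To mimic that setting on one machine (a hedged sketch, with multiprocessing.Queue standing in for the network): each component is a process with its own private memory, and the only way information crosses the boundary is as a message.

    import multiprocessing

    def component(inbox, outbox):
        task = inbox.get()                  # private memory: nothing here is
        result = f"status: {task} done"     # visible to the other component
        outbox.put(result)                  # except what we send as messages

    if __name__ == "__main__":
        inbox, outbox = multiprocessing.Queue(), multiprocessing.Queue()
        p = multiprocessing.Process(target=component, args=(inbox, outbox))
        p.start()
        inbox.put("task-1")
        print(outbox.get())  # status: task-1 done
        p.join()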

In practice, though, while building a distributed system, we often simplify communication by simulating shared memory with a centralized data store and building stateless components. In this case, I think, we avoid the problem rather than solve it. I am not sure such a system can be called "truly distributed": what happens to the system when the components cannot connect to the centralized database due to a network failure?

Such a system trades some fault tolerance for simplicity and may not be suitable for all use cases. It also brings in the complexity of making the components stateless, and hence of having to query the database often - which affects the performance of the system. A centralized registry can be used as a backup store to restore the state of the system after a partial or total failure. But using it as the primary mode of communication doesn't seem to be the best option, for the reasons stated above.

Messaging brings with it a number of requirements, such as a messaging protocol, a message delivery mechanism, component discovery, etc. These are topics we will eventually explore in depth. There are also a number of open questions here: what about run-time safety in distributed systems? Can we identify protocol violations while implementing the system, i.e., before run time? How do we debug errors efficiently?
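As one concrete example of the first requirement, here's a minimal framing protocol (my own sketch, not a standard: 4-byte big-endian length prefixes, so the receiver of a byte stream knows where each message ends):

    import struct

    def frame(payload: bytes) -> bytes:
        # Prefix each message with its length as a 4-byte big-endian int.
        return struct.pack(">I", len(payload)) + payload

    def unframe(stream: bytes):
        # Split one message off the front of the stream; return the rest.
        (length,) = struct.unpack(">I", stream[:4])
        return stream[4:4 + length], stream[4 + length:]

    if __name__ == "__main__":
        stream = frame(b"task-1") + frame(b"task-2")
        first, rest = unframe(stream)
        second, _ = unframe(rest)
        print(first, second)  # b'task-1' b'task-2'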

As far as I can tell, communication (more specifically, message passing) is what brings these two fields together. Being two different fields, they probably have different approaches and terminology for similar problems, both theoretically and practically. It would be fruitful to study both fields and especially their intersection. In the process, I hope to study the relevant formal models and evaluate tools such as Erlang, Lasp lang, Vert.x, Akka, Scribble, Mungo and other similar ones that come up as part of the study or via suggestions.

