Thursday, 11 February 2016

Notes on Concurrency and Distribution: Programming Languages

In my previous posts, I had mentioned programming languages multiple times as the right place to implement solutions to concurrency and distribution. I'm going to elaborate on this a bit more.

A programming language is a tool which provides powerful abstractions to build software systems. Much of our approach while building systems is guided by the languages that we know or use. For example, there is a significant difference between the people who use a functional language vs people who use an imperative language.

Functional programmers seem to have a data driven approach i.e, they think about how the input data can be transformed to generate the desired output. They use functions as black-box entities to transform the data, and the program as a whole is a function. And, by definition, a function must return the same output every time for a given input - this encourages the programmer to write stateless programs i.e, the programmer doesn't use many intermediate data structures which represent the current state of the program's execution.

While on the other hand, imperative programmers approach the problem as a sequence of steps which need to be performed on the given data to generate the desired output. Here, the programmer often achieves this by storing the result of a certain sequence of instructions in an intermediate data structure, which can further be used in the remaining sequence of instructions.

The key idea, is to realize that programming languages, to a large extent, affect the way programmers break down a given problem into sub-problems. More details on this specific topic can be found on this paper.

This is precisely the reason I (and many more people) think a programming language is the most effective way to provide abstractions to effectively build concurrent and distributed systems. But this is also where the problem lies, as most widely used languages are designed primarily for sequential computation and merely "support concurrency" (let's not even get into distribution).

It is not that these languages cannot be used to build distributed systems (people have done that), rather, it is very difficult to effectively build and maintain distributed systems in such languages (the same people will tell you that). It would be great to have the intricacies involved hidden from the programmer and implemented (and hence natively supported) by the language itself. When was the last time you said "Oh, I forgot to deallocate that memory block" ? (I feel sorry for you if you have an immediate answer).

More on the specifics later.

p.s. Yes, Erlang folks, I hear you.

Wednesday, 10 February 2016

Notes on Concurrency and Distribution: 101

While reading up on distributed systems, I couldn't help noticing the amount of overlap between the ideas of distribution, concurrency and parallelism. At a point of discussion with some friends, it almost seemed like they meant the same thing. In this post, I'm going study the background necessary to understand the differences and lay a foundation for understanding the relationship between concurrency and distribution.

Distributed, Concurrent and Parallel

Here's one of my favorite statements to start with:

"the processors in a typical distributed system run concurrently in parallel" - Wikipedia

Interesting? Indeed.

Let's begin with distributed systems: A system consisting of software components located on networked computers. These components have their own individual private memory and work towards achieving the system's goal. When working towards a common goal, there arises a need to communicate with each other for various reasons (like reporting a status of a task execution, sending an output, triggering an event etc.), and this is facilitated by the network.

Concurrency is the property of a system in which multiple computational tasks are in execution state at overlapping periods of time. They may even start and complete within the overlapping time period.

On the other hand, a parallel system executes a set of computational tasks simultaneously i.e, they run exactly at the same time in different execution points. The execution point (or thread of control as Wikipedia puts it) could refer to a processor or even a computer as a whole. 

Confusion mainly arises between the latter two. A few examples did the trick for me:
  • parallel processes can't be executed on the same execution point, say a processor. While, on the other hand, the concurrent execution of two processes can take place on the same processor by context switching.
  • a concurrent computation MAY be executed in parallel.   
I bring in parallelism here just to get the basic understanding of the differences right. We will continue to focus on concurrency and distribution.

A problem with concurrent execution of computational tasks is keeping the outcome deterministic i.e, if parts of the tasks are run in a different order, they must produce the same outcome. When the processes communicate with each other, the order in which they communicate cannot easily be established as we don't have control over the order of execution (in context switching for example), hence there is an explosion in the number possible execution paths of the system and which one exactly isn't known - this introduces non-deterministic behavior in the system. We need a framework that allows us to think about and approach communication in a deterministic manner without having to handle too many constraints. What better place to introduce this than the programming language itself? More on this later.

Communication

Communication, in general, can be implemented in different ways.

Communication between concurrent processes on the same machine can be achieved by writing and reading from the same memory block - called shared memory communication. I find this approach very troubling and difficult to handle. The programmer has to explicitly go through the difficulty of locking, synchronizing and making the program thread safe.

Another method is message passing: the processes send messages to each other. For example, a message can be sent from one process to another by opening a socket.

In the case of distributed systems, shared memory doesn't apply as all the components have their own individual private memory. Which leaves us with passing messages between the components.

Although, in practice, while building a distributed system, we often simplify communication by mocking a shared memory using a centralized data-store and building stateless components. In this case, I think, we avoid the problem, not solve it. I am not sure if such a system can be called "truly distributed". What happens to the system when the components cannot connect to the centralized database due to a network failure?

Such a system trades some fault tolerance for simplicity of the system and may not be suitable for all use cases. Additionally, it brings in the complexity of making the components stateless and hence having to query the database often - which would also affect the performance of the system. A centralized registry can be used as a backup store to restore the state of the system in case of an partial or universal failure. But, using it as the primary mode of communication doesn't seem to be the best option for the reasons stated above.

Along with messaging, arise a number of requirements such as a messaging protocol, message delivery mechanism, component discovery etc. These are topics we will eventually be discovering in depth. There are a number of open questions here: What about run time safety in distributed systems? Can we identify protocol violations while implementing the system i.e, before run time? How do we debug errors efficiently?

As per my current knowledge, it is my understanding that communication (more specifically message passing) is what brings these two fields together. Being two different fields, they would (probably) have different approaches and terminology for a similar problem - both theoretically and practically. It would be fruitful to study both these fields and especially their intersection. In the process I hope to study relevant formal models and evaluate various tools such as Erlang, Lasp lang, Vert.x, Akka, Scribble, Mungo and other similar ones which come up as a part of the study or via suggestions.

References: