Saturday, 22 August 2015

Trial and error is not ugly

Is trial and error a good attitude towards problem solving?

A couple of months into the fast-paced startup environment right after college, I had plenty of questions about solving problems - about what was the right way to do things. In a startup, you are often faced with big problems that have to be solved within a limited time. Which means you often face situations where you have to come up with a solution in less time than it might take you to actually find and understand an existing solution. "But how do you know you are right? What if there is a better way, a more strategic way?" - these were the questions ringing in my head.

For example, in the startup I previously worked for, we were often faced with problems where you had to classify data into several categories - more or less a spam problem. A naive approach to solving this was simply setting thresholds on the values of specific features. We adopted this approach because it got things working quickly. We often experimented with these values to try to eliminate false positives. We had no idea what was right or wrong. Very often there were false positives, and we would simply tweak the specific value further to get things working. I found this very disturbing.
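A minimal sketch of what that threshold-based approach looked like - the feature names and cut-off values here are invented for illustration, not the ones we actually used:

```python
# Hypothetical spam-style classifier built purely on hand-tuned thresholds.
# Feature names and cut-offs below are illustrative, not from the real system.
THRESHOLDS = {
    "link_count": 5,         # more than 5 links looks spammy
    "caps_ratio": 0.6,       # mostly-uppercase text looks spammy
    "exclamation_count": 3,
}

def is_spam(features):
    # Flag the item if any single feature crosses its hand-picked threshold.
    return any(features.get(name, 0) > cutoff
               for name, cutoff in THRESHOLDS.items())

print(is_spam({"link_count": 9, "caps_ratio": 0.1}))   # True
print(is_spam({"link_count": 2, "caps_ratio": 0.2}))   # False
```

Tweaking a value to fix a false positive meant editing one number in that table and hoping nothing else broke - which is exactly why it felt so unsatisfying.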

Sure, there were more formal methods to solve this problem, like linear regression or Support Vector Machines, which could provide better accuracy, but even trying out these models required making heavy assumptions about the data. You had to choose the feature vectors carefully. How do you know which was the correct set of features? Trial and error (or prior experience of trial and error, of course). And this requires time - which you will never have in some startups. But that wasn't the main reason I didn't appreciate this approach - I didn't like trial and error. I wanted to thoroughly study the background and identify the correct model for the problem. Which, as it sounds, is very expensive.

There were many similar problems too - Do daily stand-ups help? Are design meetings worth the time? Do we need to document our infrastructure? What were the right timeouts to set on the HTTP clients? What is the advantage of using exponential backoff?

And advice poured in from many directions. If you are a startup kid, you get a lot of advice from the tech community outside (like parents advising their kids). Not all of it was useless. But the one thing that was obvious from all the inputs is that "it depends". There were too many factors involved. We couldn't directly download a solution as if it were a library and simply set the context so that it would work.

If you tried to generalize a solution for such problems, it looked like trial and error was the only thing that worked. This, again, was disturbing. "That is so not the right way", I thought.

So here is what, (I think) I failed to realize back then:

The key idea here is that with every trial, you find errors in the assumptions you had previously made. In other words, you understand the problem better. It helps you define the problem more clearly.

It was probably in college that I picked up the idea of trial and error being ugly. Problems in real life are different from problems in the academic space. In the academic space, problems are well defined. In real life, they aren't. You often have to figure them out or define them yourself. There is no standard in which real-life problems come well defined. In such a scenario, I think trial and error is a very good methodology for approaching problem solving.

I am not saying trial and error is better than more strategic ways. I am simply saying trial and error is a good option for defining problems clearly. After that, more strategic and established solutions can do a great deal to solve the problem more quickly than trial and error would. So it wouldn't be fair to rule out trial and error from the "solution tool set". Because, as some of my mentors have often mentioned, the first step to solving a problem is to ensure that it is well defined.

Similar thoughts:

Tim Harford's TED talk "Trial, Error and the God Complex" is along similar lines.

Saturday, 25 July 2015

Formalizing problems

This is a TIL post. The problem was disguised as a simple inequality exercise in Norman L. Biggs' Discrete Mathematics book.

So here's the deal:

Problem: Prove that, in a square matrix, the least of the greatest-valued members of each row is greater than or equal to the greatest of the least-valued members of each column. (For convenience, I have assumed the matrix consists of only natural numbers.)

Formalizing: 

With some help from math.stackexchange.com, I was able to formalize this as:

Prove that Min_y(Max_x f(x,y)) >= Max_x(Min_y f(x,y)), where f(x,y) is the value in the matrix at row x and column y. (Since the statement is symmetric in the two indices, it doesn't matter whether rows or columns get the inner Max.)

Proof:

Let's break it down. From the general definitions of Min and Max, we know that for any x and y:
f(x,y) >= Min_y f(x,y)
and
Max_x f(x,y) >= f(x,y) - (I)

So it boils down to comparing the two sides directly.

Let y0 be a column at which the outer Min on the left-hand side is attained, i.e.,
Min_y(Max_x f(x,y)) = Max_x f(x,y0) - (II)

Let x0 be a row at which the outer Max on the right-hand side is attained, i.e.,
Max_x(Min_y f(x,y)) = Min_y f(x0,y) - (III)

Now apply (I) to the single entry f(x0,y0). The greatest value in column y0 is at least as large as any member of that column, so:
Max_x f(x,y0) >= f(x0,y0)
And any member of row x0 is at least as large as the least value in that row, so:
f(x0,y0) >= Min_y f(x0,y)

Chaining the two inequalities:
Max_x f(x,y0) >= Min_y f(x0,y)

which, by (II) and (III), is exactly:

Min_y(Max_x f(x,y)) >= Max_x(Min_y f(x,y))
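The inequality is also easy to sanity-check numerically. A quick sketch in Python (my own code, not from the book) that compares the two sides on random matrices, with x indexing rows and y indexing columns as above:

```python
import random

def min_of_maxes(m):
    # Min over columns of (max over rows): Min_y(Max_x f(x, y))
    cols = list(zip(*m))
    return min(max(col) for col in cols)

def max_of_mins(m):
    # Max over rows of (min over columns): Max_x(Min_y f(x, y))
    return max(min(row) for row in m)

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 6)
    m = [[random.randint(1, 100) for _ in range(n)] for _ in range(n)]
    assert min_of_maxes(m) >= max_of_mins(m)
print("inequality held on all random matrices")
```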

Summary

Although it took me a while to understand this problem, it took me much less time to identify the mental route to understanding it (commonly conveyed as "connecting the dots") once it was formalized and expressed as a statement. So the formalization process was worth it, because it helped me clearly understand the problem I was trying to solve. I have not quite been a fan of formal mathematical representation in the past, but I guess that just might change in the future :)

Another take-away realization from today: there is a lot of meaning in "break down problems before you try to solve them". This is just a simple example.

Sunday, 10 May 2015

Misconception about OAuth and the ignorance of history

I feel the usage of OAuth is often misunderstood. It is interpreted as a mechanism which can be used to secure access to data behind some web API. This is often incorrectly extended to an understanding of OAuth as an authentication protocol.

OAuth is an authorization protocol. It was designed to allow a user to authorize third-party applications to access her resources without having to share her credentials with the third-party application. Plenty of documentation and posts already exist online which stress the same aspect.

I wasn't born knowing this. I just happened to give the "History" section of the documentation the same importance as the structure and protocol workflow. The history of a piece of software, a programming language, a standard, or any solution is as important as the solution itself. It helps one understand the actual problem it solved by coming into existence. This understanding goes a great way in assessing the usage or application of a solution for a specific problem.

In my opinion, there is no better way to avoid mistakes when solving a problem than understanding the previous attempts to solve it.

Thursday, 7 May 2015

Why you should use cURL + firebug to scrape

One could always go about web scraping by using any of the n-number of libraries available online. But if you want to master the art of web scraping, I would recommend you try fetching content from websites using cURL and examining each website's communication structure with Firebug on. Although it might seem extremely low level and unnecessary, it has multiple advantages from an understanding perspective.

The biggest advantage that I can think of is the clarity you get in your thought process about web scraping. When you begin observing Firebug's "Net" console to examine a website, you start understanding that the process is actually very simple. Be it AJAX, XHR, JSON or whatever fancy higher-level jargon/abstraction there is in the communication between the client and the server, it all boils down to GET and POST HTTP methods. No matter how unreadable the obfuscated JavaScript code is, it won't bother you, because you will be able to predict exactly how the website gets the content to your browser.

I am not saying you should not use high-level libraries to scrape. I am only suggesting you use cURL and Firebug till you understand that it is practically possible to scrape content from ANY website. Also, in the process you learn a lot about HTTP requests, responses, headers, cookies and, in general, a lot about the way websites and web applications work.

Oh, and did I forget to mention? If you own a website and are tired of bots hitting it, I suggest you attempt to scrape your own website using the same combo in order to understand how others are doing it. Yes, ethical scraping is an actual thing!

Happy scraping :)

Wednesday, 1 April 2015

Filtering and identifying IP addresses uniquely

While developing cloud services or web applications, one might often want to filter out requests from specific IP addresses like 127.0.0.1 (localhost), 192.168.0.0 (private network), etc. While blocking these IP addresses at the firewall level is an obvious choice, sometimes one might want to handle them at the application layer.

This article discusses a clean solution for identifying such IP addresses. These IP addresses belong to ranges of reserved IP addresses. I am going to assume that you want to identify some of the address blocks on this wiki.

A naive approach is to examine the IP address string extracted from the HTTP request and check if it has a prefix that corresponds to a reserved CIDR block. Example: suppose you want to filter all the requests hitting your servers from 127.0.0.0/8 and 10.0.0.0/8. This can simply be done by checking whether the incoming HTTP request originates from an IP address string which starts with "127." or "10." .

While that would work, this approach becomes a problem when you want to filter 172.16.0.0/12, as the number of string comparisons drastically increases in this case (16 prefixes ranging from "172.16." to "172.31.").

Well, here's an approach to avoid this problem:

The IPv4 address space ranges from 0.0.0.0 to 255.255.255.255 or in other words, it can be treated as a number with base 256 after eliminating the '.' delimiters. When treated as a number, the problem simplifies greatly as the only thing to do now is to check whether the given number in base 256 (the IP address) falls within a set of ranges (or reserved IP address blocks) which are in base 256 as well.

I am not going to assume your programming language of choice, but I will assume you are accustomed to the decimal number system (base 10). So I am taking the approach of converting the IP address and the reserved IP address blocks from base 256 to their corresponding representations in base 10.

Example:
Assume the IP address to be examined is a.b.c.d. We want to check if it belongs to 127.0.0.0/8, 10.0.0.0/8 or 172.16.0.0/12.

Step 1 - Convert the incoming IP address to base 10.
  • Base 10 representation of a.b.c.d is:
    a*256^3 + b*256^2 + c*256^1 + d*256^0
  • Let this be equal to x.
    x = a*256^3 + b*256^2 + c*256^1 + d*256^0

Step 2 - Check if x (base 10) falls within the range of the IP address blocks' corresponding base 10 representation
  • 127.0.0.0 - 127.255.255.255 (127.0.0.0/8) -> 2130706432 - 2147483647

    x >= 2130706432 && x <= 2147483647
  • 10.0.0.0 - 10.255.255.255 (10.0.0.0/8) -> 167772160 - 184549375

    x >= 167772160 && x <= 184549375
  • 172.16.0.0 - 172.31.255.255 (172.16.0.0/12) -> 2886729728 - 2887778303

    x >= 2886729728 && x <= 2887778303
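Putting the two steps together, here is a small sketch in Python (the function and constant names are my own):

```python
# Reserved blocks from the steps above, as (first address, last address) pairs.
RESERVED_RANGES = [
    ("127.0.0.0", "127.255.255.255"),   # 127.0.0.0/8 (loopback)
    ("10.0.0.0", "10.255.255.255"),     # 10.0.0.0/8 (private)
    ("172.16.0.0", "172.31.255.255"),   # 172.16.0.0/12 (private)
]

def ip_to_int(ip):
    """Treat a dotted-quad IPv4 address as a base-256 number and
    return its base-10 value: a*256^3 + b*256^2 + c*256 + d."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

# Pre-compute the numeric bounds once.
_BOUNDS = [(ip_to_int(lo), ip_to_int(hi)) for lo, hi in RESERVED_RANGES]

def is_reserved(ip):
    """Check whether the address falls inside any reserved block."""
    x = ip_to_int(ip)
    return any(lo <= x <= hi for lo, hi in _BOUNDS)

print(ip_to_int("127.0.0.1"))      # 2130706433
print(is_reserved("10.20.30.40"))  # True
print(is_reserved("8.8.8.8"))      # False
```

Pre-computing the numeric bounds keeps the per-request work down to one conversion and a handful of integer comparisons - no string-prefix juggling.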
Also, this base conversion approach can be used to represent an IPv4 address uniquely as a single decimal number; note that x always fits within a 32-bit unsigned integer, since the IPv4 address space has exactly 256^4 = 2^32 addresses. If you still feel the numbers are too big to be easily managed in your system, applying the logarithmic function on x could be useful, i.e., using log(x) as the IP's unique identifier instead of x - though be mindful of floating-point precision if you do.

Wednesday, 4 March 2015

Demystifying static

I had originally started writing a post with the intention of disambiguating a confusion involving static variables, static methods and their behavior in multi-threaded environments in Java. But in the process, I realized that the static keyword needs a lot more explanation than I intended to provide. On a hunt for a reference to take care of the same, most of the search results which turned up on Google didn't satisfy me. They only succeeded in highlighting the behavior or use of the static keyword in a very language-specific fashion. I wanted to focus on highlighting the meaning and nature of static. Hence, a separate post!

Completely understanding static can be tricky, as its language-specific definition is often overloaded. But its purpose in a language remains more or less the same. Understanding this will make most of the semantic rules involving static intuitively derivable.

DISCLAIMER: This post will attempt to remain language independent and hence all of the below stated characteristics may not be true for the static keyword of all languages.


So, What is static?

 

Binding is the association of an attribute with an entity [1]. For example, when a memory block is associated with an identifier, it is called name binding [2]. Another example is type binding - the association of a variable with its type. A binding can happen at any phase of the execution of a program, such as compile time or run time.

 static indicates a nature of binding called static binding. When a binding takes place during compile time, it can be referred to as static. Hence, declaring a variable as static can be interpreted as instructing the compiler to perform the bindings associated with the variable during compile time. 

By definition, a static binding cannot be changed during the execution of a program post compile time; in other words, a static binding cannot be changed during run time. Bindings which occur during run time are called dynamic bindings. Dynamic bindings are beyond the scope of this post.


Static type binding

If the type of a variable is resolved and bound during compile time and cannot be changed post compile time, then the type binding is said to be static.

Static storage binding  

If memory is allocated to a variable during compile time and the variable's identifier remains bound to the allocated memory, then the storage binding is said to be static. [3] The memory cells bound to the identifier do not change during the entire life of the program's execution.


The lifetime of a variable is the time during which the variable is bound to a particular address. [6] In simpler words, the lifetime of a variable is the time for which the variable is alive. Note that this is not to be confused with "scope". The scope of a variable indicates its visibility. It is possible for a variable to be alive yet inaccessible - or, in more technical terms, it is possible for a variable to maintain its state while its visibility is delimited by its scope. E.g.: private static field variables in Java.
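The "alive but inaccessible" situation can be demonstrated even without static. A small sketch in Python (my own illustration - Python has no static keyword, but closures exhibit the same lifetime-versus-scope split):

```python
def make_counter():
    count = 0  # bound when make_counter runs, and stays alive afterwards

    def increment():
        nonlocal count
        count += 1
        return count

    return increment

counter = make_counter()
print(counter())  # 1
print(counter())  # 2
# 'count' has maintained its state between the two calls, i.e. it is
# still alive - yet it is out of scope here, so this would fail:
# print(count)  # NameError: name 'count' is not defined
```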

Static lifetime

A static variable is "alive" throughout the entire execution of the program. 


Well, Why static?


Static can be helpful in many ways because of its early binding time. Early binding implies:
  • High type safety is guaranteed as the type does not change [4]
  • Run time efficiency increases as type resolution is done during compile time [4]
  • High memory safety is guaranteed as memory allocation is done during compile time and is never changed
If you know the basics of compiler and language theory, you would understand the seriousness of type safety and memory safety. And as for memory safety, I think a little programming in C would do it too!


Summary:

Constraints in a language aren't intended to make your life difficult. In fact, they are results of the decisions taken during the design of the programming language to actually make your life better. Ever wondered why a break is mandatory in every case block of the switch-case construct?
 
References:
[1] http://courses.cs.vt.edu/~cs3304/Spring04/notes/Chapter-4/tsld008.htm
[2] http://en.wikipedia.org/wiki/Name_binding
[3] http://groups.engin.umd.umich.edu/CIS/course.des/cis400/maxim/lectures/chp4.htm
[4] http://courses.cs.vt.edu/~cs3304/Spring04/notes/Chapter-4/tsld009.htm
[5] http://docs.oracle.com/javase/specs/jls/se7/html/jls-8.html#jls-8.3.1.1
[6] http://users.dickinson.edu/~wahlst/356/ch5.pdf

Monday, 5 January 2015