Sunday 10 May 2015

Misconceptions about OAuth and the ignorance of history

I feel the purpose of OAuth is often misunderstood. It is interpreted as a mechanism for securing access to data behind some web API, and this is often incorrectly extended into an understanding of OAuth as an authentication protocol.

OAuth is an authorization protocol. It was designed to let a user authorize third-party applications to access her resources without having to share her credentials with those applications. Plenty of documentation and posts already exist online that stress the same point.
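To make the distinction concrete, here is roughly what the token-exchange step of a standard OAuth 2.0 authorization code flow looks like on the wire, done with nothing but cURL. The endpoint and credentials below are made-up placeholders, not any real provider's API:

    # Exchange the temporary authorization code (handed out after the
    # user consented) for an access token. All values are hypothetical.
    curl -X POST https://provider.example.com/oauth/token \
         -d "grant_type=authorization_code" \
         -d "code=AUTH_CODE_FROM_REDIRECT" \
         -d "client_id=YOUR_CLIENT_ID" \
         -d "client_secret=YOUR_CLIENT_SECRET" \
         -d "redirect_uri=https://yourapp.example.com/callback"

Notice that the user's password appears nowhere in this exchange. The third party only ever holds a revocable token, which is precisely the problem OAuth came into existence to solve.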

I wasn't born knowing this. I just happened to give the "History" section of the documentation the same importance I gave to the structure and the protocol workflow. The history of a piece of software, a programming language, a standard, or any solution is as important as the solution itself. It helps one understand the actual problem the solution came into existence to solve, and that understanding goes a long way in assessing whether the solution fits a specific problem.

In my opinion, there is no better way to avoid repeating mistakes when solving a problem than understanding the previous attempts to solve it.

Thursday 7 May 2015

Why you should use cURL + Firebug to scrape

One could always go about web scraping using any of the numerous libraries available online. But if you want to master the art of web scraping, I recommend you try fetching content from websites using cURL while examining the website's communication structure with Firebug open. Although it might seem extremely low level and unnecessary, it has multiple advantages from an understanding perspective. A first experiment can be as simple as the sketch below.
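The URL here is just a placeholder; substitute any page you are curious about:

    # Fetch a page verbosely: -v prints the request and response headers,
    # the same traffic you would see in Firebug's Net panel.
    curl -v https://www.example.com/some/page

    # Many sites vary their response by User-Agent; -A sends the same
    # header your browser would send.
    curl -A "Mozilla/5.0 (X11; Linux x86_64)" https://www.example.com/some/page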

The biggest advantage I can think of is the clarity you gain in your thought process about web scraping. When you begin observing Firebug's "Net" panel to examine a website, you start understanding that the process is actually very simple. Be it AJAX, XMLHttpRequest, JSON, or whatever other fancy high-level jargon/abstraction there is in the communication between the client and the server, it all boils down to the GET and POST HTTP methods. No matter how unreadable the obfuscated JavaScript code is, it won't bother you, because you will be able to predict exactly how the website gets its content into your browser.
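For instance, once you spot an AJAX call in the Net panel, you can usually replay it by copying its method, URL, and parameters into cURL. The endpoint and form fields below are invented for illustration:

    # Replay an observed GET request for JSON data.
    curl "https://www.example.com/api/items?page=2"

    # Replay an observed POST request: -d sends the same form fields
    # the page's JavaScript would have submitted.
    curl -d "query=shoes" -d "sort=price" https://www.example.com/api/search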

I am not saying you should not use high-level libraries to scrape; I am only suggesting you use cURL and Firebug until you understand that it is practically possible to scrape content from ANY website. In the process, you also learn a lot about HTTP requests, responses, headers, and cookies, and in general about the way websites and web applications work.
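Cookies and sessions, for example, stop looking magical once you have handled them by hand. A hypothetical login-then-fetch session, with made-up form fields and URLs, might look like this:

    # Log in and store the session cookie the server sets
    # (-c writes the cookies to a file, the "cookie jar").
    curl -c cookies.txt -d "username=me" -d "password=secret" \
         https://www.example.com/login

    # Send the stored cookies back on later requests (-b reads the jar),
    # just as the browser does on every page load after login.
    curl -b cookies.txt https://www.example.com/account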

Oh, and did I forget to mention? If you own a website and are tired of bots hitting it, I suggest you attempt to scrape it yourself using the same combo, in order to understand how others are doing it. Yes, ethical scraping is an actual thing!

Happy scraping :)