You can always go about web scraping with any of the many libraries available online. But if you want to master the art of web scraping, I would recommend fetching content from websites with cURL while examining the website's communication with Firebug switched on. It might seem extremely low level and unnecessary, but it has several advantages when it comes to understanding what is going on.
The biggest advantage I can think of is the clarity it brings to the way you think about web scraping. When you start watching Firebug's "Net" panel while examining a website, you realise that the process is actually very simple. Be it AJAX, XHR, JSON or whatever fancy higher-level jargon or abstraction sits in the communication between the client and the server, it all boils down to plain HTTP requests, nearly always GET and POST. No matter how unreadable the obfuscated JavaScript is, it won't bother you, because you will be able to predict exactly how the website gets its content to your browser.
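To make the "it is all GET and POST" point concrete, here is a minimal sketch in Python's standard library. Everything in it is made up for illustration: the `DemoHandler` server, the `/items` and `/search` paths and the `fetch` helper stand in for a real website, so the example runs without touching anyone's site. It sends a plain GET, the same GET with the `X-Requested-With` header that many JavaScript libraries attach to "AJAX" calls, and a form POST.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site's backend; not a real website."""

    def _send_json(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Report the method, the path, and whether the request claimed to be "AJAX".
        self._send_json({
            "method": "GET",
            "path": self.path,
            "ajax": self.headers.get("X-Requested-With") == "XMLHttpRequest",
        })

    def do_POST(self):
        # Read exactly as many bytes as the client announced, then echo them back.
        length = int(self.headers.get("Content-Length", 0))
        form = self.rfile.read(length).decode("utf-8")
        self._send_json({"method": "POST", "path": self.path, "form": form})

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Ignore any system proxy so requests go straight to the local server.
opener = build_opener(ProxyHandler({}))


def fetch(url, data=None, headers=None):
    """Send a GET (data is None) or a POST (data is bytes) and decode the JSON reply."""
    request = Request(url, data=data, headers=headers or {})
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

plain_get = fetch(base + "/items?page=2")
ajax_get = fetch(base + "/items?page=2", headers={"X-Requested-With": "XMLHttpRequest"})
form_post = fetch(base + "/search", data=b"q=curl")

print(plain_get)
print(ajax_get)
print(form_post)

server.shutdown()
server.server_close()
```

The only difference between the "normal" request and the "AJAX" one is a single header. With cURL the same three requests would be `curl URL`, `curl -H 'X-Requested-With: XMLHttpRequest' URL` and `curl -d 'q=curl' URL`.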
I am not saying you should not use high-level libraries to scrape. I am only suggesting you use cURL and Firebug until you understand that it is practically possible to scrape content from ANY website. In the process you also learn a lot about HTTP requests, responses, headers and cookies, and in general about the way websites and web applications work.
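Here is a rough sketch of the kind of thing you pick up about headers and cookies along the way. It is again a made-up local server (`LoginHandler`, the `/login` and `/profile` paths and the `session=abc123` value are all invented for the demo), and the cookie is handled by hand on purpose, so that you can see it is nothing more than a `Set-Cookie` response header that you send back as a `Cookie` request header.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, Request, build_opener

SESSION = "session=abc123"  # made-up cookie value for the demo


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST hands out a cookie, GET demands it."""

    def _reply(self, status, text, extra_headers=()):
        body = text.encode("utf-8")
        self.send_response(status)
        for name, value in extra_headers:
            self.send_header(name, value)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Drain the form body, then hand out the session cookie.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self._reply(200, "logged in", [("Set-Cookie", SESSION + "; Path=/")])

    def do_GET(self):
        # Serve the page only when the request carries the expected cookie.
        if self.headers.get("Cookie") == SESSION:
            self._reply(200, "secret page")
        else:
            self._reply(403, "please log in")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def get_status_and_body(opener, url, headers=None):
    """GET a URL and return (status code, body text), including for error statuses."""
    try:
        with opener.open(Request(url, headers=headers or {})) as response:
            return response.status, response.read().decode("utf-8")
    except HTTPError as error:
        return error.code, error.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
opener = build_opener(ProxyHandler({}))  # bypass any system proxy

# 1. Without a cookie the page is refused.
before = get_status_and_body(opener, base + "/profile")

# 2. "Log in" with a form POST and read the Set-Cookie response header.
with opener.open(Request(base + "/login", data=b"user=demo&password=demo")) as response:
    set_cookie = response.headers["Set-Cookie"]
cookie = set_cookie.split(";", 1)[0]  # keep "name=value", drop attributes like Path

# 3. Send the cookie back as a request header and the page opens up.
after = get_status_and_body(opener, base + "/profile", {"Cookie": cookie})

print(before)
print(set_cookie)
print(after)

server.shutdown()
server.server_close()
```

Real sites add expiry dates, several cookies, CSRF tokens and so on, but the mechanism you see in the Net panel is the same one: a header comes back, and you send it along with the next request.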
Oh, and did I forget to mention? If you own a website and are tired of bots hitting it, I suggest you try scraping your own website with the same combo in order to understand how others are doing it. Yes, ethical scraping is an actual thing!
Happy scraping :)