Google Custom Search API

Quite recently I needed to find a way to extract a lot of information from the World Wide Web. I was working on my master's final paper and needed a large number of files in SQL format so that I could process them and extract information.

The problem was how to get that much data. After searching a bit I found the answer: the Google Custom Search API.

I searched for examples of how to use the API, but unfortunately most were incomplete or used deprecated versions of the API. You can also parse the output in JSON format, but in the newer versions of the API you have to implement a JSON factory to map it to Java code. I wanted something much faster and a little more flexible, since I only wanted to extract the URLs and the next page from the output JSON.

To start using the search API, you first need to secure some prerequisites: an API key from Google and a configured custom search engine, both of which go into the search URL.

In my search query I was looking for filetype:sql, meaning all results should be files in SQL format.
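A minimal sketch of how such a search URL could be assembled is shown here. The class and method names are hypothetical, and `apiKey` and `engineId` stand in for your own credentials; `key`, `cx`, `q`, `start` and `alt` are query parameters of the Custom Search JSON API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {
    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // apiKey and engineId are placeholders for your own credentials;
    // start is the 1-based index of the first result on the requested page.
    static String build(String apiKey, String engineId, String query, int start) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query)
                + "&start=" + start
                + "&alt=json";
    }

    // URL-encodes a single parameter value, e.g. the ':' in "filetype:sql".
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

Calling `SearchUrlBuilder.build("KEY", "CX", "filetype:sql", 1)` yields a URL whose query part contains `q=filetype%3Asql`, because the colon has to be escaped.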

The next step is to open an HTTP connection to the search URL.

We open a connection to the search URL and tell the connection to expect results in JSON format. Since the Google Search API follows the REST style, we must also specify the HTTP method to use, in this case GET.
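The step just described might look roughly like the sketch below. It uses the standard `HttpURLConnection`; the class and method names are made up for illustration, and nothing is sent over the network until the response is actually read.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class SearchConnection {
    // Prepares (but does not yet send) a GET request that asks for JSON.
    static HttpURLConnection open(String searchUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(searchUrl).toURL().openConnection();
        conn.setRequestMethod("GET");                          // REST method
        conn.setRequestProperty("Accept", "application/json"); // expected format
        return conn;
    }
}
```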

Once the connection is open, the next step is to establish a buffered data stream and start processing the data.
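One way to set up the buffered stream is sketched below. The helper name is hypothetical, and UTF-8 is assumed as the response encoding.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ResponseReader {
    // Wraps the raw response stream so it can be read one line at a time.
    static BufferedReader buffer(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

With an open `HttpURLConnection` named `conn`, the call would be `ResponseReader.buffer(conn.getInputStream())`.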

By this point we have the result of the search query buffered. The next step is to extract the information we need from the buffered input stream. Since we won't be using a JSON factory, we need to process the output line by line.

What I did here was parse the stream line by line and search for the two groups that matter to me: the “totalResults” and “link” groups in the output JSON. The first is needed to compute the page I am on, and the second to extract the information I need. At the end of the second if, I add the link to a list of strings.
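The line-by-line approach described above could be sketched as follows. This is not the original snippet: the class and field names are hypothetical, and the code assumes the response is pretty-printed with one field per line and that “totalResults” is delivered as a quoted string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class LinkExtractor {
    final List<String> links = new ArrayList<>();
    long totalResults = -1; // stays -1 until a "totalResults" line is seen

    // Scans pretty-printed JSON line by line instead of building an object model.
    void parse(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"totalResults\":")) {
                totalResults = Long.parseLong(quotedValue(trimmed));
            }
            if (trimmed.startsWith("\"link\":")) {
                links.add(quotedValue(trimmed));
            }
        }
    }

    // Returns the text between the first quote after the colon and the last quote.
    private static String quotedValue(String trimmed) {
        int open = trimmed.indexOf('"', trimmed.indexOf(':'));
        int close = trimmed.lastIndexOf('"');
        return trimmed.substring(open + 1, close);
    }
}
```

This skips a full JSON parser at the price of robustness: if the output were returned on a single line, or a value contained an escaped quote, the matching would fail.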

If you want to go further than the first page of search results, you need to wrap the code above in a method so that it can be called again for each following page.

After wrapping the code in a method, you need a step that increments a counter by 10 items and then calls the method recursively. Google only allows 10 items per page of search results, so each page costs one API call; extracting 10 pages of information takes 10 of the 100 API calls that the free tier allows per day.
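The recursive paging can be illustrated with the sketch below. It is an assumption about how the pieces fit together, not the original code: `fetchPage` is a hypothetical hook that performs one search request for a given start index and returns the links it found, and `totalResults` is passed in by the caller.

```java
import java.util.List;
import java.util.function.IntFunction;

class Pager {
    static final int PAGE_SIZE = 10;    // items per page of results
    static final int MAX_RESULTS = 100; // stop after 10 pages, as in the text

    // Collects links page by page, starting at the 1-based index 'start'.
    static void collect(int start, long totalResults,
                        IntFunction<List<String>> fetchPage, List<String> out) {
        if (start > MAX_RESULTS || start > totalResults) {
            return; // no further pages to request
        }
        List<String> page = fetchPage.apply(start);
        if (page.isEmpty()) {
            return; // the search ran dry earlier than expected
        }
        out.addAll(page);
        collect(start + PAGE_SIZE, totalResults, fetchPage, out); // next page
    }
}
```

Starting at 1, the start index takes the values 1, 11, 21 and so on up to 91, which amounts to at most 10 requests per search.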

To test the code above, run it with your own API key and search engine ID and print the resulting list.

This was the final step. Normally, if you properly configure a custom search engine and get an API key from Google, the code above should return a list of strings containing the URLs from the results.

Thank you, and I hope it helps you.