Interactively Analyze Big Data in the Cloud using Google BigQuery
Organizations and businesses today live and breathe data. Data is being used for a variety of different purposes such as understanding and predicting, user traffic and trends, improve supply chain management, better product marketing, sentiment analysis, and the list pretty much goes on and on and on.
Google is undoubtedly one company that has massive sets of data in the form of webpage indices, advertising data, map and street view images and traffic information, and hundreds of other things. Having built a business on storing, analyzing and understanding Big Data, Google realized it should take advantage of its expertise in building the infrastructure to handle those petabytes of information to offer a data analytics service in its cloud.
Google BigQuery is this offering, made publicly available in May, 2012. It was being discussed for some time since 2010 after a Google Research Blog mentioned it, and then a limited preview in November 2011, before being launched publicly.
What is BigQuery?
As answered on Google BigQuery website: “Google BigQuery is a web service that lets you do interactive analysis of massive datasets—up to billions of rows. Scalable and easy to use, BigQuery lets developers and businesses tap into powerful data analytics on demand”.
Offered as a web service, BigQuery is basically a Platform-as-a-Service solution. Meaning, it runs on Google infrastructure, on thousands of servers, and you really don’t have to worry about troubleshooting machines, making sure that services are up and running, or scaling. All you have to do is—well—query.
How does it work?
You can use BigQuery through a web interface called the BigQuery browser tool, a Python-based bq command-line tool, or by making calls to the REST API using various client libraries in multiple languages, such as Java, Python, etc.
Using BigQuery is a three-step process: upload, process, and act. The first part is uploading data into the Google cloud. Second, importing data into a BigQuery Table. Importing transforms your data in a format that is efficient to run queries on so that you don’t need to worry about indexes, sharding your data, or partitioning your data in ways that are efficient for access. If you have a csv file ready containing all your data, this is actually a single step. As in the screenshot below, Google BigQuery lets you upload and import data in one step.
Once you have imported the data, you can query on it in whichever way you like. BigQuery uses an SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries without any programming. To help form queries, it has a bunch of functions, mathematical functions, time functions, functions to translate data into human readable formats, or to extract parts of your data that you need to group by or order by.
One additional thing that BigQuery provides is statistical versions of certain functions. When you’re dealing with very large datasets, sometimes it’s useful to trade a little accuracy for a considerable gain in speed. Implying this trade off, it provides a few statistical operators that allow you to do common operations like top most, frequent, or count distinct, etc.
The API is a standard RESTful API. There is an endpoint that allows you to find out what the schema of your table is with–once the data has been imported. So the API will come back with a list of fields and their types, and then, endpoint for queries. You pass it to your query, referring to your tables in the SQL statement, and voila– have your result set. You can then save this resultset as an independent table in BigQuery or download as a CSV file.
The API is protected with the standard Google authentication mechanisms, accepting all the common Google credentials, uses HTTPS, and exposes the tables as objects in Google Storage for developers’ namespace. Tables are created with private access so that only you can access and query those tables. But you can add other owners or other people who can access and manage or query your data as you deem fit for your application.
The API can be embedded in different places. That is how you have the browser tool and the command-line python tool.
Behind the scenes
Google BigQuery is based on another tool called Dremel. Running across thousands of servers, it lets you query large amounts of data, such as a collection of web documents or a library of digital books or even the data describing millions of spam messages. This is akin to analyzing a traditional database using SQL that has been successfully and prevalently used for decades. The difference is that Dremel can handle web-sized amounts of data at blazing fast speed. According to this paper by Google, you can run queries on multiple petabytes in a matter of seconds.
Hadoop already provides tools for running SQL-like queries on large datasets. Sister projects such as Pig and Hive were built for this very reason. The difference here is that with Hadoop, there’s lag time and that it is a batch processing platform that may take a few minutes to few hours to run the task. But Dremel was specifically designed for instant queries.
Integration with other Google products
BigQuery is also easily integrated with other Google products such as Google App Engine and also Google Spreadsheets. You can insert a BigQuery query job and populate the spreadsheet with the result data, effectively inserting multi-billion rows in a Spreadsheet and share it with others. You can also visualize BigQuery results in a chart, or schedule daily query result updates, as this tutorial explains.
Developers and businesses can sign up for BigQuery online and query up to 100 GB of data per month for free. After that, there is a simple pricing plan for storing and querying datasets: $0.12 per GB/month up to 2TB storage, and $0.035 per GB processed in 20,000 total queries per day. And you’re charged only for the data processed in a column of the data, not for the entire table. If you need more than that, you may have to contact a sales representative.
Getting started and beyond
Google BigQuery has exhaustive documentation and tutorials. You have the BigQuery browser tutorial and the bq command line tool tutorial, a query reference to learn some of the intricacies of making queries, a developer’s guide, and the latest API reference. It also has a great community for developer support and reading about new features, and to connect with other users.
There are a few sample databases on the browser tool, to help you practice and get used to BigQuery before you decide to crunch all that data and upload it to Query. One of the tables is a dump from Wikipedia, housing 314 million rows in 35.7GB of disk space. A typical count query for me took 3.2 seconds, and a Select query just 4.7.
What’s most interesting about BigQuery is the fact that it provides Big Data analytics in a completely hosted offering. Organizations don’t have to procure and set up hardware for a Big Data infrastructure. They don’t have to worry about setting up Hadoop or any other software. It’s Big Data available instantly, and at a fairly affordable price.
As an off-the-shelf service, it could be a bit less flexible than what developers could get out of a tool built with Hadoop, Hive, etc. However, it is likely to be effective for quite a few organizations and developers who need big-data tools quickly and can work within the limitations of BigQuery. It could be a boon for startups, for example, who can begin using BigQuery as they start up, with smaller datasets, and limited resources, including time, the one thing with regards to which BigQuery beats every other tool out there in the market.