Today, live or up-to-date third-party datasets provide a wealth of information that needs to be correctly analyzed at a scale that has traditionally required custom solutions and many person-hours of work. On top of that, sharing these analyses is critical for avoiding duplicate work.
This blog post demonstrates how Ibis works with BigQuery to provide an elegant and flexible Python interface for composing SQL queries.
As an example, consider the following SQL query that computes the acquisition rate of a number of different companies. Here the acquisition rate is defined as the number of acquisitions (or IPOs) divided by the number of investments a particular venture capital (VC) firm has made. It’s a complex query that proves to be as difficult to debug as it is to compose:
There’s a bug in this query, and it’s pretty difficult to spot.
Spoiler alert: the tables in the “t” subquery’s FROM clause are missing aliases that are referenced elsewhere in the query.
The error in this query is difficult to spot because it’s a single character in a sea of text. SQL has evolved through years of engineering to take on more capabilities and more syntax; while it has grown more powerful, it has also grown more complex.
However, SQL isn’t going anywhere anytime soon. There are some extremely powerful analytic SQL engines available, including Apache Hive, Cloudera Impala, Amazon Redshift, and of course Google BigQuery. The common thread among all of these systems is SQL.
One of the downsides of using SQL is that it’s a closed system: you can’t easily use another language to solve the same problem at the same scale in the same amount of time. This leads to code that passes opaque SQL strings around, often to the chagrin of the folks reading and writing it. Imagine the query above but with template parameters and additional string processing. It quickly becomes hard to manage this kind of code, let alone turn it into something that can be trusted and widely used across a large company.
Over in the Python world, libraries such as pandas provide convenient and performant APIs for wrangling data. From joins to time series analysis, pandas makes it convenient to do complex data wrangling and analysis from start to finish. One downside of pandas is that it becomes challenging to use once you need to distribute your computations.
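As a small illustration of that convenience, here is a sketch of the kind of join-then-aggregate wrangling pandas makes easy (the table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: orders and a lookup table of users
orders = pd.DataFrame({"user": ["a", "b", "a"], "amount": [10.0, 5.0, 7.5]})
users = pd.DataFrame({"user": ["a", "b"], "region": ["us", "eu"]})

# Join the two tables on the user key, then total revenue per region
per_region = orders.merge(users, on="user").groupby("region")["amount"].sum()
```

A few chained method calls replace what would be a join plus a GROUP BY in SQL, and the intermediate results are ordinary DataFrames you can inspect at every step.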
How Ibis helps you scale
Ibis is a Python analytics library designed to provide the convenience of pandas’ APIs with the scalability of analytic SQL engines like BigQuery. It does this in a type-safe way, letting you build analytics expressions that compile to SQL and run on your favorite large-scale SQL engine. When you execute Ibis expressions, they turn into pandas DataFrames, giving you access to the ecosystem of Python data libraries once you no longer need the scale that distributed SQL provides.
With Ibis, you no longer need to think about certain classes of bugs like the aliasing problem we saw earlier. Invalid operations (like taking the sum of a string column) are simply not available in the API.
Another goal of Ibis is that the expressions it creates should be composable. In this scenario, composable means that expression methods return other expressions. The following Ibis code chains together expressions to compute the percentage of GitHub repositories created between the hours of midnight and 4 AM, relative to the time zone reported in the created_at field of the table:
Looking at df, the result of executing the expression, gives the following result:
This example shows the composability of Ibis and just how concisely you can write queries.
BigQuery and Ibis make an extremely powerful combination. The Ibis team collaborates closely with the folks working on the BigQuery Python interface, and we aim to keep it that way!
Check out the documentation here, and the community tutorial to get a feel for what it’s like to use Ibis.
A version of this post originally appeared on the Google Cloud blog.