Selling Digital Goods in Data Markets: Part 2, Accessing Data
In the first installment of this series, I introduced the concept of data markets. I outlined the types of markets and their main features, and then highlighted four of the major data market players.
In this article, I’ll show you how to go about choosing markets to use and then extracting data from them.
Comparing data market features
As a reminder, here are some of the desirable general purpose data market features I discussed in the previous article:
- A free level of developer access
- A variety of data spanning a wide range of topics
- Several different methods to access the data, including at least data dumps (think CSV) and a web API (see the next point for preferred mechanisms)
- RESTful API which returns JSON; having a YQL binding (click here to learn more about YQL and how to use it) is even better
- General purpose client libraries in your language of choice (for me, Python or Java)
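To make the "RESTful API which returns JSON" criterion concrete, here is a minimal Python sketch of that access pattern. The base URL, parameter names, and response shape are hypothetical placeholders, not any real market's API; substitute the values from your chosen market's documentation:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint -- substitute the real base URL and parameters
# from whichever market's API documentation you are using.
BASE_URL = "https://api.example-datamarket.com/v1/datasets/search"

def build_query_url(base_url, **params):
    """Compose a RESTful query URL from keyword parameters."""
    return base_url + "?" + urlencode(sorted(params.items()))

url = build_query_url(BASE_URL, q="twitter", apikey="YOUR_API_KEY")
print(url)

# A typical JSON response body decodes straight into Python dicts and
# lists -- this sample payload is illustrative only.
sample_response = '{"results": [{"name": "twitter-users", "rows": 12345}]}'
data = json.loads(sample_response)
print(data["results"][0]["name"])
```

The appeal of this combination is exactly what the sketch shows: a query is just a URL, and a response is just a dictionary, with no market-specific client code required.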
Given my criteria above, the question I need to answer is: which of the four markets we looked at in the first installment of this series best meets my needs?
Here’s a detailed breakdown of the markets addressing each of the desirable features:
Data Market (alphabetical) | Free Access | Topics and Datasets (as of this writing) | Data Dumps | Web API | Libraries and SDKs |
---|---|---|---|---|---|
Factual | Free “as is” API & SDK access, data dumps for developers | 597,620 datasets spanning a wide range of topics | CSV (free, premium datasets available upon request; search Factual for “downloadable” data here) | RESTful with JSON responses (read the Factual documentation including REST API docs) | JavaScript+HTML for web apps, iPhone SDK, Android SDK (under development), FactualR for R developers and researchers, ruby-factual gem, and Net-HTTP-Factual for Perl |
Freebase | Most of Freebase’s data is available for free under the Creative Commons Attribution license | Millions of entities across hundreds of types and a variety of domains (click to see the schema and search the entity graph) | Full dumps in TSV format (click to access recent dumps) | RESTful with JSON responses (includes MQL queries and web API) | A number of different language libraries (eleven as of this writing, including Python and Java) |
Infochimps | Free “Baboon” level plan (100,000 API calls/month, 2,000 calls/hour burst, attribution required) | 12,995 datasets spanning a wide range of tagged topics including “bigdata” | Varies for each dataset; mostly TSV, CSV, or YAML | RESTful with JSON responses for some datasets (read API documentation); YQL bindings available for a small subset of Twitter-related calls (influence, trstrank, and wordbag) | Ruby, Python, PHP, and command line; Python libs for Twitter dataset operations include infochimpy and python-infochimps |
Windows Azure Marketplace DataMarket | Some free datasets available, but no easy way to search for them | 86 datasets across a variety of categories | None documented (not even a mention in the FAQ) | Uses Microsoft-backed OData, which provides a consistent API across datasets | Focused on supporting Microsoft-specific technologies (C#, Windows Phone 7, Silverlight, etc.) |
Eliminating data markets that don’t meet our needs
Time to start eliminating candidates. The first thing you may notice above is that the Microsoft market contains far fewer datasets than the other markets we’re considering. This may be a function of its relative youth, as it only launched in late 2010. Nonetheless, fewer datasets means fewer interesting things you can do with the market. That alone would be a deal-breaker for me, but it also fails to provide any libraries or examples for non-Microsoft programming languages and tools such as Python. Therefore I’m removing Windows Azure Marketplace DataMarket from further consideration.
The next factor I’m considering is how much structure is imposed by the market on its datasets. I want the freedom to draw my own conclusions from the data rather than being forced into a market’s predetermined taxonomy, hierarchy, or categorization of its contents. Since Freebase is built from the ground up with a heavily structured entity-oriented approach, I’m also eliminating it as an option. Keep in mind that if you want to operate on entities, however, Freebase might be a good choice.
This leaves me needing to choose between Factual and Infochimps. Both offer a significant amount of free API and data dump access for developers, both have a wide range of datasets available, and both provide a RESTful web API with JSON responses. Both markets may well have something I’d like to use, and that’s fine; or perhaps I’ll decide to stick with just one for my particular application. Let’s look at the application itself to see what data is available in each market.
Finding data for the problem at hand
At this point it’s useful to look at the needs of our application and use that to search each of the markets for data that may be useful. Obviously your application dictates your dataset choices, so any given market may or may not meet your needs for a particular application. In other words, Your Mileage May Vary.
Given that disclaimer, we do need to pick some sort of example application to make the rest of this article specific enough to be illustrative.
For this particular article’s purposes, I am interested in building an application that uses Twitter user information. I’d like to have access to general purpose user information, i.e. a large dataset across all Twitter users. I’d like to be able to query that user information to figure out important things about users, especially their influence on other Twitter users.
To locate potential datasets of interest, I searched Factual for datasets with “Twitter” in their table name, author, or description across “All Topics” and sorted by “Relevance” (click here to run the search yourself).
That search returned a number of datasets, including an alphabetical listing of Twitter services and applications, a short table of the most-followed network and cable journalists on Twitter, and various other things. Unfortunately, it does not return any sort of generalized Twitter user information. There does not appear to be a dataset covering that sort of information in Factual, so let’s turn to Infochimps and see if we can find what we’re looking for.
Next, let’s perform a similar search of Infochimps (click here to run the search yourself).
Note that Infochimps also uses a system of tags to help you find datasets with certain qualities. You can see a complete list of Infochimps tags here, or click here to see datasets tagged “twitter”.
Reviewing the results from the Infochimps “Twitter” search, you will see a number of free datasets. You’ll also see some datasets that cost money, some that are located offsite (not in the Infochimps system), and some that provide an Infochimps-supported API. We’re most interested in those with an API for this article. Infochimps has tagged API-enabled datasets with the tag “awesomeapi” (click to search):
Let’s search for datasets matching both our “Twitter” and “awesomeapi” criteria (click here to run the combined search).
Since we want a measure of Twitter influence via an API, let’s add one more term to our search: “influence”. Click here to execute the search and see the narrowed-down results.
I’m going to use the influence metrics dataset; more on how to access its data below.
Before we move on, however, there’s one more tag you might want to note: “bigdata” denotes large datasets of the kind we discussed in the first article in this series. You can combine awesomeapi+bigdata to get a list of the corresponding big datasets with an API.
Accessing Twitter influence data
Thankfully, after all the work to locate the data, actually accessing it is relatively easy. The exact process varies from one data market to another, but you can draw some general conclusions from how Infochimps handles the particular example we’re using here.
Click here to read the complete article on the PayPal X Developer Network, including the details of accessing Twitter influence metrics from the dataset’s page, via REST, and via YQL. A Python example is provided. You may also wish to read a related blog post about what I learned from writing this article and a YQL gotcha to avoid.
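As a rough illustration of the REST approach (the complete article linked above has the authoritative version), here is a short Python sketch. The endpoint path and response fields reflect my reading of the Infochimps trstrank documentation and may have changed, so treat them as assumptions and verify against the current API docs:

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the trstrank influence metric -- verify the path
# and parameter names against the current Infochimps API documentation.
TRSTRANK_URL = "http://api.infochimps.com/soc/net/tw/trstrank.json"

def trstrank_url(screen_name, apikey):
    """Build the REST query URL for a Twitter user's trstrank score."""
    return TRSTRANK_URL + "?" + urlencode(
        [("screen_name", screen_name), ("apikey", apikey)])

url = trstrank_url("infochimps", "YOUR_API_KEY")
print(url)

# A live call is just urllib.request.urlopen(url).read() with a valid
# API key. The JSON body decodes into a plain dict; this sample is
# illustrative of the documented shape, not captured live data.
sample = '{"screen_name": "infochimps", "trstrank": 7.2}'
score = json.loads(sample)["trstrank"]
print(score)
```

Note that the API key travels as a plain query parameter, so keep it out of shared code and logs; everything else is the same URL-plus-JSON pattern we required in our selection criteria.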