
Selling Digital Goods in Data Markets: Part 2, Accessing Data

In the first installment of this series, I introduced the concept of data markets. I outlined the types of markets and their main features, and then highlighted four of the major data market players.

In this article, I’ll show you how to go about choosing markets to use and then extracting data from them.

Comparing data market features

As a reminder, here are some of the desirable general purpose data market features I discussed in the previous article:

Given my criteria above, the question I need to answer is: which of the four markets we looked at in the first installment of this series best meets my needs?

Here’s a detailed breakdown of the markets addressing each of the desirable features:

The markets are listed alphabetically; each entry covers free access, topics and datasets (as of this writing), data dumps, web API, and libraries and SDKs.

Factual

  • Free access: Free “as is” API & SDK access, data dumps for developers
  • Topics and datasets: 597,620 datasets spanning a wide range of topics
  • Data dumps: CSV (free; premium datasets available upon request; search Factual for “downloadable” data here)
  • Web API: RESTful with JSON responses (read the Factual documentation, including the REST API docs)
  • Libraries and SDKs: JavaScript+HTML for web apps, iPhone SDK, Android SDK (under development), FactualR for R developers and researchers, the ruby-factual gem, and Net-HTTP-Factual for Perl

Freebase

  • Free access: Most of Freebase’s data is available for free under the Creative Commons Attribution license
  • Topics and datasets: Millions of entities across hundreds of types and a variety of domains (click to see the schema and search the entity graph)
  • Data dumps: Full dumps in TSV format (click to access recent dumps)
  • Web API: RESTful with JSON responses (includes MQL queries and web API)
  • Libraries and SDKs: A number of different language libraries (eleven as of this writing, including Python and Java)

Infochimps

  • Free access: Free “Baboon” level plan (100,000 API calls/month, 2,000 calls/hour burst, attribution required)
  • Topics and datasets: 12,995 datasets spanning a wide range of tagged topics including “bigdata”
  • Data dumps: Varies for each dataset; mostly TSV, CSV, or YAML
  • Web API: RESTful with JSON responses for some datasets (read the API documentation); YQL bindings available for a small subset of Twitter-related calls (influence, trstrank, and wordbag)
  • Libraries and SDKs: Ruby, Python, PHP, and command line; Python libs for Twitter dataset operations include infochimpy and python-infochimps

Windows Azure Marketplace DataMarket

  • Free access: Some free datasets available, but no easy way to search for them
  • Topics and datasets: 86 datasets across a variety of categories
  • Data dumps: No documentation found, including no mention in the FAQ
  • Web API: Uses Microsoft-backed OData, which provides a consistent API across datasets
  • Libraries and SDKs: Focused around supporting Microsoft-specific technologies (C#, Windows Phone 7, Silverlight, etc.)

Eliminating data markets that don’t meet our needs

Time to start eliminating market candidates. One of the first things you may notice above is that the Microsoft market contains a much smaller number of datasets than the other data markets we’re considering. This may be because of its relatively young age, as it was only launched in late 2010. Nonetheless, fewer datasets means fewer options in the interesting things you can do with the market. This would be a deal-breaker for me by itself, but in addition it also fails to provide any libraries or examples for non-Microsoft programming languages and tools a la Python. Therefore I’m removing Windows Azure Marketplace DataMarket from further consideration.

The next factor I’m considering is how much structure is imposed by the market on its datasets. I want the freedom to draw my own conclusions from the data rather than being forced into a market’s predetermined taxonomy, hierarchy, or categorization of its contents. Since Freebase is built from the ground up with a heavily structured entity-oriented approach, I’m also eliminating it as an option. Keep in mind that if you want to operate on entities, however, Freebase might be a good choice.

This leaves me needing to choose between Factual and Infochimps. Both offer a significant amount of free API and data dump options for developers, both have a wide range of datasets available, and both provide a RESTful web API with JSON server responses. It’s possible that both markets have something I’d like to use, and that’s fine. Or perhaps I’ll decide to stick with just one for my particular application. Let’s explore the application to see what data is available in each market.

Finding data for the problem at hand

At this point it’s useful to look at the needs of our application and use that to search each of the markets for data that may be useful. Obviously your application dictates your dataset choices, so any given market may or may not meet your needs for a particular application. In other words, Your Mileage May Vary.

Given that disclaimer, we do need to pick some sort of example application to make the rest of this article specific enough to be illustrative.

For this particular article’s purposes, I am interested in building an application that uses Twitter user information. I’d like to have access to general purpose user information, i.e. a large dataset across all Twitter users. I’d like to be able to query that user information to figure out important things about users, especially their influence on other Twitter users.

To locate potential datasets of interest, I searched Factual for datasets with “Twitter” in their table name, author, or description across “All Topics” and sorted by “Relevance” (click here to run the search yourself):

As pictured above, my search returned a number of datasets, including an alphabetical listing of Twitter services and applications, a short table of the most-followed network and cable journalists on Twitter, and various other things. Unfortunately, it does not return any sort of generalized Twitter user information. There does not appear to be a dataset covering that sort of information in Factual, so let’s turn to Infochimps and see if we can find what we’re looking for.

Performing a similar search of Infochimps (click here to run the search yourself), we get:

Note that Infochimps also uses a system of tags to help you find datasets with certain qualities. You can see a complete list of Infochimps tags here, or click here to see datasets tagged “twitter”.

Reviewing the results from the Infochimps “Twitter” search, you will see a number of free datasets. You’ll also see some datasets that cost money, some that are located offsite (not in the Infochimps system), and some that provide an Infochimps-supported API. We’re most interested in those with an API for this article. Infochimps has tagged API-enabled datasets with the tag “awesomeapi” (click to search):

Let’s search for datasets matching both our “Twitter” and “awesomeapi” interest (click here):

Since we want to get a measure of Twitter influence via an API, let’s add one more thing to our search, “influence”. Click here to execute the search and see the narrowed down results below:

I’m going to use the influence metrics dataset; more on how to access its data below.

Before we move on, however, there’s one more tag you might want to note: “bigdata” denotes large datasets of the kind we discussed in the first article in this series. You can combine awesomeapi+bigdata to get a list of the corresponding big datasets with an API.

Accessing Twitter influence data

Thankfully after all the work to locate the data, actually accessing it is relatively easy. The exact process varies from one data market to another, but you can draw some general conclusions from how Infochimps does it in the particular example we’re using here.
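In REST terms, that access typically boils down to building a query URL with your API key and parsing a JSON response. Here’s a minimal Python sketch; note that the endpoint path (trstrank) and the response fields are my assumptions modeled on the Infochimps Twitter API of this period, so check the API documentation for the real values.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint path; verify against the Infochimps API docs.
BASE_URL = "http://api.infochimps.com/soc/net/tw/trstrank.json"

def build_influence_url(screen_name, apikey):
    """Construct the REST query URL for a user's influence metrics."""
    return BASE_URL + "?" + urlencode({"screen_name": screen_name,
                                       "apikey": apikey})

def parse_influence(raw_json):
    """Pull the fields we care about out of the JSON response body."""
    data = json.loads(raw_json)
    return data.get("screen_name"), data.get("trstrank")

url = build_influence_url("billday", "api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469")
# A canned response shaped like a real one, for illustration only:
sample = '{"screen_name": "billday", "trstrank": 5.5}'
name, rank = parse_influence(sample)
print(name, rank)  # billday 5.5
```

In a live application you would fetch the URL (for example with urllib.request) and feed the response body to parse_influence.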

Click here to read the complete article on the PayPal X Developer Network, including the details of accessing Twitter influence metrics from the dataset’s page, via REST, and via YQL. A Python example is provided. You may also wish to read a related blog post concerning what I learned from writing this article and a YQL gotcha to avoid.

On data markets and YQL gotchas

I just submitted the second article in my current PayPal X DevZone series on data markets (read the first installment here) and want to share a couple of things I learned as a sneak peek of sorts.  I also want to call out a potential YQL gotcha I discovered while developing this second article.

As a part of the article, I put together a table summarizing the data market features that matter to me.  What I learned from that exercise:  There is a lot of variability in supported programming languages from market to market.

If you’re considering different markets for your data needs, you should investigate their available libraries, including third party packages, up front.  That way you won’t get any surprises after you’ve already committed to a particular dataset (even worse if you had to pay something for that data).  For example, I want to use markets that support Python-based development, and not all of the markets I investigated do.  Whatever your language(s) of choice, I would encourage you to read the article once it’s published for more details.

Another thing that jumped out at me:  It is critically important that a market provides a good search interface to help locate pertinent datasets.  Some of the markets I investigated in my article do, others do not.  I’ll let you draw your own conclusions after you read the piece, but suffice it to say that I’m partial to the ones that make surfacing datasets simple and quick.

Now on to the YQL gotcha:  As part of my article, I developed a simple example that pulls Twitter user influence metrics out of an Infochimps dataset.  It does this using the Infochimps provided YQL influence data table.  My original YQL statement was naively:

select * from infochimps.influence where screen_name='billday' and apikey='api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469'

This returns the expected influence metrics when executed in the YQL console:
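You don’t have to use the console, either: the same statement can be issued against YQL’s public REST endpoint. One detail worth knowing is that community Open Data Tables such as infochimps.influence are only visible when the env parameter points at the community environment file. A sketch (the endpoint and environment URLs reflect YQL as of this writing):

```python
from urllib.parse import urlencode

# Public YQL REST endpoint; community Open Data Tables require the
# env parameter pointing at the community environment file.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"
COMMUNITY_ENV = "store://datatables.org/alltableschemata"

def yql_url(statement):
    """Build the GET URL for a YQL statement with community tables loaded."""
    params = urlencode({"q": statement,
                        "format": "json",
                        "env": COMMUNITY_ENV})
    return YQL_ENDPOINT + "?" + params

query = ("select * from infochimps.influence "
         "where screen_name='billday' "
         "and apikey='api_test-W1cipwpcdu9Cbd9pmm8D4Cjc469'")
print(yql_url(query))
```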

When you run the same query embedded in Python code using the Python-YQL package, however, you get a “No definition found for Table” error (pictured in the ActivePython Community Edition shell):

Click here for the solution to the YQL problem available in the complete post on the PayPal X Developer Network.

21 Recipes for Mining Twitter by Matthew Russell

Cover of 21 Recipes for Mining Twitter by Matthew Russell

“21 Recipes for Mining Twitter” by Matthew Russell provides readers with a problem-oriented crash course in using Python and freely available third-party Python packages to mine social data from Twitter. It assumes familiarity with Python and makes quick progress through extracting and using different types of user and streaming data available from the Twitter API via the twitter package in particular.

I like the general approach of calling out problems to be solved, then addressing them one by one with a “recipe” for each. Some might complain that this approach results in disjointedness from one recipe to the next, but in fact that’s a feature, not a bug. “21 Recipes for Mining Twitter” is actually a spin-off of Russell’s more in-depth “Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites” (also an O’Reilly book). The latter shows many more examples, and not just for mining Twitter, but also for harvesting and analyzing social data from many other services and APIs. So if you need down-and-dirty recipes for Twitter alone, get this book, and if you need more blanks filled in for Twitter mining and/or information on accessing data from other social services, get the other book. Or heck, get them both!

This book is recommended for: Anyone already familiar with programming who is looking to solve specific problems using Twitter data. It’s a slam dunk for Python programmers.

Recommended with reservations for: Non-programmers interested in social data analysis, with the caveat that they will probably need to spend some time working through the Python.org tutorial or getting up to speed with Python elsewhere before they can make great progress with this book.

Disclaimer: I know the author and have worked with him in various capacities on other projects, but not this book. Even if I didn’t know him, however, I’d still love and make use of this book.

Access this book’s O’Reilly catalog page by clicking here.

Notes from the week of 2011-03-13

PayPal X Platform

Big data

Wireless and mobility

  • The Register's article on the collapse of Nokia's technologies http://bit.ly/ggrFDK is the best I've read on why NOK failed #
  • Does Starbucks' card mobile program work precisely because they control five key elements described here? http://bit.ly/ee5awH via @mFoundry #

APIs and development

Personal things

  • Researching pulmonary embolism http://bit.ly/fkY1ov for a family member; thank God for warfarin http://bit.ly/g5DMKQ #
  • S#*! my wife says: "Toby Keith is the male Cher" (then sings "God love her" in Jack 2000-imitating-Cher voice) #

Running

Factual US POI table

Tulsa Running Club February 2011 newsletter

This month’s newsletter discusses the Post Oak Challenge, among other things.


Apigee-to-Go embeddable consoles

Apigee has launched their new Apigee-to-Go embeddable consoles, and PayPal has availed themselves of the opportunity.

Why provide an embeddable version of their consoles?  According to Apigee:

Time and again we’ve heard API providers ask for a way to build their own consoles and give it to their own developers, on their own sites…  So today we’re launching Apigee To-Go, which lets you create your own API Console, skin it with the look and feel to match your brand, and then embed it where your developers are, on your own site or portal — for free.

You can read more about the new embeddable consoles via ReadWriteWeb’s coverage (click here).

Apigee simultaneously announced that three of their partners, PayPal, LinkedIn, and SoundCloud, are already offering custom-built consoles to their developers.  PayPal is currently making their console available via a link on the Adaptive Payments API page.

Here’s a look at PayPal’s console (click the screenshot to try out the console for yourself):


Click to read the complete post on the PayPal X Developer Network.

Selling Digital Goods in Data Markets: Part 1, Innovators

Big data and data markets are hot topics these days. I think it’s important to understand a bit about the former in order to best comprehend why we’re seeing skyrocketing interest in the latter.

So what is “big data”? The best definition I’ve encountered to date comes from my friend Mike Loukides from his O’Reilly Radar post “What is data science?” (also available as a PDF report here):

“Big data” is when the size of the data itself becomes part of the problem

As the world generates incredible volumes of new data, the amount available for analysis (turning data into actionable knowledge) increases exponentially. This acceleration in data volume results in more and more problems you might want to solve being “big data” problems.

Want to analyze Middle Eastern dissent and citizen uprisings in Tunisia, Egypt, and Libya by gulping tweets from the Twitter firehose? Big data problem. Build a location-aware mobile commerce app using phone sensor and cellular network information? Big data, check. Comparative genomic research across multiple genomes? Big data, check check. You get the idea.

The difficulty for many developers is that sources of clean, easy to use “big data” are not always apparent. Enter data markets.

Data markets and their many flavors

There’s a fundamental question we need to answer right up front: What is a data market?

In researching this article, I was fortunate enough to spend some time discussing data markets with Grant Nestor of Factual (more on his company later). When I asked Grant to define “data market”, here’s what he said:

A data market is a destination where data is exchanged for other data, money, or things of value.

I like Grant’s definition for two reasons. First, it emphasizes that one can “go” to a data market, i.e. there’s a web presence where you can research and find available sources of data. Second, it points out that we don’t just buy data from markets, but rather that we might also exchange other data or things of value for it. Both of these will be important later as we talk about what features various markets provide.

Data markets can be further divided into several varieties based upon their focus and how they present their available datasets for consumption by developers, analysts, and other users. A recent Strata conference session on “Building and Pricing the Open Data Marketplace” (as reported in these notes from attendee Paul Miller) broke out several types of markets including:

  • Data catalogs – markets which pull together links to various datasets; the data may be hosted in the market’s own storage or linked to elsewhere; catalog markets are meant to make it easier to locate datasets of interest, but the data itself may not always be as “fresh” as the data provided by the next class of markets
  • Realtime feeds – services which provide direct access to constantly updating streams of relevant data a la Gnip’s Twitter feeds; super fresh, but can provide an overwhelming volume of data to store and process if you’re not careful
  • Free public data sources – often mandated by governments and NGOs, these provide access to useful data, but the data may be poorly structured or “dirty”, making it more difficult for a developer to use
  • Graphics oriented services – services meant more for analysts than developers, heavy on built-in visualization tools and spreadsheet support but often lacking in programmatic (API) access

Which of these types of data markets will meet the modern web developer’s needs? I can’t speak to every possible scenario, but I do know what I’m looking for in my own web API oriented development.

Features I want in a data market

For this data market series I’m interested in data markets that will let us explore their offerings cheaply and efficiently. Here then is what I am looking for from an ideal data market:

  • The ability to try before I buy with some sort of free developer offering
  • A general purpose market with a variety of data available (many vertical- and domain-specific markets are also available if you ever need them)
  • A variety of methods to access the data, ideally including web browser options, charting tools for displaying data trends after I find them in the browser, data dumps whereby I can download the desired data to operate on locally, and most importantly web query language and/or web API options that let me hack on the data living in the market’s servers
  • RESTful API which returns JSON output (XML is my second choice); a YQL binding would be very nice to have as an option, too
  • Bindings for a general purpose language, ideally Python or Java; the more languages are supported via client libraries, the better

Note that I am focusing on general purpose data markets which provide a free (as in beer) public API to access their data. Sometimes these markets are referred to as providing “Data as a Service” (DaaS) or “cloud data”. If a market doesn’t wrap its data via an API, in my opinion it’s making things too difficult for the developer. (Mediocre government data dump sites, I’m looking at you.)

Every market we’ll discuss below contains a variety of data sources and at least a certain level of access available for free so that developers can get started quickly and inexpensively. I’ll be primarily discussing DaaS data catalogs (many of these also contain free public datasets) for the rest of this series of data market articles.

The major data market catalog players

While DataMarket.com is itself a more narrowly focused data catalog and thus not up for consideration for this series given my criteria above, it has provided an excellent overview of the data market competitive landscape on its blog. Click here to read “The Emerging Field of Data Markets” post. All four of the data markets I’ve chosen to discuss further below are outlined in that blog post.

Factual

Factual (@factual) provides a general purpose market with public APIs that developers can start using for free. Factual’s market enables developers to share and reuse data. You can participate in enriching and expanding the available data by both contributing new data and updating existing data. In Grant Nestor’s words:

Factual is actively seeking a “virtuous circle” which benefits everyone.

Factual will in fact cut businesses a deal if they agree to upload some of their own data, or enrichments to Factual data that they use, back into the Factual system.

A different Factual employee described their goal in a company blog post thusly:

We’re trying to weave together every fact we can (and not just from the internet!) into databases that developers can feel confident about building off of and contributing back into.

Factual espouses the importance and power of open data. CEO and Founder Gil Elbaz spoke about their approach and its advantages in a recent interview on the Web 2.0 blog (click here to access the post, then play the embedded audio to hear his discussion with the author).

Factual currently exposes many different datasets via their data market search. Their primary focus to date has been around empowering developers with basic information about businesses and geographic points of interest. In fact one of their better known customers so far has been Facebook, which has loaded in Factual point of interest (POI) data for various countries including the UK and Japan to be used by Facebook Places. Here’s an example of one of the datasets available in Factual, the US POI and Business Listings which currently contains more than 13 million places:

Although Factual does provide a lot of local and business geodata, they also provide a wide variety of other data from many domains including entertainment, education, government, health, and more. For more information on what’s available you can browse the available dataset topics.

Factual offers RESTful API access via their Server API on the Developer Tools page. They also provide CSV download, iPhone SDK, and HTML+CSS+JavaScript web access options, with an Android SDK coming at some point in the future. I’ll discuss these in more detail and show specific examples of using Factual data in the next article in this series.
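To give a feel for the Server API before the next article’s full examples, here’s a rough Python sketch of building a table read request. The path shape and parameter names are my assumptions modeled on Factual’s v2 REST documentation, and the table ID is a placeholder, so consult the Developer Tools page for the real details.

```python
import json
from urllib.parse import urlencode

def factual_read_url(table_id, api_key, filters):
    """Build a read request URL for one Factual table.

    filters is a dict of field/value constraints, serialized as JSON
    per the (assumed) v2 API convention.
    """
    params = urlencode({"api_key": api_key,
                        "filters": json.dumps(filters)})
    return "http://api.factual.com/v2/tables/%s/read?%s" % (table_id, params)

# Hypothetical query: Starbucks locations in Tulsa from a US POI table.
url = factual_read_url("some-table-id", "MY_API_KEY",
                       {"name": "Starbucks", "locality": "Tulsa"})
print(url)
```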

Freebase

Freebase (@fbase) also provides a data catalog of structured, updatable data akin to Factual. Freebase data spans a wide field of endeavors similar to Factual, and it has many millions of records available for developer use.

One difference between Freebase and Factual lies in Freebase’s entity-based approach. Freebase imposes more structure on the underlying data by assigning unique IDs to identified entities. This video provides a good description of why this is done:

This additional structure makes certain operations simpler, while at the same time making user contributions more difficult. You may benefit as a data user, but have more work to do as a data provider. You have to judge for yourself which of these two approaches is preferable for your data market needs.

Freebase was developed by Metaweb, a commercial entity which was purchased by Google in July 2010. Therefore Freebase is now a Google-run service.

You can learn more about Freebase by reading their “What is Freebase?” wiki page (click here). The developer API and Metaweb Query Language (MQL) are described via links from the Freebase Developers wiki page. The wiki also contains a page linking to client libraries designed to make using the Freebase API simpler (for example, click here to read about the Python library).
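MQL queries are query-by-example: you describe the JSON shape of the result you want and leave blanks ([] or null) for Freebase to fill in. As a rough sketch of how a client might issue one over plain REST (the mqlread service URL reflects Freebase as of this writing; verify against the Developers wiki):

```python
import json
from urllib.parse import quote

# The mqlread service endpoint as of this writing.
MQLREAD = "http://api.freebase.com/api/service/mqlread"

# "All albums by The Police" -- a classic MQL starter query; the empty
# list asks Freebase to fill in every matching album name.
mql = [{"type": "/music/artist",
        "name": "The Police",
        "album": []}]

def mqlread_url(query):
    """Wrap an MQL query in its envelope and build the GET URL."""
    envelope = json.dumps({"query": query})
    return MQLREAD + "?query=" + quote(envelope)

print(mqlread_url(mql))
```

The Freebase client libraries linked from the wiki hide this envelope-and-encode plumbing behind friendlier calls.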

Click to read the complete article on the PayPal X Developer Network including discussion of Infochimps and Windows Azure Marketplace DataMarket.

Free PayPal Android webinar

Android logo

Just in time for you to prepare to enter the PayPal X Developer Challenge for Android, PayPal has announced they are offering a free “Payments on the Android Platform” webinar on Wednesday, March 30th.

The PayPal X webinars page describes the event thusly:

Android developers — Are you looking for ways to monetize your app? PayPal offers several solutions for enabling payments for applications on the Android platform.  In this webinar, we will show you how quick and easy it is to integrate our mobile payment solutions.  We’ll highlight how the PayPal Mobile Payment Libraries enable in-app payments for a variety of use cases.  You will also learn how PayPal Mobile Express Checkout provides a mobile browser optimized user interface for online stores.

For more information on the speakers and how to register for the webinar, click here to read the complete post on the PayPal X Developer Network.

Notes from the week of 2011-03-06

Apple Event bingo card

PayPal X Platform

Big data

Wireless and mobility

APIs and development

Personal things

Running

Android apps for money and fame

It’s been a while since I wrote about the second PayPal X Developer Challenge.  In the months since we’ve seen mobile applications in general and Android apps in particular skyrocket in terms of consumer and developer interest.

So much so that it should come as no surprise that when PayPal announced their third Developer Challenge this week, it was aimed right at the mobile development sweet spot:  Android apps.

Android challenge logo

There’s some good money for the top three finishers:  A grand prize of $25,000 USD, second prize of $15,000, and third prize of $10,000.  Winners will also receive marketing and PR support from PayPal to help draw eyeballs and fingertips to their apps.  Plus, as PayPal points out in The BaldGeek’s announcement of the contest, there’s opportunity for your app to make money on its own, since by definition it will have PayPal-based secure payments baked into it from the get-go.

PayPal is looking for fresh mobile ideas integrating secure payments.  From the challenge site:

We’re looking for something new, something surprising, something with business potential, something that integrates PayPal payments.

The winning application will make us say, “Wow!” Also accepted:  “Cool!”, “Awesome!”, “I’m totally downloading this”, or “Check this out”.

Click here to read the full post on the PayPal X Developer Network including information on the key deadlines, contest terms, and judging guidelines.
