Selling Digital Goods in Data Markets: Part 1, Innovators
Big data and data markets are hot topics these days. I think it’s important to understand a bit about the former in order to best comprehend why we’re seeing skyrocketing interest in the latter.
So what is “big data”? The best definition I’ve encountered to date comes from my friend Mike Loukides from his O’Reilly Radar post “What is data science?” (also available as a PDF report here):
“Big data” is when the size of the data itself becomes part of the problem
As the world generates incredible volumes of new data, the amount available for analysis (turning data into actionable knowledge) increases exponentially. This acceleration in data volume results in more and more problems you might want to solve being “big data” problems.
Want to analyze Middle Eastern dissent and citizen uprisings in Tunisia, Egypt, and Libya by gulping tweets from the Twitter firehose? Big data problem. Build a location-aware mobile commerce app using phone sensor and cellular network information? Big data, check. Comparative genomic research across multiple genomes? Big data, check, check. You get the idea.
The difficulty for many developers is that sources of clean, easy to use “big data” are not always apparent. Enter data markets.
Data markets and their many flavors
There’s a fundamental question we need to answer right up front: What is a data market?
In researching this article, I was fortunate enough to spend some time discussing data markets with Grant Nestor of Factual (more on his company later). When I asked Grant to define “data market”, here’s what he said:
A data market is a destination where data is exchanged for other data, money, or things of value.
I like Grant’s definition for two reasons. First, it emphasizes that one can “go” to a data market, i.e. there’s a web presence where you can research and find available sources of data. Second, it points out that we don’t just buy data from markets, but rather that we might also exchange other data or things of value for it. Both of these will be important later as we talk about what features various markets provide.
Data markets can be further divided into several varieties based upon their focus and how they present their available datasets for consumption by developers, analysts, and other users. A recent Strata conference session on “Building and Pricing the Open Data Marketplace” (as reported in these notes from attendee Paul Miller) broke out several types of markets including:
- Data catalogs – markets which pull together links to various datasets; the data may be hosted in the market’s own storage or it may be linked to elsewhere; catalog markets are meant to make it easier to locate datasets of interest, but the data itself may not always be as “fresh” as the data provided by the next class of markets
- Realtime feeds – services which provide direct access to constantly updating streams of relevant data a la Gnip’s Twitter feeds; super fresh, but can provide an overwhelming volume of data to store and process if you’re not careful
- Free public data sources – often mandated by governments and NGOs, these provide access to useful data but the data may be poorly structured or “dirty”, making it more difficult for a developer to use
- Graphics oriented services – services meant more for analysts than developers, heavy on built-in visualization tools and spreadsheet support but often lacking in programmatic (API) access
Which of these types of data markets will meet the modern web developer’s needs? I can’t speak to every possible scenario, but I do know what I’m looking for in my own web API-oriented development.
Features I want in a data market
For this data market series I’m interested in data markets that will let us explore their offerings cheaply and efficiently. Here then is what I am looking for from an ideal data market:
- The ability to try before I buy with some sort of free developer offering
- A general purpose market with a variety of data available (many vertical- and domain-specific markets are also available if you ever need them)
- A variety of methods to access the data, ideally including web browser options, charting tools for displaying data trends after I find them in the browser, data dumps whereby I can download the desired data to operate on locally, and most importantly web query language and/or web API options that let me hack on the data living in the market’s servers
- RESTful API which returns JSON output (XML is my second choice); a YQL binding would be very nice to have as an option, too
- Bindings for a general purpose language, ideally Python or Java; the more languages are supported via client libraries, the better
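To make the REST-plus-JSON item on this wish list concrete, here is a minimal sketch of what querying such a market might look like from Python. Everything market-specific in it is an assumption made up for illustration: the base URL, the `filters` and `limit` parameters, and the `rows` key in the response do not belong to any real data market. The response is a canned string so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; not the endpoint of any real data market.
BASE_URL = "https://api.example.com/v1/datasets"


def build_query_url(dataset, filters, limit=20):
    """Build a GET URL for a hypothetical REST-style dataset endpoint."""
    # The "filters" and "limit" parameter names are assumptions.
    params = {"filters": json.dumps(filters, sort_keys=True), "limit": limit}
    return f"{BASE_URL}/{dataset}/rows?{urlencode(params)}"


def parse_rows(body):
    """Turn a JSON response body into a list of row dictionaries."""
    payload = json.loads(body)
    # The "rows" key is an assumed response shape.
    return payload.get("rows", [])


# Canned response standing in for what such an API might return.
SAMPLE_BODY = (
    '{"rows": [{"name": "Cafe Uno", "city": "Austin"},'
    ' {"name": "Book Nook", "city": "Austin"}]}'
)

if __name__ == "__main__":
    print(build_query_url("places", {"city": "Austin"}, limit=2))
    for row in parse_rows(SAMPLE_BODY):
        print(row["name"], "-", row["city"])
```

The appeal of JSON here is that the response maps directly onto native dictionaries and lists, so no market-specific parsing code is needed before you can start working with the data.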
Note that I am focusing on general purpose data markets which provide a free (as in beer) public API to access their data. Sometimes these markets are referred to as providing “Data as a Service” (DaaS) or “cloud data”. If a market doesn’t wrap its data via an API, in my opinion it’s making things too difficult for the developer. (Mediocre government data dump sites, I’m looking at you.)
Every market we’ll discuss below contains a variety of data sources and at least a certain level of access available for free so that developers can get started quickly and inexpensively. I’ll be primarily discussing DaaS data catalogs (many of these also contain free public datasets) for the rest of this series of data market articles.
The major data market catalog players
While DataMarket.com is itself a more narrowly focused data catalog and thus not up for consideration for this series given my criteria above, it has provided an excellent overview of the data market competitive landscape on its blog. Click here to read “The Emerging Field of Data Markets” post. All four of the data markets I’ve chosen to discuss further below are outlined in that blog post.
Factual
Factual (@factual) provides a general purpose market with public APIs that developers can start using for free. Factual’s market enables developers to share and reuse data. You can participate in enriching and expanding the available data by both contributing new data and updating existing data. In Grant Nestor’s words:
Factual is actively seeking a “virtuous circle” which benefits everyone.
Factual will in fact cut businesses a deal if they agree to upload some of their own data, or enrichments to Factual data that they use, back into the Factual system.
A different Factual employee described their goal in a company blog post thusly:
We’re trying to weave together every fact we can (and not just from the internet!) into databases that developers can feel confident about building off of and contributing back into.
Factual espouses the importance and power of open data. CEO and Founder Gil Elbaz spoke about their approach and its advantages in a recent interview on the Web 2.0 blog (click here to access the post, then play the embedded audio to hear his discussion with the author).
Factual currently exposes many different datasets via their data market search. Their primary focus to date has been around empowering developers with basic information about businesses and geographic points of interest. In fact, one of their better known customers so far has been Facebook, which has loaded in Factual point of interest (POI) data for various countries including the UK and Japan to be used by Facebook Places. One example of the datasets available in Factual is the US POI and Business Listings, which currently contains more than 13 million places.
Although Factual does provide a lot of local and business geodata, they also provide a wide variety of other data from many domains including entertainment, education, government, health, and more. For more information on what’s available you can browse the available dataset topics.
Factual offers RESTful API access via their Server API on the Developer Tools page. They also provide CSV download, iPhone SDK, and HTML+CSS+JavaScript web access options, with an Android SDK coming at some point in the future. I’ll discuss these in more detail and show specific examples of using Factual data in the next article in this series.
Freebase
Like Factual, Freebase (@fbase) provides a data catalog of structured, updatable data. Its data spans a similarly wide range of domains, and it has many millions of records available for developer use.
One difference between Freebase and Factual lies in Freebase’s entity-based approach. Freebase imposes more structure on the underlying data by assigning unique IDs to identified entities. This video provides a good description of why this is done:
This additional structure makes certain operations simpler, while at the same time making user contributions more difficult. You may benefit as a data user, but have more work to do as a data provider. You have to judge for yourself which of these two approaches is preferable for your data market needs.
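The tradeoff can be sketched with a toy example. This is not Freebase’s actual data model or ID scheme; the IDs and the `contained_by` property below are invented purely to show why unique entity IDs help the data user while asking more of the contributor.

```python
# Row-oriented records: the string "Paris" is ambiguous, and nothing
# ties either row to a single, well-defined place.
rows = [
    {"place": "Paris", "country": "France"},
    {"place": "Paris", "country": "United States"},
]

# Entity-oriented records: every thing gets a unique ID, and
# relationships point at IDs rather than at names.
# All IDs and property names here are made up for illustration.
entities = {
    "/example/paris_france": {"name": "Paris", "contained_by": "/example/france"},
    "/example/paris_texas": {"name": "Paris", "contained_by": "/example/united_states"},
    "/example/france": {"name": "France"},
    "/example/united_states": {"name": "United States"},
}


def find_by_name(name):
    """Return the IDs of every entity that carries the given name."""
    return sorted(eid for eid, e in entities.items() if e["name"] == name)


def container_name(entity_id):
    """Follow the contained_by link and return the parent's name, if any."""
    parent_id = entities[entity_id].get("contained_by")
    return entities[parent_id]["name"] if parent_id else None


if __name__ == "__main__":
    for eid in find_by_name("Paris"):
        print(eid, "is in", container_name(eid))
```

For the data user, once you hold an ID there is no doubt about which Paris you mean, and following a link is a plain lookup. For the data provider, every new record must first be matched to an existing ID (or given a new one) before it can be added, which is the extra work described above.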
Freebase was developed by Metaweb, a commercial entity which was purchased by Google in July 2010. Freebase is therefore now a Google-run service.
You can learn more about Freebase by reading their “What is Freebase?” wiki page (click here). The developer API and Metaweb Query Language (MQL) are described via links from the Freebase Developers wiki page. The wiki also contains a page linking to client libraries designed to make using the Freebase API simpler (for example, click here to read about the Python library).
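To give a feel for MQL before following those links: it is a query-by-example language expressed in JSON, where you fill in the values you know and leave empty slots for the values you want returned. The sketch below only builds and inspects such a query in Python; it does not call any Freebase endpoint. The type and property names and the `{"query": ...}` envelope are illustrative and should be checked against the Freebase documentation, and `to_envelope` and `requested_fields` are hypothetical helpers, not part of any client library.

```python
import json

# Query by example: supply the values you know and leave empty slots
# (None or []) for the values you want the service to fill in.
# The type and property names are illustrative; verify them against
# the Freebase documentation before relying on them.
query = [{
    "type": "/music/artist",
    "name": "The Police",
    "album": [],
}]


def to_envelope(q):
    """Wrap a query in a {"query": ...} envelope and serialize it to JSON."""
    return json.dumps({"query": q}, sort_keys=True)


def requested_fields(q):
    """List the properties whose values the query asks to have filled in."""
    return sorted(k for k, v in q[0].items() if v is None or v == [])


if __name__ == "__main__":
    print(to_envelope(query))
    print("asking for:", requested_fields(query))
```

Because the query and the result share the same JSON shape, the response can be read with the same code that built the request.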
Click to read the complete article on the PayPal X Developer Network including discussion of Infochimps and Windows Azure Marketplace DataMarket.