In the first installment of this series, I introduced the concept of data markets. I outlined the types of markets and their main features, and then highlighted four of the major data market players.
In this article, I’ll show you how to go about choosing markets to use and then extracting data from them.
Comparing data market features
As a reminder, here are some of the desirable general purpose data market features I discussed in the previous article:
- A free level of developer access
- A variety of data spanning a wide range of topics
- Several different methods to access the data, including at least data dumps (think CSV) and a web API (see the next point for preferred mechanisms)
- A RESTful API that returns JSON; a YQL binding (click here to learn more about YQL and how to use it) is even better
- General purpose client libraries in your language of choice (for me, Python or Java)
Given my criteria above, the question I need to answer is: which of the four markets that we looked at in the first installment of this series best meets my needs?
Here’s a detailed breakdown of the markets addressing each of the desirable features:
| Data Market (alphabetical) | Free Access | Topics and Datasets (as of this writing) | Data Dumps | Web API | Libraries and SDKs |
|---|---|---|---|---|---|
| Factual | Free “as is” API & SDK access, data dumps for developers | 597,620 datasets spanning a wide range of topics | CSV (free, premium datasets available upon request; search Factual for “downloadable” data here) | RESTful with JSON responses (read the Factual documentation including REST API docs) | JavaScript+HTML for web apps, iPhone SDK, Android SDK (under development), FactualR for R developers and researchers, ruby-factual gem, and Net-HTTP-Factual for Perl |
| Freebase | Most of Freebase’s data is available for free under the Creative Commons Attribution license | Millions of entities across hundreds of types and a variety of domains (click to see the schema and search the entity graph) | Full dumps in TSV format (click to access recent) | RESTful with JSON responses (includes MQL queries and web API) | A number of different language libraries (eleven as of this writing including Python and Java) |
| Infochimps | Free “Baboon” level plan (100,000 API calls/month, 2,000 calls/hour burst, attribution required) | 12,995 datasets spanning a wide range of tagged topics including “bigdata” | Varies for each data set; mostly TSV, CSV, or YAML | RESTful with JSON responses for some datasets (read API documentation); YQL bindings available for a small subset of Twitter-related calls (influence, trstrank, and wordbag) | Ruby, Python, PHP, and command line; Python libs for Twitter dataset operations include infochimpy and python-infochimps |
| Windows Azure Marketplace DataMarket | Some free datasets available, but no easy way to search for them | 86 datasets across a variety of categories | No documentation found including no mention in the FAQ | Uses Microsoft backed OData which provides a consistent API across datasets | Focused around supporting Microsoft-specific technologies (C#, Windows Phone 7, Silverlight, etc.) |
Eliminating data markets that don’t meet our needs
Time to start eliminating market candidates. One of the first things you may notice above is that the Microsoft market contains far fewer datasets than the other data markets we’re considering. This may be because of its relatively young age, as it was only launched in late 2010. Nonetheless, fewer datasets means fewer interesting things you can do with the market. This would be a deal-breaker for me by itself, but it also fails to provide any libraries or examples for non-Microsoft programming languages and tools such as Python. Therefore I’m removing Windows Azure Marketplace DataMarket from further consideration.
The next factor I’m considering is how much structure is imposed by the market on its datasets. I want the freedom to draw my own conclusions from the data rather than being forced into a market’s predetermined taxonomy, hierarchy, or categorization of its contents. Since Freebase is built from the ground up with a heavily structured entity-oriented approach, I’m also eliminating it as an option. Keep in mind that if you want to operate on entities, however, Freebase might be a good choice.
This leaves me with needing to choose between Factual and Infochimps. Both offer a significant amount of free API and data dump options for developers, both have a wide range of datasets available, and both provide a RESTful web API with JSON server responses. It’s possible that both data markets have something I’d like to use, and that’s ok if so. Or else perhaps I’ll decide to stick with just one for my particular application. Let’s explore the application to see what data is available in each market.
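The elimination steps in this section can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool from the article: the dataset counts come from the table above, while the `MIN_DATASETS` cut-off, the field names, and the `reasons_to_eliminate` helper are assumptions made up for the example. Freebase reports entities rather than a dataset count, so its count is left as `None` and is not judged on size.

```python
# Hypothetical cut-off for "too few datasets"; the article does not name a number.
MIN_DATASETS = 1000

# Dataset counts are taken from the comparison table; the flags summarize the prose.
markets = [
    {"name": "Factual", "datasets": 597620,
     "microsoft_only_tooling": False, "entity_oriented": False},
    {"name": "Freebase", "datasets": None,  # table reports entities, not datasets
     "microsoft_only_tooling": False, "entity_oriented": True},
    {"name": "Infochimps", "datasets": 12995,
     "microsoft_only_tooling": False, "entity_oriented": False},
    {"name": "Windows Azure Marketplace DataMarket", "datasets": 86,
     "microsoft_only_tooling": True, "entity_oriented": False},
]


def reasons_to_eliminate(market):
    """Return the list of reasons a market fails the criteria (empty if it passes)."""
    reasons = []
    if market["datasets"] is not None and market["datasets"] < MIN_DATASETS:
        reasons.append("too few datasets")
    if market["microsoft_only_tooling"]:
        reasons.append("no non-Microsoft libraries or examples")
    if market["entity_oriented"]:
        reasons.append("imposes an entity-oriented structure")
    return reasons


shortlist = [m["name"] for m in markets if not reasons_to_eliminate(m)]

for m in markets:
    print(m["name"], "->", reasons_to_eliminate(m) or "keep")
```

Writing the criteria down this way makes it easy to change a single rule (for example, dropping the structure requirement if you do want to work with entities) and see how the shortlist changes.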
Finding data for the problem at hand
At this point it’s useful to look at the needs of our application and use that to search each of the markets for data that may be useful. Obviously your application dictates your dataset choices, so any given market may or may not meet your needs for a particular application. In other words, Your Mileage May Vary.
Given that disclaimer, we do need to pick some sort of example application to make the rest of this article specific enough to be illustrative.
For this particular article’s purposes, I am interested in building an application that uses Twitter user information. I’d like to have access to general purpose user information, i.e. a large dataset across all Twitter users. I’d like to be able to query that user information to figure out important things about users, especially their influence on other Twitter users.
To locate potential datasets of interest, I searched Factual for datasets with “Twitter” in their table name, author, or description across “All Topics” and sorted by “Relevance” (click here to run the search yourself):
As pictured above my search returned a number of datasets including an alphabetical listing of Twitter services and applications, a short table of most-followed network and cable journalists on Twitter, and various other things. Unfortunately it does not return any sort of generalized Twitter user information. There does not appear to be a dataset covering that sort of information in Factual, so let’s turn to Infochimps and see if we can find what we’re looking for.
Performing a similar search of Infochimps (click here to run the search yourself), we get:
Note that Infochimps also uses a system of tags to help you find datasets with certain qualities. You can see a complete list of Infochimps tags here, or click here to see datasets tagged “twitter”.
Reviewing the results from the Infochimps “Twitter” search, you will see a number of free datasets. You’ll also see some datasets that cost money, some that are located offsite (not in the Infochimps system), and some that provide an Infochimps-supported API. We’re most interested in those with an API for this article. Infochimps has tagged API-enabled datasets with the tag “awesomeapi” (click to search):
Let’s search for datasets matching both our “Twitter” and “awesomeapi” interest (click here):
Since we want to get a measure of Twitter influence via an API, let’s add one more thing to our search, “influence”. Click here to execute the search and see the narrowed down results below:
I’m going to use the influence metrics dataset; more on how to access its data below.
Before we move on, however, there’s one more tag you might want to note: “bigdata” denotes large datasets of the kind we discussed in the first article in this series. You can combine awesomeapi+bigdata to get a list of the corresponding big datasets with an API.
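Combining tags amounts to keeping only the datasets that carry every tag you ask for. Here is a minimal sketch of that idea in Python. The catalog entries are made-up placeholders rather than real Infochimps listings, and `with_all_tags` is a hypothetical helper, not part of any Infochimps library.

```python
# Placeholder catalog: titles and tag assignments are invented for illustration.
catalog = [
    {"title": "Example Twitter influence API", "tags": {"twitter", "awesomeapi", "influence"}},
    {"title": "Example Twitter word counts API", "tags": {"twitter", "awesomeapi", "bigdata"}},
    {"title": "Example census download", "tags": {"bigdata"}},
]


def with_all_tags(datasets, required_tags):
    """Keep only the datasets whose tag set contains every required tag."""
    required = set(required_tags)
    return [d for d in datasets if required <= d["tags"]]


api_and_big = with_all_tags(catalog, ["awesomeapi", "bigdata"])
twitter_influence = with_all_tags(catalog, ["twitter", "awesomeapi", "influence"])

print([d["title"] for d in api_and_big])
print([d["title"] for d in twitter_influence])
```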
Accessing Twitter influence data
Thankfully after all the work to locate the data, actually accessing it is relatively easy. The exact process varies from one data market to another, but you can draw some general conclusions from how Infochimps does it in the particular example we’re using here.
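To give a feel for the general pattern (build a URL with your query parameters and API key, request it, decode the JSON that comes back), here is a minimal Python sketch. Everything specific in it is an assumption: the base URL is a placeholder, the parameter names `screen_name` and `apikey` and the `influence` field are illustrative, and the response body is a made-up sample so the example runs without a network call. Check the dataset's own page for the real endpoint and response format.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint; substitute the URL documented on the dataset's page.
BASE_URL = "https://api.example.com/twitter/influence.json"


def build_request_url(base_url, screen_name, api_key):
    """Append URL-encoded query parameters (names are assumptions) to the endpoint."""
    query = urlencode({"screen_name": screen_name, "apikey": api_key})
    return f"{base_url}?{query}"


def parse_metrics(body):
    """Decode a JSON response body into a Python dictionary."""
    return json.loads(body)


# Made-up response body standing in for what a server might return.
sample_body = '{"screen_name": "example_user", "influence": 0.5}'

url = build_request_url(BASE_URL, "example_user", "YOUR_API_KEY")
metrics = parse_metrics(sample_body)

print(url)
print(metrics["influence"])
```

In a real application you would fetch `url` with an HTTP client such as `urllib.request.urlopen` and pass the response body to `parse_metrics`.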
Click here to read the complete article on the PayPal X Developer Network including the details of accessing Twitter influence metrics from the dataset’s page, via REST, and via YQL. A Python example is provided. You may also wish to read a related blog post concerning what I learned from writing this article and a YQL gotcha to avoid.
I just submitted the second article in my current PayPal X DevZone series on data markets (read the first installment here) and want to share a couple of things I learned as a sneak peek of sorts. I also want to call out a potential YQL gotcha I discovered developing this second article.
As a part of the article, I put together a table summarizing the data market features that matter to me. What I learned from that exercise: There is a lot of variability in supported programming languages from market to market.
If you’re considering different markets for your data needs, you should investigate their available libraries, including third party packages, up front. That way you won’t get any surprises after you’ve already committed to a particular dataset (even worse if you had to pay something for that data). For example, I want to use markets that support Python-based development, and not all of the markets I investigated do. Whatever your language(s) of choice, I would encourage you to read the article once it’s published for more details.
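One lightweight way to do that up-front check is to write down what each market says it supports and query it for your language. The sketch below uses only the languages and SDKs named in the article's comparison table. The Freebase entry lists just the two of its eleven libraries that the table names, and `markets_supporting` is a hypothetical helper written for this example.

```python
# Languages and SDKs as named in the article's comparison table.
supported = {
    "Factual": {"JavaScript", "iPhone SDK", "Android SDK", "R", "Ruby", "Perl"},
    "Freebase": {"Python", "Java"},  # only the two named out of eleven
    "Infochimps": {"Ruby", "Python", "PHP", "command line"},
    "Windows Azure Marketplace DataMarket": {"C#", "Windows Phone 7", "Silverlight"},
}


def markets_supporting(language, table):
    """Return the markets, in alphabetical order, that list the given language."""
    return sorted(name for name, langs in table.items() if language in langs)


python_markets = markets_supporting("Python", supported)
print(python_markets)
```

Remember that this only reflects officially listed libraries; third party packages may fill some of the gaps, which is why they are worth investigating too.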
Another thing that jumped out at me: It is critically important that a market provides a good search interface to help locate pertinent datasets. Some of the markets I investigate in my article do, others do not. I’ll let you draw your own conclusions after you read the piece, but suffice it to say that I’m partial to the ones that make surfacing datasets simple and quick.
Now on to the YQL gotcha: As part of my article, I developed a simple example that pulls Twitter user influence metrics out of an Infochimps dataset. It does this using the Infochimps provided YQL influence data table. My original YQL statement was naively:
This returns the expected influence metrics when executed in the YQL console:

When you run the same query embedded in Python code using the Python-YQL package, however, you get a “No definition found for Table” error (pictured in the ActivePython Community Edition shell):
Click here for the solution to the YQL problem available in the complete post on the PayPal X Developer Network.
“21 Recipes for Mining Twitter” by Matthew Russell provides readers with a problem-oriented crash course in using Python and freely available third party Python packages to mine social data from Twitter. It assumes familiarity with Python and makes quick progress through extracting and using different types of user and streaming data available from the Twitter API via the twitter package in particular.
I like the general approach of calling out problems to be solved, then addressing them one by one with a “recipe” for each. Some might complain that this approach results in disjointedness from one recipe to the next, but in fact that’s a feature, not a bug. “21 Recipes for Mining Twitter” is actually a spin-off of Russell’s more in depth “Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites” (also an O’Reilly book). The latter shows many more examples, and not just for mining Twitter, but also for harvesting and analyzing social data from many other services and APIs as well. So if you need down-and-dirty recipes for Twitter alone, get this book, and if you need more blanks filled in for Twitter mining and/or information on accessing data from other social services, get the other book. Or heck, get them both!
This book is recommended for: Anyone already familiar with programming who is looking to solve specific problems using Twitter data. It’s a slam dunk for Python programmers.
Recommended with reservations for: Non-programmers interested in social data analysis, with the caveat that they will probably need to spend some time working through the Python.org tutorial or getting up to speed with Python elsewhere before they can make great progress with this book.
Disclaimer: I know the author and have worked with him in various capacities on other projects, but not this book. Even if I didn’t know him, however, I’d still love and make use of this book.
PayPal X Platform
- Free @PayPalX #Android webinar http://bit.ly/hnLfSK (register now for March 30th event) #
- It seems @PayPalX Express Checkout bumps merchant sales up 18% on average http://tcrn.ch/h0Wror via @TechCrunch #
- "Selling Digital Goods in Data Markets: Part 1, Innovators" http://bit.ly/eLy8Ob is my latest @PayPalX @OReillyMedia DevZone article #
- News and links for recently announced @Apigee to Go embeddable consoles http://bit.ly/fJ91Zn including one for @PayPalX Adaptive Payments #
- Following my @PayPalX Apigee-to-Go post http://bit.ly/fJ91Zn @Apigee posted a video to help you get up and running: http://bit.ly/gyKvQZ #
Big data
- Al Jazeera "Region in Turmoil" Twitter dashboard http://bit.ly/f3kxVH via @ptwobrussell #
- Good news for data geeks, bad news for everyone else http://rww.to/hdxMwK via @rww #bigdata #
- New API calls from @infochimps http://rww.to/hnQs9m includes some examples to get you started; via @rww #sxsw #bigdata #
- User info @infochimps http://www.infochimps.com/me (includes Query API key and dashboard link) #
Wireless and mobility
- The Register's article on the collapse of Nokia's technologies http://bit.ly/ggrFDK is the best I've read on why NOK failed #
- Does Starbucks' card mobile program work precisely because they control five key elements described here? http://bit.ly/ee5awH via @mFoundry #
APIs and development
- Facebook Python client lib options http://bit.ly/fMin5S via StackOverflow #
- Using #YQL to obscure your OAuth keys http://bit.ly/98fd6p via @Nectarineimp #
- New @dreasoning Synthesys demo video http://bit.ly/hQ0GjC (entity oriented analytics) #
- Testing code is hard, but data is harder http://bit.ly/erFom9 from the @factual Dev Blog #
- There are now more than 3,000 public APIs which @ProgrammableWeb has used to call out trends http://bit.ly/edXEDO #
- Using HTML 5 to transform WordPress' TwentyTen theme http://bit.ly/eAid6y via @smashingmag #
- ActivePython AMI http://bit.ly/dFT7G2 includes python plus MySQL and SQLite, Apache, Django, Memcached, Nginx, and more for @awscloud apps #
- Visit @LinkedIn Today now at: http://linkd.in/hFe3Yj #
- TechCrunch coverage of @LinkedIn Today launch http://tcrn.ch/edx0Ap #
- Add @GitHub to your @LinkedIn profile http://bit.ly/htzD7x with more discussion from @nytimes http://nyti.ms/f2rx3Q #
Personal things
- Researching pulmonary embolism http://bit.ly/fkY1ov for a family member; thank God for warfarin http://bit.ly/g5DMKQ #
- S#*! my wife says: "Toby Keith is the male Cher" (then sings "God love her" in Jack 2000-imitating-Cher voice) #
Running
- Ran 6.74 miles in 1 hour and 6 mins and 46 secs and felt great. 3-to-1 run, finishing final quarter at 8:16 p… http://dailymile.com/e/O8Se #
- Ran 3.11 miles in 26 mins and felt great. Fast 5k. I "raced" my St Pat's today so I can enjoy running Saturday with my wife. http://dailymile.com/e/OOTJ #
- Ran 3.28 miles in 44 mins and felt great. Baby stroller and eldest on her scooter to top of hill. http://dailymile.com/e/OTNu #
- Ran 3.11 miles in 26 mins and felt great. Mile split paces 8:45, 8:17, 8:18. http://dailymile.com/e/Obzl #

This month’s newsletter discusses the Post Oak Challenge, among other things.
Just in time for you to prepare to enter the PayPal X Developer Challenge for Android, PayPal has announced they are offering a free “Payments on the Android Platform” webinar on Wednesday, March 30th.
The PayPal X webinars page describes the event thusly:
Android developers — Are you looking for ways to monetize your app? PayPal offers several solutions for enabling payments for applications on the Android platform. In this webinar, we will show you how quick and easy it is to integrate our mobile payment solutions. We’ll highlight how the PayPal Mobile Payment Libraries enable in-app payments for a variety of use cases. You will also learn how PayPal Mobile Express Checkout provides a mobile browser optimized user interface for online stores.
For more information on the speakers and how to register for the webinar, click here to read the complete post on the PayPal X Developer Network.

PayPal X Platform
- "PayPal, Apple, and Google fight for your subscriptions" http://bit.ly/icYiRb via the @PayPalx DevZone #
- February developer highlights http://bit.ly/fPFQ7n via @PayPalX DevZone #
- Win $25,000 for your Android app http://bit.ly/e24ca8 via the @PayPalX Developer Challenge (click link for details) #
Big data
- Another nice #Strata conference write-up http://bit.ly/fyx6rt #BigData #
- Cool view into the @dailymile workout data firehose http://bit.ly/el530n (includes link to PubSubHubbub & Google App Engine code example) #
- #RStudio open source IDE for R http://t.co/VvPOVfW discussed at http://t.co/hnaYk6w via @rww #
Wireless and mobility
- Love it! Apple Event Bingo Card http://tcrn.ch/eMMqrT by @johnbiggs #
APIs and development
- Lanyrd is doing interesting things http://bit.ly/fa1vnH for their #SXSW coverage http://bit.ly/i5jwzS #
- I need to do a little research into Open Web Analytics http://bit.ly/dEoA7u features vis-a-vis Google Analytics #
- Wow, I was blown away by the length of this list of languages compilable to JavaScript http://bit.ly/gVkN98 via @pengwynn #
- ChatZilla FAQ http://bit.ly/dEAb8M #
- Selling your by-products (we *all* have them), creating embeddable media, and other tips for product success http://bit.ly/fyQmzi #
- Nice interview of my friend and @SocialWebMining book author Matthew Russell @ptwobrussell on social data http://oreil.ly/gB9hdn #
Personal things
- I'm horribly behind on cross-tweeting my wife's recent @GeekMomBlog, so here goes… #
- From @jennday14 on @GeekMomBlog: PC Cast http://bit.ly/fjsMPN and Stormtrooper sandwiches http://bit.ly/frEFZK #
- From @jennday14 on @GeekMomBlog: Super rolling pin http://bit.ly/dPEhbK and Star Wars cookies http://bit.ly/e14Lcb #
- From @jennday14 on @GeekMomBlog: Paper craft toys http://bit.ly/i24hLr and homemade truffles http://bit.ly/eezDUp #
- From @jennday14 on @GeekMomBlog: Pancake hearts http://bit.ly/erU2Bm and hearts found in nature http://bit.ly/ijIoFw #
- From @jennday14 on @GeekMomBlog: Geek cooking http://bit.ly/gD0eJE and fun without electricity http://bit.ly/i4jfKk #
- From @jennday14 on @GeekMomBlog: Consignment sales http://bit.ly/eoDs4F and "Baby Hears!" book review http://bit.ly/fqaE8c #
Running
- Post Oak Half Marathon: Chip time 2hr 7mins 28secs. Hilly! And you gotta love the wind. http://bit.ly/huaqde #
- A request for the @Garmin team: I'd love it if I could schedule my 405cx to acquire a satellite a few mins before race time. #
- Perfect: Info on training for Halfs + Duathlons http://bit.ly/dUHpfY as I'm starting Tulsa Duathlon http://bit.ly/hsezds training #
- Ran 0.89 miles in 22 mins. Evening stroll with the family as my legs recuperate from Post Oak. Tomorrow, we run! http://dailymile.com/e/Nkid #
- Tips on training for a 50k ultramarathon http://bit.ly/i8jMOy from @runnersworld #
- More ultrarunning tips and resources http://bit.ly/eDhJo0 via Kevin Sayers #
- Ran 4.19 miles in 39 mins and felt great. Tempo run, split paces 9:49, 9:39, 9:16 (peak), then 9:24 (easing). http://dailymile.com/e/NvHR #
- Ran 3.1 miles in 33 mins and felt great. Family run/walk pushing the stroller. http://dailymile.com/e/O3pg #
- Recipe: Healthy mocha-cinnamon pudding http://bit.ly/h0UOeD via @runnersworld #
It’s been a while since I wrote about the second PayPal X Developer Challenge. In the months since, we’ve seen mobile applications in general and Android apps in particular skyrocket in terms of consumer and developer interest.
So much so that it should come as no surprise that when PayPal announced their third Developer Challenge this week, it was aimed right at the mobile development sweet spot: Android apps.
There’s some good money for the top three finishers: A grand prize of $25,000 USD, second prize of $15,000, and third prize of $10,000. Winners will also receive marketing and PR support from PayPal to help draw eyeballs and fingertips to their apps. Plus as PayPal points out in The BaldGeek’s announcement of the contest, there’s opportunity for your app to make money on its own since by definition it will have PayPal-based secure payments baked into it from the get go.
PayPal is looking for fresh mobile ideas integrating secure payments. From the challenge site:
We’re looking for something new, something surprising, something with business potential, something that integrates PayPal payments.
The winning application will make us say, “Wow!” Also accepted: “Cool!”, “Awesome!”, “I’m totally downloading this”, or “Check this out”.
Click here to read the full post on the PayPal X Developer Network including information on the key deadlines, contest terms, and judging guidelines.









