
Divining DevZone Insight from Filtered Feeds and Deep Pages – Part 1, The Harvest

February 1, 2011

I’ve been contributing to the PayPal X Developer Network for a little over six months now, and it’s been a lot of fun. I love exploring the PayPal X Platform with you, and in particular focusing on the parts of the platform you find intriguing and most useful.

Near the end of last year, I started wondering which specific topics you were most interested in. How was our DevZone content doing with Developer Network community members? What were the topics you as a group were saying you most wanted to read about? And of those topics, which ones were members actually reading, i.e. what was getting the hits?

I decided to do some analysis of my own blog posts and articles. I collected hit statistics for my 2010 content and sorted them by date, by hits, and by hits broken out by type (article or blog post). I used that data to write up a summary of my findings. I also surveyed readers on their preferred application development language(s) (click here for the results) and on which third-party APIs matter the most to them. I am using all of this data to work out what to write for the DevZone in the coming months. One major area of focus has clearly emerged: mobile + social + local APIs, technology, and application development, which I’m now collectively referring to and tagging as “MoSoLo”.

As I worked through my own content from 2010, I also started thinking about how I might generalize my analysis to include all of the DevZone content from every contributor. How would I collect the data? What could be automated? And most importantly, what insight could be gained from the exercise?

This article is the result of my investigations into gathering up the pertinent DevZone content. A follow-on article will explore the data to summarize each logical topic and highlight what was learned from the analysis.

What shall we harvest?

I wanted to analyze all of the DevZone blog posts, articles, and book excerpts from the “Blog” and “Documents” RSS feeds. If for some reason those feeds wouldn’t provide me with everything I needed, I decided I would then mine the HTML page versions, in effect the “deep” pages, linked to from the DevZone Blog and Documents pages.

I gathered together a list of all of the content locations. In addition to the DevZone links, I also included blog links for each of the four regular DevZone contributors (Matthew Russell, Travis Robertson, Ethan Winograd, and myself). These individual blog links contain posts made by each of us before the new, all-in-one DevZone feeds were created.

Here then is a table of the HTML pages for each of the six areas to be harvested as well as their RSS feed links.

| Source content homepage | Feed location | Number of items as of 26 January 2011 |
|---|---|---|
| DevZone blog posts | RSS | 171 posts |
| DevZone articles and book excerpts | RSS | 63 articles and excerpts |
| Matthew Russell’s blog posts | RSS | 13 posts from before the cutover to the common DevZone blog feed |
| Travis Robertson’s posts | RSS | 11 posts pre-cutover |
| Ethan Winograd’s posts | RSS | 3 posts pre-cutover |
| My posts | RSS | 14 posts pre-cutover |
| TOTAL | | 275 items |

Note that connections to the Developer Network server, and thus to both the pages and the feeds, are secured via HTTPS. This will be important later.

The plan

After gathering links to the content, the next thing I needed to do was figure out how to collect the data together into an analyzable form.

My first inclination was to sort out a mechanism that would allow me to plug in the RSS feeds for all six content sources, pull down all the items from each, sort them by their publication date, and then access and manipulate the resulting stream for analysis. I would pick the most promising tool I could find and then see if it could be used to successfully access and analyze the data. I had very limited time to sort out an automated mechanism, so if my top choice or two didn’t work out straight away, I would fall back to a spreadsheet as my backup plan. Falling back was not desirable from an efficiency and automation standpoint, but I knew it would work if everything else came up short.
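To make that first inclination concrete, here is a minimal Python sketch of the idea: parse several RSS 2.0 documents, pool their items, and sort the pool by publication date. It is only an illustration of the approach, not the mechanism described in the full article. The function names, the two sample feeds, their titles, and the example.com links are all made up, and the sketch assumes every item carries a `pubDate` with an explicit UTC offset.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(xml_text, source):
    """Return a list of item dicts from one RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "source": source,
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # Assumption: pubDate is present and in RFC 822 form with an offset.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def merge_feeds(feeds):
    """feeds maps a source label to RSS XML text; returns items newest first."""
    merged = []
    for source, xml_text in feeds.items():
        merged.extend(parse_feed(xml_text, source))
    return sorted(merged, key=lambda it: it["published"], reverse=True)


# Hypothetical sample feeds, standing in for the real DevZone feeds.
SAMPLE_BLOG = """<rss version="2.0"><channel><title>Blog</title>
<item><title>Post A</title><link>https://example.com/a</link>
<pubDate>Mon, 03 Jan 2011 10:00:00 +0000</pubDate></item>
<item><title>Post B</title><link>https://example.com/b</link>
<pubDate>Wed, 15 Dec 2010 09:30:00 +0000</pubDate></item>
</channel></rss>"""

SAMPLE_DOCS = """<rss version="2.0"><channel><title>Documents</title>
<item><title>Article C</title><link>https://example.com/c</link>
<pubDate>Tue, 21 Dec 2010 14:00:00 +0000</pubDate></item>
</channel></rss>"""

merged = merge_feeds({"blog": SAMPLE_BLOG, "documents": SAMPLE_DOCS})
for it in merged:
    print(it["published"].date(), it["source"], it["title"])
```

Keeping the source label on each item means the merged stream can still be broken out by feed later, which is the kind of slicing the analysis needs.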

My search turned up several dead ends and then a couple of promising automation possibilities. My plan of attack became:

  1. Use Yahoo! Pipes to combine the six RSS feeds into one and then operate on their contents as needed.
  2. If Pipes wasn’t able to handle the task by itself, try using Yahoo! Query Language (YQL).
  3. If Pipes and YQL failed, manually enter data from the article, book excerpt, and blog post pages for each item into a Google Docs spreadsheet for sorting and analysis (this is the same procedure I’d used previously for my own content).
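The ordered plan above amounts to a fallback chain: try the preferred option, and move to the next one only if it fails or comes back empty. Here is a small Python sketch of that pattern. The three stand-in functions and their simulated outcomes are purely hypothetical placeholders; they do not call Pipes or YQL, and they say nothing about which option actually worked.

```python
def harvest_with_fallback(strategies):
    """Try each (name, callable) in order.

    Returns (name, items) from the first strategy that yields any items.
    Raises RuntimeError if every strategy fails or returns nothing.
    """
    errors = []
    for name, strategy in strategies:
        try:
            items = strategy()
        except Exception as exc:  # a failed strategy just moves us down the list
            errors.append(f"{name}: {exc}")
            continue
        if items:
            return name, items
        errors.append(f"{name}: returned no items")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


# Hypothetical stand-ins with simulated outcomes, for illustration only.
def via_pipes():
    raise ConnectionError("could not fetch feed")


def via_yql():
    return []


def via_spreadsheet():
    return [{"title": "Example item"}]


chosen, items = harvest_with_fallback([
    ("pipes", via_pipes),
    ("yql", via_yql),
    ("spreadsheet", via_spreadsheet),
])
print(chosen, len(items))
```

Collecting the error messages along the way keeps a record of why each earlier option was abandoned, which is handy when time is short and the fallback ends up doing the work.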

Want to learn how to implement this? Read the full article on the PayPal X Developer Network (click here).

Sneak peek of the implementation discussed in the full article on the PayPal X Developer Network

