
Divining DevZone Insight from Filtered Feeds and Deep Pages: Part 2, The Analysis

February 16, 2011

In the previous article in this series we looked at the DevZone developer RSS feeds from which we wanted to harvest data. We discussed our initial approach and then dove into implementing it using Yahoo Pipes and YQL. Please read that post now (click here to access it) if you haven’t already.

We encountered some problems along the way, and promised a solution in this follow-on article. Ready? Here we go!

The RSS property numItems is a good thing

By the end of the previous article, we had created a Yahoo Pipe which used YQL to fetch items from all six of the pertinent DevZone RSS feeds. We discovered that the feed items returned were being severely limited in number by the RSS server. We needed a way to get around that limitation to get as many of the items as possible (preferably, all of them).

After a little searching, I turned up the answer: In order to get more results, I needed to ask the server nicely using the numItems query parameter.

The numItems parameter indicates to the RSS server that I’d like to receive the number of feed items indicated, if possible. If the server is configured to allow more items when asked, I should be able to get more than the default.

Keeping in mind that the DevZone blog feed contained upwards of 170 posts and would grow considerably over time, that the documents feed had more than sixty items and was growing at a slower rate, and that the pre-cutover individual feeds each contained fewer than twenty items and would not grow any further, I constructed the following YQL request to attempt to get every item in every feed:

select * from rss where url in ('https://www.x.com/people/ptwobrussell/blog/feeds/posts?numItems=20', 'https://www.x.com/people/billday/blog/feeds/posts?numItems=20', 'https://www.x.com/people/travis/blog/feeds/posts?numItems=20', 'https://www.x.com/blogs/Ethan/feeds/posts?numItems=20', 'https://www.x.com/community/feeds/blogs?community=2133&numItems=1000', 'https://www.x.com/community/feeds/documents?community=2133&numItems=200')
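If you would rather build that query in code than by hand, here is a minimal Python 3 sketch. The helper names (with_num_items, build_rss_query) are my own for illustration and are not part of the harvest program; the feed URLs and numItems values are the ones from the query above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# (feed URL, requested numItems) pairs taken from the query above
FEEDS = [
    ("https://www.x.com/people/ptwobrussell/blog/feeds/posts", 20),
    ("https://www.x.com/people/billday/blog/feeds/posts", 20),
    ("https://www.x.com/people/travis/blog/feeds/posts", 20),
    ("https://www.x.com/blogs/Ethan/feeds/posts", 20),
    ("https://www.x.com/community/feeds/blogs?community=2133", 1000),
    ("https://www.x.com/community/feeds/documents?community=2133", 200),
]


def with_num_items(url, num_items):
    """Return url with a numItems parameter added after any existing query."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query)
    query.append(("numItems", str(num_items)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_rss_query(feeds):
    """Build a YQL select over the rss table for every feed in feeds."""
    urls = ", ".join("'%s'" % with_num_items(url, n) for url, n in feeds)
    return "select * from rss where url in (%s)" % urls


query = build_rss_query(FEEDS)
print(query)
```

Keeping the feed list as data makes it easy to adjust a single numItems value without retyping the whole statement.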

Running the pipe gave me all of the blog posts from all five blog feeds (yeah!) but none of the article or book excerpt items from the DevZone “Documents” feed (boo!). Now what?

I’ll spare you the details, but suffice it to say that after experimenting with the numItems value for the documents request I found that I could set it as high as ‘38’ and receive that specified number of items back. If I set it to ‘39’ or any higher, I got nothing back from that feed. Not nice. Given the need to move on to analyzing the data, however, I decided to roll with all the data I could get and then add back in the twenty-some missing document items later.
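I found that limit by trial and error, but the same hunt could be automated. The Python 3 sketch below is not what I actually ran: it assumes the feed behaves as a simple threshold (every value up to the limit returns items, every value above it returns nothing), and fake_documents_feed is a stand-in that mimics the behaviour described above rather than a real feed request.

```python
def largest_working_num_items(count_items, low, high):
    """Binary-search the largest n in [low, high] where count_items(n) > 0.

    Assumes a threshold: values up to some limit return items and every
    value above it returns none. Returns None if even `low` returns nothing.
    """
    if count_items(low) == 0:
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if count_items(mid) > 0:
            low = mid       # mid works, so the limit is at least mid
        else:
            high = mid - 1  # mid fails, so the limit is below mid
    return low


def fake_documents_feed(num_items):
    """Hypothetical stand-in for a feed request: returns the item count."""
    return num_items if num_items <= 38 else 0


limit = largest_working_num_items(fake_documents_feed, 1, 200)
print(limit)
```

In real use, count_items would issue the YQL request with the given numItems and return the number of rows that came back.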

So to summarize, at this point I had a Yahoo Pipe that would return most of the desired feed items.

Here is the information for that pipe for you to use or clone as you see fit:

Pipe | Feed location | Lightweight data
All six feeds, numItems set, date sorted | RSS | JSON

Developing the ultimate solution

About the time I had the pipe above ready to use, I began noticing some inconsistencies in the data it returned. Every once in a while, a “Refresh” in the Pipes debugger would fail to load any items, or would load far fewer than expected. As I was contemplating what to do about that issue, I encountered the final piece of the ultimate solution puzzle: Python YQL, a library for making YQL queries in Python programs.

I had already been bumping into the edges of the Pipes model a bit. Manipulating RSS streams was straightforward, but what if I wanted to save some of the data out for analysis in other tools or archival? And although Pipes does provide a Loop module, some of the nested operations I envisioned during my analysis work would definitely be difficult, if not impossible, if I stayed strictly within the Pipes box.

On the other hand, Python YQL gave me the option of plopping the YQL select statement I’d already developed in Pipes directly into “real code”. Once I had the data flowing into my Python program, I could do just about anything I wanted with it. File I/O, filtering, etc. would be a cake walk in Python. I was sold!

Here then is the plan I implemented to collect, store, organize, analyze, and share the DevZone feed data:

  1. I would create two Python programs, one to conduct the harvest and one to perform analysis; this separation would let me harvest independently of analysis
  2. I would use the YQL developed previously in Pipes to collect the feed data into my Python harvest program, which would save out the portions I needed to a CSV file; this would be the input for the analysis program, and it would also allow me to perform additional analysis with a number of tools (Google Docs or any other tool supporting CSV)
  3. My analysis program would read in the harvested data CSV file along with a separate topic list CSV file; it would then filter the data against each of the DevZone topics, producing a topic-filtered CSV output file for each topic along with a topics statistics CSV file containing the key stats from the topics analysis
  4. I would add in the few missing documents’ data where needed myself (everything above was automated, but this part not so much)
  5. Once I had all of the feed items accounted for, I would create a bit.ly bundle for each topic (using the topic-filtered files) and include those bundle URLs in the topic statistics file
  6. Final step: Explore the topic filtered data and share what I learned in this article and beyond

My Python programs, devzone.harvest.py and devzone.analyze.py, are both available via github. Click here to access the repository and grab a copy of the source.

Python YQL could not be easier to use. We simply import the yql module, get access to a public (non-authenticated) YQL connection, then execute a YQL query against that connection. Here’s a snippet of code from my harvest program showing how easy it is to fetch the document feed data using a YQL select:

import yql
y = yql.Public()
articlequery = "select * from rss where url in ('https://www.x.com/community/feeds/documents?community=2133&numItems=38')"
articles = y.execute(articlequery)

Results are returned as a yql.YQLObj containing rows. Each row is a dictionary of key:value pairs for one RSS feed item.

Once I’ve fetched the data from this and the other feeds, I save it out into a devzone.harvest.csv file for use by the analysis program and other tools. Note that as I write each row out via a Python DictWriter named ‘csvwriter’, I add in a field I’ll use later to indicate if a given item is an article/book excerpt or a blog post. I also do some trimming on the date field to remove the unneeded day of the week and timezone information that was included in the RSS feed’s pubDate fields. Again, the Python code couldn’t be much simpler:

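for row in articles.rows:
    # Tag the item so later analysis can tell articles from blog posts
    row["articleOrBlog"] = "article"
    # Trim the leading day of the week and the trailing timezone from pubDate
    date = row["pubDate"]
    date = date[5:-4]
    row["pubDate"] = date
    csvwriter.writerow(row)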

and here’s a look at the first line of the output devzone.harvest.csv file (note the reserved but currently empty second to last field and that I’ve removed the article HTML content from the final field for brevity):


31 Jan 2011 18:31:26,article,PayPal and the Road to Adaptive Payments,https://www.x.com/docs/DOC-3191,,{content of article would be here}
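To try the write-out step without hitting the feeds, here is a self-contained Python 3 sketch of the same idea. It is an illustration rather than the harvest program itself: the field order is inferred from the sample line above, the sample row (including its weekday, timezone, and extra guid field) is made up, and the [5:-4] slice assumes a pubDate with a three-letter weekday and a three-letter timezone.

```python
import csv
import io

# Field order is an assumption inferred from the sample output line
FIELDNAMES = ["pubDate", "articleOrBlog", "title", "link", "hitCount", "description"]


def trim_pub_date(pub_date):
    """Drop the leading 'Mon, ' and the trailing ' GMT' from an RSS pubDate."""
    return pub_date[5:-4]


def write_items(rows, kind, out):
    """Write feed rows to out as CSV, tagging each with kind."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, restval="",
                            extrasaction="ignore")
    for row in rows:
        item = dict(row)  # copy so the caller's row is left untouched
        item["articleOrBlog"] = kind
        item["pubDate"] = trim_pub_date(item["pubDate"])
        writer.writerow(item)  # hitCount is absent, so restval leaves it empty


# Hypothetical sample row shaped like one RSS feed item
sample_rows = [{
    "title": "PayPal and the Road to Adaptive Payments",
    "link": "https://www.x.com/docs/DOC-3191",
    "pubDate": "Mon, 31 Jan 2011 18:31:26 GMT",
    "description": "{content of article would be here}",
    "guid": "extra-field-that-is-ignored",
}]

buffer = io.StringIO()
write_items(sample_rows, "article", buffer)
print(buffer.getvalue())
```

Setting restval='' is what produces the reserved empty field, and extrasaction='ignore' lets the writer skip any feed keys that are not in the field list.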

That’s about it for the interesting bits of the harvest program. You can see the complete source code listing for devzone.harvest.py by clicking here. I’ve tried to comment everything liberally to make it easy to follow along.

Now we’re ready to perform some analysis. Specifically, we want to perform the topic filtering described in the plan steps above. After devzone.analyze.py reads in the devzone.harvest.csv data from the harvest program and saves a copy of it minus the actual item content back out for use in other tools, it’s ready for its own critical bit, the topic filtering:

# Open the topic list (input) and the per-topic item counts file (output)
csvtopics = open("devzone.topics.csv", "rb")
topicreader = csv.reader(csvtopics, dialect='excel')
csvnumitems = open(devzonedir+"devzone.topics.items.csv", "wb")
numitemswriter = csv.writer(csvnumitems, dialect='excel')

for topic in topicreader:
    currenttopic = topic[0]
    # Build a filename-safe version of the topic name (".Net" becomes "dotNet")
    topicfile = (currenttopic.replace(' ', '')).replace('.', 'dot')
    csvcurrenttopic = open(devzonedir+"devzone.analysis.topic."+topicfile+".csv", "wb")
    topicwriter = csv.DictWriter(csvcurrenttopic, fieldnames=['pubDate', 'articleOrBlog', 'title', 'link', 'hitCount'], restval='', extrasaction='ignore', dialect='excel')
    # Rewind the harvested data so it can be scanned again for this topic
    csvinput.seek(0)
    items = 0
    for row in itemreader:
        # The topic text is used as the regular expression pattern
        if re.search(currenttopic,row['title']) or re.search(currenttopic,row['description']):
            topicwriter.writerow(row)
            items += 1
    # Record how many items matched this topic
    numitemswriter.writerow([currenttopic, items])
    print topicfile, "topic contains", items, "items"
    csvcurrenttopic.close()

Let’s walk through what that code does.

First, it opens up the devzone.topics.csv input file created previously. The topic list I used for this article is available in the github repository (click here). This file lists each of the major multipart blog and article series along with the significant technology and payments groupings that appear in the DevZone content. In effect, it specifies the content categories into which we’re going to slot the various feed items to build our content sitemap. Here are the first few rows of the topics file used for this article:

.Net
Adaptive Accounts
Adaptive Payments
Alternative Ways to Fund Your Project
analytics
Android
{...}

Note that I generated the current topic list by hand, refining it over several analysis passes as the topic categories became clear to me. I would like to explore automatically generating this list from the content itself in the future (see below for more on that).

For each topic, the analysis program works through each content item, checking to see if that item’s title or content contains the topic under consideration. A Python regular expression search (re.search) is used for this check. If the topic is mentioned in the item’s title or content, then that item is added to the topic-specific output file and the topic’s item count is incremented. At the end of each pass over the item rows, the analysis program writes the total number of items for the current topic to devzone.topics.items.csv, which is a key file for our later analysis (more below).
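One thing to be aware of with re.search is that the topic text is treated as a pattern, so the “.” in a topic like “.Net” matches any character. The Python 3 sketch below is a variation for illustration, not the code in devzone.analyze.py: count_topic_items is a hypothetical helper, the sample items are made up, and the literal flag uses re.escape to match topic names exactly.

```python
import re


def count_topic_items(topics, items, literal=True):
    """Count the items whose title or description mentions each topic.

    With literal=True the topic is escaped so it matches as plain text;
    with literal=False it is used as a regular expression pattern as-is.
    """
    counts = {}
    for topic in topics:
        pattern = re.compile(re.escape(topic) if literal else topic)
        counts[topic] = sum(
            1 for item in items
            if pattern.search(item["title"]) or pattern.search(item["description"])
        )
    return counts


# Made-up items for illustration only
items = [
    {"title": "Getting started with ASP.Net", "description": "A first look"},
    {"title": "Using NetBeans with PayPal", "description": "An IDE walkthrough"},
    {"title": "Adaptive Payments basics", "description": "Android and more"},
]
topics = [".Net", "Android"]

literal_counts = count_topic_items(topics, items)
regex_counts = count_topic_items(topics, items, literal=False)
print(literal_counts)
print(regex_counts)
```

With these samples the unescaped “.Net” pattern also picks up “ NetBeans”, because the dot matches the preceding space, while the escaped version counts only the item that really mentions “.Net”.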

With relatively little code, I’ve been able to do some pretty neat things. I’ve pulled down hundreds of blog posts, articles, and book excerpts from the PayPal server, sliced and diced them including adding some fields for my own use, performed topic filtering, and output several CSV data files for later analysis. The only thing I did beyond that was to manually add in the twenty-six (as of this writing) missing article items where needed. With that the data set is complete and it’s time to look at the results.

Click here to read the full article on the PayPal X Developer Network including the resulting table of DevZone topics and the analysis of the content.

Top 20 DevZone topics
