randomling | 100 Programming Projects 2: Fansearch

We’re all complex people, and different projects bring out different parts of us. This project is about my fannish interests, my love for the world of fandom and all the things that it involves, and it sort of brings out the librarian in me. (Who knew I had an inner librarian? Certainly not my poor books, scattered haphazardly around the place as they are.)

This is another big one, and I’m nervous to even write about it, tiny and nascent as it is. In fact, I wrote the bulk of this post over two months ago, and am only now getting around to finishing and posting it. But write about it I will.

The Preamble

I recently finished Computer Science 101 with Udacity. It was a great course, really useful to my overall understanding and my grounding in computer science, and the project that the course focused on was building a search engine. (I think that’s an interesting way to teach a course: pick a relevant project and use the project to teach the principles you want to get across. It worked pretty well, too.)

Of course, I don’t want to build another Google, and one of the suggestions for applying our new skills was that, if we wanted to build a search engine at all, we should pick a niche area to search.

Searching fandom is a hard problem, and as far as I know, no one has tried to do it. But thinking about it, even as a relative newcomer to programming, I think I can see a way.

The Project

So the idea behind this project is to build a search engine that finds fannish content. And, preferably, only fannish content – for a reasonably broad definition of “fannish”, anyway. I’d love to be able to type terms into a search engine and find the fannish content that matches those terms – without also coming up with all the non-fannish content that matches those search terms on the web. As time goes on, fannish content gets more and more visible on Google and other generic search engines, but I still think it would be good to have a search engine that only indexed pages of fannish interest.

Of course, to do that I have to decide not only how to work out which pages are of fannish interest, but how to tell my search engine to make those decisions for me. It would be impossible, even with a team of hundreds, to make those decisions by hand for every page on the web, so the job has to be done automatically.

And then of course I have to build a working search engine which will index, rank and list these pages.

This is not a small task...

The Challenges

So there are many parts to a project like this, and for now I’m choosing to focus on the hardest part – because it’s really at the core of the idea. For Fansearch to be more than just a generic search engine, I need something that can tell whether a piece of text is fannish or not.

This is where having a boyfriend who’s a professional comes in. My boyfriend took the Stanford AI class over the winter, and one of the things he learned to build in the course of that was a Bayesian Classifier. I don’t expect to be able to understand the whole thing yet, but I understand the bare bones. It’s designed to take a piece of text and, using some funky statistical malarkey, decide which of several boxes it belongs in. The really cool part about a Bayesian Classifier is that it learns more with each input you give it, and its decisions can be corrected, so the classification itself can be fine-tuned.
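To make the bare bones a bit more concrete for myself, here is a minimal sketch of the kind of thing I mean. It is not my boyfriend's code (I haven't seen that yet) and it is only one simple flavour of Bayesian classifier, a naive Bayes one with add-one smoothing. The class name, the method names, the "fannish" and "non-fannish" labels and the training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Tiny naive Bayes text classifier (illustrative sketch, not production code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Lowercase the text and keep runs of letters and apostrophes.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        """Learn from one more example; can be called at any time."""
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def correct(self, text, wrong_label, right_label):
        """Undo an earlier train() under the wrong label and redo it correctly."""
        self.word_counts[wrong_label].subtract(self.tokenize(text))
        self.doc_counts[wrong_label] -= 1
        self.train(text, right_label)

    def scores(self, text):
        """Return a log-probability score per label (higher means more likely)."""
        total_docs = sum(self.doc_counts.values())
        # The extra 1 is a bucket for words never seen in training.
        smoothing_size = len(self.vocabulary) + 1
        result = {}
        for label, n_docs in self.doc_counts.items():
            if n_docs <= 0:
                continue
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior for this label
            for word in self.tokenize(text):
                # Add-one smoothing so an unseen word never gives log(0).
                score += math.log((counts[word] + 1) / (total_words + smoothing_size))
            result[label] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        if not scores:
            raise ValueError("classifier has not been trained yet")
        return max(scores, key=scores.get)


# Made-up training data, just to show the shape of the thing.
classifier = NaiveBayesClassifier()
classifier.train("fanfic about the doctor and the tardis, chapter one", "fannish")
classifier.train("a new fanvid and fanart rec list for the fandom", "fannish")
classifier.train("quarterly earnings report for the shareholders", "non-fannish")
classifier.train("weather forecast rain and wind tomorrow", "non-fannish")
print(classifier.classify("fanfic rec list"))
```

Every call to train() nudges the word counts, which is the "learns more with each input" part, and correct() is the "its decisions can be corrected" part: it takes a text back out of the wrong box and puts it in the right one.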

A Bayesian Classifier is going to be at the heart of Fansearch – deciding, as it crawls, what to do with the pages it finds. Fannish pages will be indexed and incorporated into the search engine. Non-fannish pages will be stored for a certain amount of time, until someone has had a chance to review them and decide if there’s any fannish content lurking.
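Here is a rough sketch of how that sorting step might look, assuming a very simple keyword index of the sort the Udacity course builds. Everything in it is hypothetical: the CrawlTriage class, the 30-day holding period, and especially keyword_classifier, which is a crude stand-in so the example runs on its own. The real thing would plug in a trained classifier instead.

```python
import time
from collections import defaultdict

# Assumption: unreviewed pages are kept for 30 days. The real figure is undecided.
REVIEW_RETENTION_SECONDS = 30 * 24 * 60 * 60


def keyword_classifier(text):
    """Placeholder for a real classifier: one hard-coded keyword."""
    return "fannish" if "fanfic" in text.lower() else "non-fannish"


class CrawlTriage:
    """Routes each crawled page to the index or to a review queue."""

    def __init__(self, classify, retention=REVIEW_RETENTION_SECONDS):
        self.classify = classify          # function: text -> label
        self.retention = retention        # seconds to hold non-fannish pages
        self.index = defaultdict(set)     # keyword -> set of URLs
        self.review_queue = {}            # url -> (text, time stored)

    def add_to_index(self, url, text):
        for word in text.lower().split():
            self.index[word].add(url)

    def handle_page(self, url, text, now=None):
        """Index fannish pages; hold everything else for human review."""
        now = time.time() if now is None else now
        if self.classify(text) == "fannish":
            self.add_to_index(url, text)
            return "indexed"
        self.review_queue[url] = (text, now)
        return "held"

    def approve(self, url):
        """A reviewer found fannish content in a held page, so index it."""
        text, _stored = self.review_queue.pop(url)
        self.add_to_index(url, text)

    def purge_expired(self, now=None):
        """Drop held pages nobody reviewed within the retention period."""
        now = time.time() if now is None else now
        expired = [url for url, (_text, stored) in self.review_queue.items()
                   if now - stored > self.retention]
        for url in expired:
            del self.review_queue[url]
        return expired

    def lookup(self, keyword):
        return sorted(self.index.get(keyword.lower(), set()))


demo = CrawlTriage(keyword_classifier)
print(demo.handle_page("http://example.org/story", "A fanfic about dragons"))
print(demo.handle_page("http://example.org/recipes", "Dragon themed recipes"))
print(demo.lookup("dragons"))
```

Passing the classifier in as a plain function keeps the crawling logic separate from the statistics, so the placeholder can be swapped out later without touching the rest.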

Getting that working is going to be hard enough, because I don’t yet understand the ins and outs of all of this – but I’ll get there. I will.

It should also be mentioned that there are important privacy concerns that are attached to any fandom project. How do you make sure you know who wants to be indexed and who doesn't? (How do you tell the difference between people who are happy to be indexed for fannish projects, but less happy to be indexed on the general world wide web?) How do you make sure your technology isn't accidentally "outing" people by linking wallet names to fannish pseudonyms? Privacy is a very real and critical concern for many fans, and of course I wouldn't want to violate that.

My Progress

I am stuck.

This is partly because I have to bug my boyfriend about uploading his code to the Git repository so that I can play with the classifier and see if I can work out how to train it.

It's also partly because this is a huge project and despite having worked out a vague starting-point, I still need to work out what I'm going to attack first and how I'm going to do these things. I need to learn so much more before I'm ready to really start developing this project.

Plus, I have no idea if anyone but me even wants a thing like this. Maybe I should, you know, ask. (That seems to require more of a solid pitch than I actually have, so I'm holding off on it for now.)

I think the next step, when I have the spoons, is to find out whether other people want this thing too. If so, maybe I can begin to put it together. Maybe I can even put together a little team.

We will see.