Many years ago, the IF Archive existed, and it was an FTP site. It lived at ftp.gmd.de. That was a long time ago. (1992, but who's counting.)
Slightly less long ago, the World Wide Web existed, and I said "I bet there could be a web mirror of the FTP site." So I (along with my co-conspirator Paul) built that as a holiday break project. It was just a static mirror of the files, with HTML index pages. I announced it on the rec.arts.int-fiction newsgroup.
A few days later, someone posted "How Do You Find Anything?"
Fair question. The favored answer was "Download the Master-Index text file and search through that." Nobody even mentioned the idea of a web search engine.
It wasn't a very complicated archive at that point, though. If you were looking for a game, you went to games/zcode or games/tads or whatever, and all the games were listed. Hierarchical folders; you probably knew where you were going.
More files and more folders were added over the years, but people still mostly knew their way around. Then Google turned up and that helped a lot. And then in 2007, Mike Roberts launched IFDB, which was an extremely searchable database of IF games -- with links to the IF Archive. So that solved the problem completely!
Mostly. Ish.
IFDB is very comprehensive for games, but it doesn't try to cover interpreters, zines, articles, or the rest of the eclectic material which the Archive has collected over the decades. Google is still okay for this purpose (with the "search web" option and site:ifarchive.org). But the idea of a locally hosted search tool kept coming up.
Last weekend I finally said, "Eh, how hard could this be?" Answer: not hard at all! I had a first draft working in about two days. Don't I feel silly now?
Behold: the IF Archive search page.
This is still a feature in progress. Should there be a mini-search bar on each page of the site? Should I be sorting results by date? More than ten results per result page? Suggestions welcome.
Here's the code for that search tool. I wound up using a Python search library called Whoosh. Whoosh is pretty old; it hasn't been much updated since 2016. But it works just fine. (There's a more recent fork called Whoosh-Reloaded but I have not dug into that.)
The nice thing about Whoosh is that it's not a web search engine per se. It's a library for full-text search of whatever data you've got. You feed in "documents", which are really just key-value collections. Whoosh builds an on-disk database of that data, and then it can search it very efficiently. I set up a script to feed in that Master-Index file I mentioned earlier, and Gretchen's your aunt.
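For the record, the core Whoosh flow is only a few lines. Here's a minimal sketch -- the field names are my invention for illustration, not the Archive's actual schema:

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Hypothetical schema: one "document" per Archive file.
schema = Schema(
    url=ID(stored=True, unique=True),   # where the search result links
    name=TEXT(stored=True),             # the filename
    desc=TEXT,                          # description text from Master-Index
)

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Feed in documents -- really just key-value collections.
writer = ix.writer()
writer.add_document(
    url="https://ifarchive.org/if-archive/games/zcode/example.z5",
    name="example.z5",
    desc="A hypothetical sample entry.",
)
writer.commit()

# Search the on-disk index.
with ix.searcher() as searcher:
    query = QueryParser("desc", ix.schema).parse("sample")
    for hit in searcher.search(query, limit=10):
        print(hit["url"], hit["name"])
```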
(In fact we keep an XML version of Master-Index. So the problem of parsing that into key-value data is already handled.) (Click on Master-Index.xml if you like, but it's 15 MB of XML and browsers aren't great at that.)
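For a file that size, streaming it with ElementTree's iterparse keeps memory flat. A sketch, assuming hypothetical element names (I haven't reproduced the real Master-Index.xml schema here):

```python
import xml.etree.ElementTree as ET

def iter_entries(path="Master-Index.xml"):
    # Yield one key-value dict per file entry, without loading
    # all 15 MB at once. Element names here are guesses.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "file":
            yield {
                "name": elem.findtext("name", ""),
                "desc": elem.findtext("description", ""),
            }
            elem.clear()  # release the subtree as we go
```

Each dict can then be handed to Whoosh's writer.add_document() more or less unchanged.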
I had to build a web-service wrapper for Whoosh. Well, hey, I built tinyapp for a different IFTF service; I'll just use that. It's a Flask-alike. (Yeah, I could have used Flask, but tinyapp is a habit now.)
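tinyapp's API isn't worth documenting here, but since it's a Flask-alike, the wrapper amounts to one endpoint. A plain-Flask illustration (the endpoint path and parameter names are my guesses, not the real service's):

```python
from flask import Flask, request, jsonify
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

app = Flask(__name__)
ix = open_dir("indexdir")  # the index built by the feed-in script

@app.route("/search")
def search():
    q = request.args.get("q", "")
    page = int(request.args.get("pg", "1"))
    with ix.searcher() as searcher:
        query = QueryParser("desc", ix.schema).parse(q)
        # search_page() handles the ten-per-page pagination.
        results = searcher.search_page(query, page, pagelen=10)
        hits = [{"url": hit["url"], "name": hit["name"]} for hit in results]
        return jsonify(hits=hits, total=results.total)
```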
Someone suggested Pagefind as an alternative. This is a client-side search script: a JS widget that loads its index data from the web server. Then there's a server-side component that generates that index data at build time. Unlike Whoosh, it works by scraping your static HTML.
This is an interesting tradeoff. You save the CPU cost of server-side search, but you pay the bandwidth cost of handing out the search index to each user. (Don't worry, it's segmented, not served in a giant lump.)
However, this model doesn't fit the Archive case very well. The Archive generates HTML files from Master-Index, so scraping the HTML back in to regenerate the index metadata is just clunky. (And requires a lot of tuning of the HTML.) Also, I want search items to refer to parts of an index page -- there are lots of items per index page. Whoosh doesn't see anything weird about that; the item's URL can have a #fragment, Whoosh doesn't care. But Pagefind is pretty solidly set on the idea that one page is one index item.
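To make the Whoosh side concrete: nothing stops several documents from pointing into the same index page. (URLs invented for illustration.)

```python
from whoosh.index import open_dir

ix = open_dir("indexdir")
writer = ix.writer()
# Two index items, one HTML page -- the fragment tells them apart.
writer.add_document(
    url="https://ifarchive.org/indexes/if-archive/games/zcode/#gamea",
    name="gamea.z5", desc="First item on the page.")
writer.add_document(
    url="https://ifarchive.org/indexes/if-archive/games/zcode/#gameb",
    name="gameb.z5", desc="Second item, same page.")
writer.commit()
```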
Anyhow, server load doesn't seem to be a problem yet. (Maybe this post will change that!) IFDB search is undoubtedly more useful for most people.
So, for many reasons, Whoosh it was.
(However, I'm considering Pagefind as a search tool for this blog! The DuckDuckGo search you see in the sidebar works okay -- but DDG is ad-supported, and ad-supported services gonna getcha sooner or later.)
While I'm on about the Archive, let me mention the development model.
It's a mess. A deep, historic mess. I originally wrote the Archive's index generator in 1999... in C, bless that child's heart. I converted that nightmare to Python in 2017, looks like, while I was moving the site over to IFTF hosting. But it was just a Python script that ran on some files. I tested it locally by running it on some test files.
Over time, more projects got folded into the Archive setup. There's the upload script, the admin tool, the Unbox service. Now search. Each one has its own git repository. And oh yeah, while I'm in there, shouldn't the front page and stylesheet be under version control too?
It's all pretty well organized on the server. (Except Unbox, which is well-organized on a second server.) However, to get everything installed, I just... shove the files into place. When I update a script or something, I shove it in where it goes. There's no automation at all. Every repo has a private NOTES or TODO file where I scribbled down what I did last time.
This is not good devops zen.
A couple of years ago, Dan Fabulich suggested Docker. I am fairly Docker-resistant, but he set up the admin tool repo with a Docker config to test that component (just that one) in a test container. It seemed workable.
I am now slowly constructing a testframe repository which will assemble all the Archive components in a test container. It includes the other repos as submodules. (I know, sigh, submodules.) You'll be able to browse a set of test files, search the files, upload files, etc -- all in a pocket universe, so to speak.
This isn't even slightly ready for road testing yet. The original index generator is the only part that works. I'm still feeling my way through the right way to construct it in Docker. (I've barely even scratched Docker-Compose.) I'm having to tweak many parts of the system to work in a test environment.
To be clear, I'm not working on Dockerizing the actual IF Archive. I am very conservative about making changes in production! The testframe is meant for testing and development, and also as documentation-by-model for how the Archive is configured. Once it's stable... I'll probably put it on the shelf and let the Archive keep running. I love things that just keep running on their own. Big life goal there.
But someday, there will be automation to build the production Archive setup too. As my partner said: this is how archivists deal with mortality.