Since the previous post we’ve succeeded in using tesseract and
we now have a nice plain text version of the EB entry on Shakespeare:
http://knowledgeforge.net/shakespeare/svn/trunk/shksprdata/ancillary/britannica-11th.txt
What we now need to do is ‘proof’ this to correct the OCR errors. This
kind of thing is perfect for distributed volunteers, so if you’d like to
help out just step up and start correcting one of the sections. To make it especially easy for people to make edits, the text has been placed in a temporary location on the Open Knowledge Foundation wiki (only the first five pages for the time being):
http://okfn.org/wiki/tmp/BritannicaShakespeare
September 19th, 2007
One of the next things we want to do for open shakespeare is provide an open
introduction to his works. The obvious idea for this was to use the
Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as
detailed in this ticket:
http://p.knowledgeforge.net/shakespeare/trac/ticket/24
We’ve now written code to grab the relevant tiffs off wikimedia:
http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py
You can also find them online (28 pages) starting at:
http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF
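For anyone who wants to fetch the scans themselves, here is a minimal sketch of how the 28 page URLs might be generated. It is not the code in eb.py. It assumes (and this is only an assumption) that the pages are numbered consecutively, so that the file after ED4A800.TIF is ED4A801.TIF and so on. The function name page_urls is made up for this example.

```python
import re


def page_urls(first_url, count):
    """Return `count` URLs starting at `first_url`.

    ASSUMPTION: pages are numbered consecutively, so the number just
    before the file extension is simply incremented for each page.
    """
    match = re.search(r"(\d+)(\.\w+)$", first_url)
    if match is None:
        raise ValueError("no trailing page number in %r" % first_url)
    digits, extension = match.groups()
    prefix = first_url[: match.start()]
    start = int(digits)
    # Keep the original zero-padding width of the page number.
    return [
        "%s%0*d%s" % (prefix, len(digits), start + offset, extension)
        for offset in range(count)
    ]


FIRST_PAGE = (
    "http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/"
    "VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF"
)
urls = page_urls(FIRST_PAGE, 28)
```

Each URL in the list could then be downloaded with something like urllib.request.urlretrieve(url, filename), provided the files are still hosted at that location.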
The next step is to OCR these scans (after that we can move on to
proofing, whether by ourselves or via http://pgdp.net). When we first had
a stab at this back in April we tried using gocr. Unfortunately the
results were so bad that they were unusable. Recently an old OCR engine
of HP’s has been released as open source under the name of tesseract:
http://code.google.com/p/tesseract-ocr/
We’re going to have a go using this — though if there is anyone out there with access to an alternative system we’d love to hear about it.
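As a rough sketch of what running tesseract over the pages could look like from python: the command line form `tesseract <image> <outputbase>` is tesseract's basic usage, and it writes its result to `<outputbase>.txt`. The helper names (tesseract_command, ocr_pages) and the directory layout are invented for this example, and it assumes the tesseract binary is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the argv for `tesseract <image> <outputbase>`.

    Tesseract appends `.txt` to the output base itself.
    """
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pages(image_paths, out_dir):
    """OCR each page in order and return all the text joined together."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = []
    for image_path in image_paths:
        output_base = out_dir / Path(image_path).stem
        # check=True raises CalledProcessError if tesseract fails on a page.
        subprocess.run(tesseract_command(image_path, output_base), check=True)
        text_file = Path(str(output_base) + ".txt")
        texts.append(text_file.read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pages(["ED4A800.TIF", "ED4A801.TIF"], "ocr-out") would leave one .txt file per page in ocr-out and return the combined text, ready for proofing.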
August 14th, 2007
A new version of open shakespeare is out. Get it via the code page:
http://www.openshakespeare.org/code/
Changelog
Outstanding Issues
- Annotation cannot handle long texts because of javascript performance
issues
About Open Shakespeare
A full open set of Shakespeare’s works along with ancillary material, a
variety of tools and a python API.
For more information see the about page:
http://www.openshakespeare.org/about/
Get involved: http://www.openshakespeare.org/participate/
Mailing list: http://lists.okfn.org/mailman/listinfo/okfn-discuss/
April 16th, 2007
After another push over the last few days I’ve got the web annotation system for Open Shakespeare operational (we’ve been hacking on this on and off since back in December).
To see the system in action visit:
http://demo.openshakespeare.org/view?name=phoenix_and_the_turtle_gut&format=annotate
Quite a bit of effort has been made to decouple the annotation system from Open Shakespeare so that it can be easily reused elsewhere. You can find the code for the annotation system (nicknamed annotater) here:
http://p.knowledgeforge.net/shakespeare/svn/annotater/trunk/
There are still some substantial issues with the Open Shakespeare implementation, the most obvious of which are:
a) large texts bring the javascript to its knees (The Phoenix and the Turtle is the shortest of Shakespeare’s works, which is why I’m using it).
b) security/user authentication for annotation adding/editing/deleting
But the basic system is working.
April 10th, 2007
Adding annotation support to the texts in Open Shakespeare is the main item for the next 0.4 release. This is a rather large undertaking, and the last 2 months have seen substantial work on the first stage: porting Geof Glass’ marginalia into a standalone python package named annotater that can then in turn be easily reused in Open Shakespeare.
The main work in porting annotater was twofold:
- To create an independent annotation store web application which reproduces the restful web interface needed by the marginalia javascript (we’ve also improved this by giving it a normal human-usable CRUD web interface in addition to the restful one)
- Plugging this together (aka debugging/hacking around) with the existing marginalia javascript (for example the paste-based WSGI store web app just would not process posts sent using x-www-form-urlencoded!)
Annotater is now fully functioning and we can entirely reproduce the basic demo in the original marginalia. The major difference is that our version has a proper store backend, so all creation/deletion updates of annotations get persisted to a real db and aren’t just held in memory (to try this out just start the demo wsgi app via $ python annotater.py).
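To give a flavour of what such a store involves, here is a minimal sketch of a WSGI annotation store that accepts x-www-form-urlencoded posts and persists to SQLite. It is not the annotater code, and the routes and field names (url, note) are hypothetical rather than marginalia's actual protocol. The call helper is only there so the app can be exercised without a server.

```python
import io
import json
import sqlite3
from urllib.parse import parse_qs


class AnnotationStore:
    """Minimal WSGI annotation store backed by SQLite.

    Routes (hypothetical, not marginalia's actual protocol):
      GET    /annotation       -> JSON list of annotations
      POST   /annotation       -> create from a form-encoded body
      DELETE /annotation/<id>  -> delete one annotation
    """

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS annotation "
            "(id INTEGER PRIMARY KEY, url TEXT, note TEXT)"
        )

    def __call__(self, environ, start_response):
        method = environ["REQUEST_METHOD"]
        parts = environ.get("PATH_INFO", "").strip("/").split("/")
        if parts[0] != "annotation":
            return self._reply(start_response, "404 Not Found", {"error": "not found"})
        if method == "GET" and len(parts) == 1:
            rows = self.db.execute("SELECT id, url, note FROM annotation ORDER BY id")
            items = [{"id": r[0], "url": r[1], "note": r[2]} for r in rows]
            return self._reply(start_response, "200 OK", items)
        if method == "POST" and len(parts) == 1:
            # Read exactly CONTENT_LENGTH bytes, then decode the form fields.
            length = int(environ.get("CONTENT_LENGTH") or 0)
            form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
            cursor = self.db.execute(
                "INSERT INTO annotation (url, note) VALUES (?, ?)",
                (form.get("url", [""])[0], form.get("note", [""])[0]),
            )
            self.db.commit()
            return self._reply(start_response, "201 Created", {"id": cursor.lastrowid})
        if method == "DELETE" and len(parts) == 2 and parts[1].isdigit():
            self.db.execute("DELETE FROM annotation WHERE id = ?", (int(parts[1]),))
            self.db.commit()
            return self._reply(start_response, "204 No Content", None)
        return self._reply(start_response, "405 Method Not Allowed", {"error": "unsupported"})

    @staticmethod
    def _reply(start_response, status, payload):
        body = b"" if payload is None else json.dumps(payload).encode("utf-8")
        headers = [("Content-Type", "application/json"),
                   ("Content-Length", str(len(body)))]
        start_response(status, headers)
        return [body]


def call(app, method, path, body=b""):
    """Invoke the WSGI app directly (no server) and return (status, body)."""
    seen = []
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, lambda status, headers: seen.append(status))
    return seen[0], b"".join(chunks)
```

Passing a file path instead of ":memory:" keeps annotations across restarts. A single SQLite connection like this is only suitable for a single-threaded demo server such as wsgiref.simple_server.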
The next step after this is to integrate annotater into open shakespeare along with doing any polishing up of the package that is needed to achieve this.
February 3rd, 2007
One of the main items scheduled for v0.4 of open shakespeare is improvements to the responsiveness of the concordance. With the v0.3 codebase and just the sonnets as test material, loading the list of words for the concordance alone took around 24s on my laptop. This is because even with a single text there are already over 18,000 items in the concordance and we were having to read through all of these to generate the list of words. Some recent commits (e.g. r:72) have gone some way to improving this responsiveness (loading the word list now takes 3s compared to 24s) but the result is not entirely satisfactory (printing full statistics takes 13s compared to 40s previously). One obvious way to go further is to use caching, either of individual web pages or of particular key parts such as all the distinct words occurring in the concordance. Caching works because the concordance only changes when new texts are added, which will usually only happen once, when the system is first initialised.
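The caching idea can be sketched in a few lines. This is an in-memory toy, not the open shakespeare concordance (which lives in a database), and the class and method names are invented for illustration. The point is simply that the list of distinct words is computed once and only thrown away when a text is added.

```python
class Concordance:
    """Maps each word to the (text name, line number) places it occurs."""

    def __init__(self):
        self._occurrences = {}   # word -> list of (text_name, line_number)
        self._word_list = None   # cached sorted list of distinct words

    def add_text(self, name, lines):
        for line_number, line in enumerate(lines, start=1):
            for word in line.lower().split():
                self._occurrences.setdefault(word, []).append((name, line_number))
        # Adding a text is the only thing that changes the word list,
        # so this is the only place the cache has to be invalidated.
        self._word_list = None

    def words(self):
        # Rebuild the sorted list only if it is missing; otherwise reuse it.
        if self._word_list is None:
            self._word_list = sorted(self._occurrences)
        return self._word_list
```

Since texts are normally loaded once at initialisation, every later request for the word list is served from the cached copy.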
Relatedly, r:74 is a first step on filtering the concordance, in this case to exclude roman numerals and various non-words. Doing this made me think about whether the concordance should be storing actual words or just stems: for example, it does not seem to make much sense to have different entries for kill, kills, killed etc. Using a stemming algorithm such as the Porter stemmer (which I notice has a nice python implementation directly available) we can easily stem each word as we go along. This would have several benefits, one of the most prominent being a dramatic reduction in the basic dictionary size (i.e. the number of distinct words in the concordance).
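Here is a small sketch of how filtering and stemming might fit together. It is not what r:74 does. The stemmer below is a deliberately naive suffix stripper standing in for a real Porter stemmer, and all the names (is_word, naive_stem, group_by_stem, the keep set) are made up for this example.

```python
import re
from collections import defaultdict

# Strict roman numeral pattern (1 to 3999), matched case-insensitively.
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})", re.IGNORECASE
)
# A word needs at least one letter; apostrophes and hyphens are allowed.
WORD = re.compile(r"[a-z'’-]*[a-z][a-z'’-]*", re.IGNORECASE)


def is_word(token, keep=frozenset({"i"})):
    """True for word-like tokens that are not roman numerals.

    `keep` lists real words that also parse as numerals. The default is
    an assumption for this sketch; "mix", for example, is also a valid
    numeral and would need adding.
    """
    if not WORD.fullmatch(token):
        return False
    if token.lower() in keep:
        return True
    return ROMAN.fullmatch(token) is None


def naive_stem(word):
    """Toy stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(tokens, stem=naive_stem):
    """Group the surviving words under their stem."""
    groups = defaultdict(set)
    for token in tokens:
        if is_word(token):
            word = token.lower()
            groups[stem(word)].add(word)
    return dict(groups)
```

With this, kill, kills and killed all land under the single entry kill, while tokens such as XIV or 1609 are dropped. A real stemmer can be swapped in through the stem argument, for instance the stem method of NLTK's PorterStemmer if that package is installed. Note the trade-off in keeping "i": the pronoun survives, but so does the numeral I used to number the first sonnet.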
January 3rd, 2007
We intend to add annotation/commentary support to the open shakespeare web demo either in this release or next. As a first step we’ve been looking to see what (open-source) web-based annotation systems are already out there. Below is our list of what we’ve been able to find so far (if you know of more please post a comment). After examining several of these in some detail, the one we’re going to try out properly is marginalia (if you’re interested, our current efforts to do this, including writing a python wsgi annotation service backend, can be found here in the subversion repository).
stet: javascript annotation system used for gpl v3 comments system
commentary: javascript based wsgi middleware developed by ian bicking
- http://pythonpaste.org/commentary/
- Rather hacked together (apparently he coded it in a week). Had problems getting it working locally and no documentation to help in adaptation. Seems to be unmaintained (demo site is currently down) which is perhaps not surprising given how many other projects Ian has on the go.
- One nice feature is that you don’t seem to have to mess with the underlying web pages you want to add comments to (this only works if you are sitting on top of another wsgi application)
marginalia: javascript library and spec for adding web annotation to pages
annotea: W3C project based on RDF
- http://www.w3.org/2001/Annotea/
- Been around a long time and now seems to be inactive
- Server and client support rather lacking. No simple interface based on, e.g., javascript — you have to write a special client yourself — which is a major drawback
- That said the protocol is well-documented and so writing a client (or a server) shouldn’t be that hard (other than having to mess around with rdf in javascript …)
- The Schema seems reasonable
- xpointer based which according to the marginalia site is a problem
December 18th, 2006
Today we made the switch from kid to genshi as our templating toolkit in the web interface. Kid has served us well but there are some issues with debugging and including input that can’t be guaranteed to be well-formed. Genshi, as a direct derivative of Kid, delivers very similar syntax but is both simpler and a little more flexible to use.
November 4th, 2006
We’d really like to have some nice images of a shakespeare first folio (if possible from Hamlet) for use in the Open Shakespeare project. However all the scanned copies we’ve managed to find seem to be under full ‘all rights reserved’ copyright.
For example there’s an online version from the Schoenberg Center for Electronic Text and Image at the University of Pennsylvania. But checking the printable version one finds the following:
©2003 Schoenberg Center for Electronic Text and Image
University of Pennsylvania Library.
And this isn’t exceptional. There’s a list of available online folios on:
http://ise.uvic.ca/Library/facsimile/overview/book.html
All of the copies listed are closed (copyrighted with no open license) — with most not allowing for any types of use without permission (the only exception being the State Library of New South Wales which allows for “educational, non-profit, purposes”).
It’s a rather unfortunate situation and it would be great to know if there is a scan of a shakespeare first folio out there which truly is open.
October 15th, 2006
A new version (0.3) of open shakespeare is out. Get it via the code page:
http://www.openshakespeare.org/code/
Changelog
Can now view multiple texts side by side (ticket:15). See it in action at:
http://demo.openshakespeare.org/view?name=othello_gut_f+othello_gut
Now include moby/bosak versions of shakespeare as well as gutenberg (ticket:10) (though more work remains to be done to process these versions to plaintext and html)
Fix bug whereby we were missing some of the available gutenberg texts (ticket:18)
Install the shakespeare python package (ticket:16)
Move to py.test from unittest
New project website at http://www.openshakespeare.org/
Outstanding Issues
- Several of the source texts (all of them Gutenberg folios) seem to break the viewer due to kid (the templating system) complaining about ‘not well-formed (invalid token) xml’. Any help in tracking this down would be greatly appreciated.
About Open Shakespeare
A full open set of Shakespeare’s works along with ancillary material, a variety of tools and a python API.
For more information see the about page: http://www.openshakespeare.org/about/
Mailing list: http://lists.okfn.org/mailman/listinfo/okfn-discuss/
October 4th, 2006