Open-source Software

Other intellectual property is available through the USC Stevens Center for Innovation.


Carmel is a finite-state transducer package written by Jonathan Graehl at USC/ISI. Carmel includes code for handling finite-state acceptors and transducers, weighted transitions, empty transitions on input and output, composition, k-most likely input/output strings, and both Bayesian (Gibbs sampling) and EM (forward-backward) training.

Tiburon is a tree transducer package written by Jonathan May at USC/ISI. Tiburon is designed to handle weighted regular tree grammars, context-free grammars, and both tree-to-tree and tree-to-string transducers. It can perform composition, intersection, application, determinization, inside/outside training, and pruning, and it can return the k most likely trees and Viterbi derivations, among other operations.

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on Web
The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach to exploiting reference sets for this extraction. The Phoebus system presents a machine learning approach exploiting reference sets.

BSL: A system for learning blocking schemes
Record linkage is the problem of determining the matches between two data sources. As data sources grow larger, this task becomes difficult and expensive. Blocking helps by efficiently generating candidate matches, which can then be examined in detail to determine whether they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable.
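
As a rough illustration of the blocking idea described above, the following Python sketch groups records by a hand-written key and compares only records that share a key. It is not BSL itself, which learns blocking schemes rather than relying on a fixed one. The field names (`last`, `zip`), the key function, and the sample records are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def blocking_key(record):
    # Hypothetical hand-written scheme: first three letters of the
    # lowercased last name, plus the zip code. BSL would learn a scheme
    # like this from data instead of hard-coding it.
    return (record["last"].strip().lower()[:3], record["zip"])


def candidate_pairs(source_a, source_b, key=blocking_key):
    # Index one source by blocking key so lookups are cheap.
    blocks = defaultdict(list)
    for rec in source_b:
        blocks[key(rec)].append(rec)
    # Only records that land in the same block become candidate matches.
    for rec_a in source_a:
        for rec_b in blocks.get(key(rec_a), []):
            yield rec_a, rec_b


# Made-up sample records for illustration only.
source_a = [
    {"id": "a1", "last": "Smith", "zip": "90292"},
    {"id": "a2", "last": "Nguyen", "zip": "90007"},
]
source_b = [
    {"id": "b1", "last": "Smithe", "zip": "90292"},
    {"id": "b2", "last": "Smith", "zip": "10001"},
    {"id": "b3", "last": "Nguyen", "zip": "90007"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(source_a, source_b)]
print(pairs)
```

With these sample records, only two of the six possible pairs are generated as candidates. A more expensive matcher would then examine those two in detail. The choice of key trades off the number of candidates against the risk of missing true matches, which is why learning a good scheme is useful.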

EIDOS: Efficiently Inducing Definitions for Online Sources
The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, web services, or RSS feeds. To make automated use of these sources, one first needs to model them semantically. Writing semantic descriptions for web sources by hand is both tedious and error-prone.

Digg 2009
This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of each voter and the timestamp of the vote. In addition, data about the friendship links of voters was collected from Digg.