Emsa HTML Tag Remover – Easy-to-use tag cleaner that can be run from a GUI or the command line. Activation code: 1760559.
Boost Tokenizer Package – Part of the Boost C++ Libraries. It provides classes for breaking a string into a sequence of tokens.
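For readers who want to see what tag cleaning and tokenizing amount to before installing either tool, here is a minimal Python sketch using only the standard library. It is not how Emsa HTML Tag Remover or Boost Tokenizer work internally; the names TagStripper, strip_tags and tokenize are hypothetical, and the token pattern (runs of letters, digits and apostrophes) is an assumption chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of html with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(parser.chunks)


def tokenize(text):
    """Split text into tokens: runs of letters, digits, or apostrophes (assumed pattern)."""
    return re.findall(r"[A-Za-z0-9']+", text)


if __name__ == "__main__":
    print(tokenize(strip_tags("<p>Hello <b>world</b>!</p>")))
```

Joining chunks with a space keeps words in neighbouring elements apart, at the cost of splitting a word that has markup in the middle of it.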
Part of Speech Taggers
TreeTagger – “The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart.” It can also be used for noun, verb, adverb, adjective and prepositional phrase chunking. Linux and Win32 binaries are available. Usable from the command line.
Stanford Tagger – A Log-Linear Part of Speech Tagger developed and maintained by Stanford. Usable through the command line and requires Java to run.
Minipar – An efficient English parser that processes about 300 words per second.
Stanford Parser – A statistical parser developed and maintained by Stanford. Requires Java and runs from the command line.
WordNet – English lexical database developed and maintained by Princeton. Contains the meanings of most nouns, verbs, adverbs and adjectives, and the relations among them.
Smart Stop Words – A list of words often discarded for efficiency in search engines.
VirtualBox – Open-source program for creating and running virtual machines.
Ubuntu – Free Linux-based operating system.
Kevin’s Word List – Various word lists and links to other collections of word lists.
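Word lists such as the stop-word list above are usually applied as a simple filter over tokens. The Python sketch below shows one way to do that. The function names are hypothetical, the built-in set is a tiny illustrative sample rather than the linked list, and the loader assumes a plain-text format with one word per line, which may not match the actual file.

```python
# Tiny illustrative sample only; NOT the contents of the linked stop-word list.
SAMPLE_STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "is", "in", "to"})


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (assumed: one word per line).

    Works with an open text file or a list of strings; blank lines are skipped.
    """
    return frozenset(line.strip().lower() for line in lines if line.strip())


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return tokens with stop words dropped, compared case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words(["The", "parser", "is", "written", "in", "Java"]))
```

A frozenset gives constant-time membership checks, which matters once the list grows to several hundred words.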
Artificial Intelligence Lab
University of Houston-Downtown | Computer & Mathematical Sciences