General Architecture for Text Engineering
(Translated and adapted from the English-language version of Wikipedia)
GATE (General Architecture for Text Engineering) is a software toolkit written in Java, developed at the University of Sheffield (UK) since 1995 and widely used around the world by many communities (scientists, companies, teachers, students) for natural language processing in various languages. The community of developers and researchers around GATE is involved in several European research projects, such as TAO (Transitioning Applications to Ontologies) and SEKT (Semantically Enabled Knowledge Technology).
GATE offers an architecture, an application programming interface (API) and a graphical development environment.
GATE includes an information extraction system, ANNIE (A Nearly-New Information Extraction System), itself made up of modules including a tokenizer, a gazetteer, a sentence splitter (with disambiguation), a part-of-speech tagger, a named-entity extraction module and a coreference detection module. The languages for which GATE is already implemented are English, Spanish, Chinese, Arabic, French, German, Hindi, Cebuano, Romanian and Russian. There are many plugins for machine learning (Weka, RASP, MAXENT, SVM Light), others for building ontologies (WordNet), for querying search engines such as Google and Yahoo, for part-of-speech tagging (Brill, TreeTagger), etc.
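The idea of a pipeline of modules that each add annotations over the same text can be sketched in plain Java. This is a toy illustration only, not GATE's API: the `MiniPipeline` class, the `Span` type, the three stage methods and the two-entry word list are all hypothetical names invented for this example, and the real ANNIE modules are far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniPipeline {
    // Hypothetical stand-off annotation: a typed character span over the text.
    static final class Span {
        final String type;
        final int start;
        final int end;
        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed word list; a real gazetteer loads much larger lists from files.
    static final Map<String, String> GAZETTEER =
            Map.of("Sheffield", "Location", "England", "Location");

    // Stage 1 (tokenizer): one "Token" span per run of letters or digits.
    static List<Span> tokenize(String text) {
        List<Span> out = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            out.add(new Span("Token", m.start(), m.end()));
        }
        return out;
    }

    // Stage 2 (gazetteer): re-label tokens whose text appears in the word list.
    static List<Span> lookup(String text, List<Span> tokens) {
        List<Span> out = new ArrayList<>();
        for (Span t : tokens) {
            String type = GAZETTEER.get(text.substring(t.start, t.end));
            if (type != null) {
                out.add(new Span(type, t.start, t.end));
            }
        }
        return out;
    }

    // Stage 3 (sentence splitter): naive split on '.', '!' and '?'.
    static List<Span> splitSentences(String text) {
        List<Span> out = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]*").matcher(text);
        while (m.find()) {
            int start = m.start();
            // Skip the whitespace that separates this sentence from the previous one.
            while (start < m.end() && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            if (start < m.end()) {
                out.add(new Span("Sentence", start, m.end()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Sheffield is in England. GATE is written in Java.";
        List<Span> tokens = tokenize(text);
        List<Span> all = new ArrayList<>(tokens);
        all.addAll(lookup(text, tokens));
        all.addAll(splitSentences(text));
        for (Span s : all) {
            System.out.println(s.type + " [" + s.start + "," + s.end + ") "
                    + text.substring(s.start, s.end));
        }
    }
}
```

Each stage only reads the text and the spans produced before it, so stages can be added, removed or reordered without touching the others, which is the property a modular architecture of this kind relies on.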
GATE accepts input in various text formats such as plain text, HTML, XML, Microsoft Word (.doc) and PDF, and supports various storage formats such as Java serialization, PostgreSQL, Lucene and Oracle, using RDBMS storage over JDBC.
GATE also uses the JAPE (Java Annotation Patterns Engine) language to write rules for annotating documents. It also provides a debugger and tools for comparing corpora and annotations.
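The principle behind a pattern-based annotation rule (match a sequence of existing annotations, then create a new annotation over the matched span) can be illustrated in plain Java. The sketch below does not use JAPE syntax or the GATE API: the `TitleRule` class, its `Span` type, the list of titles and the "title followed by a capitalised token gives a Person" rule are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRule {
    // Hypothetical stand-off annotation: a typed character span over the text.
    static final class Span {
        final String type;
        final int start;
        final int end;
        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed list of titles; purely illustrative.
    static final Set<String> TITLES = Set.of("Dr", "Prof", "Mr", "Mrs", "Ms");

    // Produce the input annotations: one "Token" span per run of letters or digits.
    static List<Span> tokenize(String text) {
        List<Span> out = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            out.add(new Span("Token", m.start(), m.end()));
        }
        return out;
    }

    // Pattern side: a title token immediately followed by a capitalised token.
    // Action side: emit one "Person" span covering both tokens.
    static List<Span> apply(String text, List<Span> tokens) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Span first = tokens.get(i);
            Span second = tokens.get(i + 1);
            boolean isTitle = TITLES.contains(text.substring(first.start, first.end));
            boolean capitalised = Character.isUpperCase(text.charAt(second.start));
            if (isTitle && capitalised) {
                out.add(new Span("Person", first.start, second.end));
                i++; // the second token is consumed by this match
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Dr Smith met Prof Jones in Sheffield.";
        for (Span p : apply(text, tokenize(text))) {
            System.out.println(p.type + ": " + text.substring(p.start, p.end));
        }
    }
}
```

A dedicated rule language lets such patterns be written declaratively rather than as hand-coded loops, which is the role the text attributes to JAPE.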
References
- Official site: Natural Language Processing group of the University of Sheffield
See also
- Unstructured Information Management Architecture (UIMA)
- Computational linguistics (natural language processing)
- Open source
- Free software