◎ ZHONG [|]: The HPSG/MRS-based Chinese Resource Grammar
As texts written in the various Chinese
languages gain in importance, the need
for Chinese language processing is
growing rapidly. As is well known,
"Chinese" refers to a set of languages
including Mandarin Chinese, Cantonese,
Min, etc. These languages share a large
portion of their grammatical components,
though their characters and lexical
items differ, and some syntactic
operations differ as well.
I am working on a project to
build an integrated computational
grammar for all these Chinese languages
within the HPSG and MRS framework,
namely ZHONG [|].
DELPH-IN wiki page
Github for download
◎ Parallel Text Annotation for Information Structure
As with other linguistic investigations, a deep analysis of
Information Structure requires the creation of language
resources, in which linguistic features related to the phenomena
in question are annotated in a fine-grained way. Languages use
different phonological, morphological, and syntactic means of
marking Information Structure in sentences, and for many
languages, the full range of Information Structure marking
possibilities remains unknown. Thus, the most comprehensive way
of delving into cross-linguistic structuring of information is to
analyze multilingual texts. Exploiting multilingual texts allows
us to determine how Information Structure strategies in different
languages are related to each other, as well as to find
systematic methods to identify topics and foci in monolingual
texts. This project ultimately aims to provide a fully annotated
multilingual treebank that covers Information Structure itself
and linguistic domains relevant to Information Structure. The
co-annotator of this project is Varya Gracheva (a native speaker
of Russian). So far, we have used "The Little Prince" as a running
text (see our poster
presented at HPSG2011), and
recently we turned to "Sherlock Holmes". We hope that our
corpus, which is named "bebo", will be readily and publicly
available. You can see our annotation guideline
which is developed from the SFB632
◎ Korean Resource Grammar
The Korean Resource Grammar (KRG) is an HPSG-based computational grammar, developed over the last several years by Prof. Jong-Bok Kim and Prof. Jaehyung Yang.
Since 2006, I've studied the HPSG framework and its implementation for Korean. Since Feb. 2009, I have officially participated in the construction of KRG. The current milestone of KRG has two parts. One part is purely for research purposes, and the other, for which I'm mainly responsible, aims to build the Korean HPSG Treebank for MT. You can see the whole project procedure here
and you can also try the online demo here
NARA is a database of the Sejong Korean-Japanese Bilingual Corpora
and a search system based on the database. Using the system, I would like to improve the grammatical accuracy of an individual language in comparison with other language resources, and also to write some papers on linguistic generalizations across Korean and Japanese. You can consult NARA HERE
- sponsored by the National Research Foundation of Korea
- as a Research Assistant
- May 2010 ~ Apr. 2011
The question we address in this research is how to acquire the 'argument structure' of verbal lexemes in Korean.
It is well known that manually building a type hierarchy usually costs too much time and too many resources, so an alternative method,
namely automatic collection of the relevant information, is much preferred.
This research proposes a procedure to automatically collect ARG-ST of Korean verbal lexemes from a Korean Treebank.
Specifically, the system we develop first extracts lexical information on the ARG-ST of verbal lexemes from a Korean Treebank of 0.8 million graphic words
in an unsupervised way, checks the hierarchical relationship among them, and builds up the type hierarchy automatically.
The result is written in an HPSG-style annotation, thus making it possible to readily implement the result in an HPSG-based parser for Korean.
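The three steps above (extract, check hierarchical relations, build the hierarchy) can be sketched roughly as follows. This is only a minimal illustration, not the system developed in this research: the toy observations, the case labels (`nom`, `dat`, `acc`), the function names, the root type `verb-lex`, and the decision to treat a frame as more general when its labels are a proper subset of another frame's labels are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical observations: (verb stem, case labels of its arguments).
# Stems and labels are illustrative, not taken from the treebank.
OBSERVATIONS = [
    ("ka-", ("nom",)),
    ("mek-", ("nom", "acc")),
    ("po-", ("nom", "acc")),
    ("cwu-", ("nom", "dat", "acc")),
]
LABEL_ORDER = ("nom", "dat", "acc")  # assumed fixed order for type names


def collect_frames(observations):
    """Step 1: group verbs by the set of argument labels they occur with."""
    frames = defaultdict(set)
    for verb, args in observations:
        frames[frozenset(args)].add(verb)
    return dict(frames)


def immediate_supertypes(frames):
    """Steps 2 and 3: link every frame to its closest more general frames.

    Assumption: frame A is more general than frame B when A's labels are a
    proper subset of B's. Only direct links are kept, not transitive ones.
    """
    hierarchy = {}
    for frame in frames:
        more_general = [f for f in frames if f < frame]  # proper subsets
        hierarchy[frame] = {
            s for s in more_general if not any(s < t for t in more_general)
        }
    return hierarchy


def type_name(frame):
    """Build a readable type name such as 'nom-acc-verb'."""
    return "-".join(label for label in LABEL_ORDER if label in frame) + "-verb"


frames = collect_frames(OBSERVATIONS)
for frame, supers in immediate_supertypes(frames).items():
    # 'verb-lex' is a hypothetical root type for frames without a supertype.
    parents = " & ".join(sorted(type_name(s) for s in supers)) or "verb-lex"
    members = ", ".join(sorted(frames[frame]))
    print(f"{type_name(frame)} := {parents}.  ; {members}")
```

Each printed line pairs a type with its direct supertypes in a TDL-like notation, followed by the verbs observed with that frame. A real grammar would also need to respect argument order and optionality, which this sketch ignores.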
- sponsored by the Korean Research Foundation
- as a Research Assistant
- Nov. 2008 ~ Oct. 2009
The Relationship between Semantic Similarity and Subcategorization Frames in English: A Stochastic Test Using ICE-GB and WordNet
Our team, COSMOS (Computational Semantics Lab.),
tests the working hypothesis that there is a significant relationship between semantic similarity and subcategorization frames in English, under the assumption that if a group of verbs forms a cluster sharing a similar meaning, they tend to share subcategorization frames.
In the process, we propose a statistical method to test this assumption, making use of two language resources, namely, ICE-GB and WordNet.
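To make the idea of such a test concrete, here is a small sketch of one possible approach: a Mantel-style exact permutation test that correlates pairwise semantic similarity with pairwise overlap of subcategorization frames. It is not the method proposed in this study. The frame labels and similarity scores are made-up toy values (not drawn from ICE-GB or WordNet), and the use of Jaccard overlap and Pearson correlation is an assumption for the example.

```python
from itertools import combinations, permutations

# Made-up toy data for illustration only (not from ICE-GB or WordNet).
FRAMES = {
    "give": {"NP", "NP_NP", "NP_PP"},
    "send": {"NP", "NP_NP", "NP_PP"},
    "think": {"S", "PP"},
    "believe": {"S", "NP"},
}
SIMILARITY = {  # hypothetical semantic similarity for each verb pair
    frozenset(("give", "send")): 0.9,
    frozenset(("think", "believe")): 0.8,
    frozenset(("give", "believe")): 0.2,
    frozenset(("send", "believe")): 0.2,
    frozenset(("give", "think")): 0.1,
    frozenset(("send", "think")): 0.1,
}


def jaccard(a, b):
    """Share of frames that two verbs have in common."""
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def exact_permutation_test(frames, similarity):
    """Return (observed correlation, one-sided p-value).

    Verb labels are permuted in the similarity table only, so the test asks
    how often a random relabeling correlates with frame overlap at least as
    strongly as the real labeling does.
    """
    verbs = sorted(frames)
    pairs = list(combinations(verbs, 2))
    overlap = [jaccard(frames[a], frames[b]) for a, b in pairs]

    def correlation(relabel):
        sims = [similarity[frozenset((relabel[a], relabel[b]))] for a, b in pairs]
        return pearson(sims, overlap)

    observed = correlation({v: v for v in verbs})
    relabelings = [dict(zip(verbs, p)) for p in permutations(verbs)]
    extreme = sum(correlation(r) >= observed - 1e-12 for r in relabelings)
    return observed, extreme / len(relabelings)


observed, p_value = exact_permutation_test(FRAMES, SIMILARITY)
print(f"correlation = {observed:.3f}, p = {p_value:.3f}")
```

Enumerating all relabelings is feasible only for a handful of verbs; with a realistic verb set one would sample random permutations instead.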
◎ SMT using the Sejong Bilingual Corpora
- as an intern at NiCT (in progress)
Machine translation currently leans toward a statistical approach, based on the so-called phrase-based model. Data-driven machine translation, i.e., machine learning of source-target mappings from bilingual data, began in the early 1990's. I have tried to apply this approach to Korean processing and to implement a basic architecture for English-Korean and Japanese-Korean machine translators, using the Sejong bilingual corpora.
Phrase-level alignment normally plays the key role in SMT (Statistical Machine Translation). There are two popular tools for this purpose: GIZA++
- as a Visiting Research Student at City University of Hong Kong
- 1 person, 2 months
I have studied how to employ treebanks in order to implement NLP systems and to capture linguistic generalizations. In recent years, those who build NLP systems grounded in language resources usually make use of treebanks to extract a lexicon or to design linguistically informed models. A stochastic syntactic parser, for instance, has to be based on the probabilistic information that treebanks offer; otherwise, the parser cannot be developed robustly. I implemented a program module for this purpose, namely Xavier.
Xavier is designed to extract linguistic information from treebanks automatically. Written in ANSI C++, it provides a research environment for those who need a general, multi-purpose extraction tool.
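As a rough illustration of what this kind of extraction involves, the sketch below reads one bracketed tree and counts lexical entries and phrase-structure rules, which are the raw counts a stochastic parser would start from. It is not Xavier's code (Xavier is written in C++, and this sketch is in Python). The Penn-style bracketing, the example sentence, and all function names are assumptions made for the example.

```python
import re
from collections import Counter


def parse_tree(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # word at a leaf
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()


def count_lexicon(tree, counts=None):
    """Count (word, tag) pairs found at the leaves."""
    counts = Counter() if counts is None else counts
    label, children = tree
    for child in children:
        if isinstance(child, str):
            counts[(child, label)] += 1
        else:
            count_lexicon(child, counts)
    return counts


def count_rules(tree, counts=None):
    """Count phrase-structure rules (parent label, child labels)."""
    counts = Counter() if counts is None else counts
    label, children = tree
    subtrees = [c for c in children if not isinstance(c, str)]
    if subtrees:
        counts[(label, tuple(c[0] for c in subtrees))] += 1
        for sub in subtrees:
            count_rules(sub, counts)
    return counts


# Illustrative input, not taken from any particular treebank.
TREE = "(S (NP (DT the) (NN parser)) (VP (VBZ reads) (NP (DT the) (NN treebank))))"
tree = parse_tree(TREE)
print(count_lexicon(tree).most_common())
print(count_rules(tree).most_common())
```

Dividing each rule count by the total count of its parent label would give relative-frequency estimates of rule probabilities.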