:::::: Sanghoun Song's Homepage ::::::

(E-MAIL) sanghoun

korea.ac.kr

◎ ZHONG [|]: The HPSG/MRS-based Chinese Resource Grammar

As texts written in the various Chinese languages become increasingly important, so does the need for Chinese language processing. As is well known, "Chinese" refers to a set of languages that includes Mandarin Chinese, Cantonese, Min, and others. These languages share a large portion of their grammar, though their characters and lexical items differ, as do some syntactic operations. I am working on a project to build an integrated computational grammar for all of these Chinese languages within the HPSG and MRS framework: namely, ZHONG [|].
DELPH-IN wiki page
GitHub for download

◎ Parallel Text Annotation for Information Structure

As with other linguistic investigations, a deep analysis of Information Structure requires the creation of language resources in which the linguistic features related to the phenomena in question are annotated in a fine-grained way. Languages use different phonological, morphological, and syntactic means of marking Information Structure in sentences, and for many languages the full range of Information Structure marking possibilities remains unknown. Thus, the most comprehensive way of investigating the cross-linguistic structuring of information is to analyze multilingual texts. Exploiting multilingual texts allows us to determine how Information Structure strategies in different languages are related to each other, as well as to find systematic methods of identifying topics and foci in monolingual texts. This project ultimately aims to provide a fully annotated multilingual treebank that covers Information Structure itself and the linguistic domains relevant to it. The co-annotator of this project is Varya Gracheva (a native speaker of Russian). So far, we have used "The Little Prince" as a running text (see our poster presented at HPSG2011), and we have recently turned to "Sherlock Holmes". We hope that our corpus, named "bebo", will be made readily and publicly available. You can see our annotation guidelines, which were developed from the SFB632 project's guidelines.

◎ Korean Resource Grammar

The Korean Resource Grammar (KRG) is an HPSG-based computational grammar that Prof. Jong-Bok Kim and Prof. Jaehyung Yang have developed over the last several years.
Since 2006, I have studied the HPSG framework and its implementation for Korean, and since Feb. 2009 I have officially participated in the construction of the KRG. The current milestone of the KRG has two parts. One is purely for research purposes; the other, for which I am mainly responsible, aims to build the Korean HPSG Treebank for MT. You can see the whole project procedure here and you can also try the online demo here.

◎ NARA

NARA is a database of the Sejong Korean-Japanese Bilingual Corpora and a search system based on the database. Using the system, I would like to improve the grammatical accuracy of an individual language in comparison with other language resources, and also write some papers on linguistic generalization between Korean and Japanese. You can consult NARA HERE.

◎ KRF2010

- sponsored by the National Research Foundation of Korea
- as a Research Assistant
- May 2010 ~ Apr. 2011

The question we address in this research is how to acquire the ‘argument structure’ of verbal lexemes in Korean. It is well known that building a type hierarchy manually usually costs too much time and too many resources, so an alternative method, namely the automatic collection of the relevant information, is much preferred. This research proposes a procedure to automatically collect the ARG-ST of Korean verbal lexemes from a Korean Treebank. Specifically, the system we develop first extracts the ARG-ST information of verbal lexemes from a Korean Treebank of 0.8 million graphic words in an unsupervised way, then checks the hierarchical relationships among the extracted items, and finally builds the type hierarchy automatically. The result is written in an HPSG-style annotation, so it can be readily implemented in an HPSG-based parser for Korean.
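
The three steps above (extract, check hierarchical relationships, build the hierarchy) can be pictured with a toy sketch. It is not the project's actual system: the input pairs, the romanized verb stems, the type-naming scheme, and above all the criterion that one frame is more general than another when its arguments form a proper subset are assumptions made only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy input: (verb stem, ARG-ST) pairs as they might be read off a treebank.
OBSERVATIONS = [
    ("ca-", ("NP[nom]",)),
    ("mek-", ("NP[nom]", "NP[acc]")),
    ("po-", ("NP[nom]", "NP[acc]")),
    ("cwu-", ("NP[nom]", "NP[dat]", "NP[acc]")),
]

def collect_frames(observations):
    """Step 1: group verbs by the set of arguments they were observed with."""
    frames = defaultdict(set)
    for verb, args in observations:
        frames[frozenset(args)].add(verb)
    return dict(frames)

def immediate_supertypes(frames):
    """Step 2: link each frame to its closest more general frames.

    Assumption for this sketch: frame A is more general than frame B
    when A's arguments are a proper subset of B's.
    """
    parents = {}
    for frame in frames:
        more_general = [f for f in frames if f < frame]
        # Drop candidates that have another candidate between them and `frame`.
        parents[frame] = [
            f for f in more_general
            if not any(f < g for g in more_general)
        ]
    return parents

def type_name(frame):
    """Name a type after its sorted argument labels (hypothetical naming scheme)."""
    return "v-" + "+".join(sorted(frame))

# Step 3: emit the hierarchy as {type name: [immediate supertype names]}.
frames = collect_frames(OBSERVATIONS)
hierarchy = {
    type_name(frame): sorted(type_name(p) for p in supertypes)
    for frame, supertypes in immediate_supertypes(frames).items()
}

for name, supertypes in hierarchy.items():
    print(name, ":=", " & ".join(supertypes) or "verb")
```

In this toy data the ditransitive frame ends up directly under the transitive one, which in turn sits under the intransitive one; a real system would need a linguistically motivated subsumption test rather than plain set inclusion.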

◎ KRF2008

- sponsored by the Korea Research Foundation
- as a Research Assistant
- Nov. 2008 ~ Oct. 2009

The Relationship between Semantic Similarity and Subcategorization Frames in English: A Stochastic Test Using ICE-GB and WordNet
Our team, COSMOS (Computational Semantics Lab.), tests the working hypothesis that there is a significant relationship between semantic similarity and subcategorization frames in English, under the assumption that if a group of verbs forms a cluster sharing a similar meaning, those verbs tend to share subcategorization frames. In the process, we propose a statistical method to test this assumption, using two language resources: ICE-GB and WordNet.

◎ SMT using the Sejong Bilingual Corpora

- as an intern at NiCT (in progress)

Currently, machine translation leans toward a statistical approach based on the so-called phrase-based model. Data-driven machine translation, that is, machine learning of the source-target mapping from bilingual data, began in the early 1990s. I have tried to apply this approach to Korean processing and to implement a basic architecture for English-Korean and Japanese-Korean machine translators, using the Sejong bilingual corpora. Phrase-level alignment normally holds the key position in SMT (Statistical Machine Translation). There are two popular tools for this purpose: GIZA++ and malign.py.
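
To give a feel for what the alignment step learns, here is a minimal sketch of IBM Model 1, the simplest of the word alignment models that GIZA++ trains and from which phrase pairs are typically extracted afterwards. It is not the pipeline used in this project: the three sentence pairs are the familiar English-German textbook toy example rather than Sejong data, the function names are made up for illustration, and the NULL source word is left out for brevity.

```python
from collections import defaultdict

# Toy parallel corpus (textbook-style example, not from the Sejong corpora).
CORPUS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]

def train_ibm_model1(corpus, iterations=10):
    """Estimate translation probabilities t[(tgt, src)] = P(tgt | src) with EM."""
    target_vocab = {w for _, tgt_sent in corpus for w in tgt_sent}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (tgt, src) links
        total = defaultdict(float)   # expected count of links per src word
        # E-step: spread each target word over the source words of its sentence.
        for src_sent, tgt_sent in corpus:
            for tgt in tgt_sent:
                norm = sum(t[(tgt, src)] for src in src_sent)
                for src in src_sent:
                    frac = t[(tgt, src)] / norm
                    count[(tgt, src)] += frac
                    total[src] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (tgt, src), c in count.items():
            t[(tgt, src)] = c / total[src]
    return t

def align(src_sent, tgt_sent, t):
    """Link each target word to its most probable source word."""
    return [(tgt, max(src_sent, key=lambda src: t[(tgt, src)])) for tgt in tgt_sent]

t = train_ibm_model1(CORPUS)
print(align(["the", "house"], ["das", "Haus"], t))
print(align(["a", "book"], ["ein", "Buch"], t))
```

Even on three sentence pairs, co-occurrence alone is enough for EM to separate "das" from "Haus"; real corpora additionally need the NULL word, the higher IBM models, and symmetrisation before phrase extraction.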

◎ Xavier

- as a Visiting Research Student at City University of Hong Kong
- 1 person, 2 months

I have studied how to employ treebanks in order to implement NLP systems or to capture linguistic generalizations. In recent years, those who seek to build NLP systems grounded in language resources usually make use of treebanks to extract a lexicon or to design linguistically motivated models. A stochastic syntactic parser, for instance, has to be based on the probabilistic information that treebanks offer; otherwise, the parser cannot be developed robustly. I implemented a program module for this purpose, namely Xavier.
Xavier is designed to extract linguistic information from treebanks automatically. Written in ANSI C++, Xavier can provide a research environment for those who need a general, multi-purpose extraction tool.
[▶ DOWNLOAD]
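
As a rough picture of the kind of extraction described above, the sketch below reads bracketed trees and counts the phrase-structure rules they contain, which is the raw material a stochastic parser needs. It is written in Python for brevity and is not Xavier's code or interface: the Penn-style bracket format, the two toy trees, and all function names are assumptions made for this illustration.

```python
from collections import Counter

def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def rules(tree):
    """Yield one (lhs, rhs) production for every node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

# Hypothetical two-sentence treebank.
TREEBANK = [
    "(S (NP (N Kim)) (VP (V sleeps)))",
    "(S (NP (N Lee)) (VP (V reads) (NP (N books))))",
]

rule_counts = Counter(r for line in TREEBANK for r in rules(parse_tree(line)))

for (lhs, rhs), n in rule_counts.most_common():
    print(f"{n}  {lhs} -> {' '.join(rhs)}")
```

Dividing each count by the total for its left-hand side would turn these counts into the relative-frequency estimates of a probabilistic grammar.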