Ph.D. in Computational Linguistics
(E-MAIL) sanghoun
Experimental Syntax
One of the main interests in my recent work is developing a sound research methodology for acceptability judgment testing. Acceptability judgment testing refers to conducting an experiment to measure how native speakers judge a set of sentences in a human language. Native speakers' acceptability judgments have played an important part in the study of syntax in that they provide empirical evidence for theories of generative grammar. Accordingly, acceptability judgment testing has attracted many linguists in recent years. Linguists now appear to have reached a consensus that extensive measurement of native speakers' intuitions is crucial for evaluating a syntactic argument. For more information, see the following papers and data.
[ FAQ (2014)]     [ Different Ways (2015)] (not published yet)    
HuskeyStemmer is a rule-based stemmer for English words. It was implemented in Java, and all of the source code and the dictionary file are freely distributed. A standalone version (jar) is also included. HuskeyStemmer reads its input from STDIN and writes its output to STDOUT.
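As a rough illustration of what a rule-based stemmer with a dictionary file does, here is a minimal Python sketch. It is not HuskeyStemmer's code: the exception dictionary, the suffix rules, the minimum stem length, and the function names (`stem`, `stem_line`, `run`) are all assumptions made up for this example.

```python
# Hypothetical exception dictionary: irregular forms mapped to their stems.
EXCEPTIONS = {"went": "go", "children": "child", "better": "good"}

# Hypothetical suffix rules, tried in order: (suffix, replacement).
# The ("ss", "ss") rule protects words such as "glass" from the plain "s" rule.
RULES = [("ies", "y"), ("sses", "ss"), ("ss", "ss"),
         ("ing", ""), ("ed", ""), ("s", "")]

MIN_STEM = 3  # assumption: do not strip if fewer than 3 characters would remain


def stem(word):
    """Return the stem of one word: dictionary lookup first, then suffix rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem_line(line):
    """Stem every whitespace-separated token on a line."""
    return " ".join(stem(token) for token in line.split())


def run(stream_in, stream_out):
    """Filter a text stream line by line, e.g. run(sys.stdin, sys.stdout)."""
    for line in stream_in:
        stream_out.write(stem_line(line) + "\n")
```

Calling `run(sys.stdin, sys.stdout)` after `import sys` turns the sketch into a STDIN/STDOUT filter in the same spirit as the tool described above.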
Korean "noun+verb" Idiomatic Compounds
State-of-the-art work in computational linguistics pays attention to lexical semantics because it has the potential to improve language processing systems in terms of both coverage and accuracy. In particular, the use of multiword expressions is regarded as an important component for improving the performance of language applications. Handling these expressions is particularly crucial in multilingual processing, such as machine translation. Among the variety of multiword expressions, the present study investigates noun+verb idiomatic compounds in Korean. These compounds are made up of a verb plus the verb's syntactic object, and what the combination of the two words conveys is not equivalent to the sum of the meanings of the parts. For this reason, the meaning has to be registered independently in the dictionary as lexical information. This paper presents how such compounds can be acquired in a fully automatic way. The results are obtained by exploiting a syntax-annotated corpus (i.e., a treebank) and three wordnets in Korean.
Korean Past Morphemes in Bitexts
In human language, a single linguistic form can often be used to convey different meanings. One of the most representative forms involving such a mismatch in Korean is the verbal inflectional morpheme (e/a)ss. This morpheme marks the past tense by default, but in quite a few cases it does not necessarily denote an event that happened in the past. This corpus study probes the mismatch that the Korean past tense morpheme (e/a)ss exhibits, taking a comparative corpus linguistics approach. To ground the findings in data, the present study explores a bilingual parallel corpus in which each sentence in one language is aligned with the corresponding sentence in the other language. The parallel data used in the current work is the Sejong English-Korean Bilingual Corpus. Exploring the parallel data, this corpus study provides a quantitative analysis of (i) which linguistic form in English the past tense morpheme in Korean corresponds to and (ii) which verbs are more frequently associated with the mismatch.
Multiple Case Marking System in Korean
Exploiting the Sejong Spoken Corpus, we extracted 1,021 sentences in which the nominative marker -i/ka or the accusative marker -ul/lul occurs twice or more. These sentences were annotated with respect to 47 linguistic parameters that previous studies assume to interact with multiple case-marking constructions. These parameters are divided into five subgroups: (i) distribution, (ii) semantic relation, (iii) nominal category, (iv) predication, and (v) discourse. The constructed data are analyzed numerically, and their content characteristics are also examined. The numerical analysis looks into the proportion of each parameter and the correlation between pairs of parameters. The content analysis focuses on how multiple case-marking constructions are realized in naturally occurring conversations. The whole dataset constructed in this study will be made readily available so that other linguists can use it for their own research purposes.
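The two numerical measures mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it treats each parameter as a binary (0/1) annotation and uses the phi coefficient as the correlation measure, neither of which is confirmed by the description above. The parameter names and the four toy rows are invented for the example and are not taken from the dataset.

```python
from math import sqrt


def proportion(rows, param):
    """Share of annotated sentences in which `param` is marked 1."""
    return sum(row[param] for row in rows) / len(rows)


def phi(rows, a, b):
    """Phi coefficient between two binary parameters (assumed measure)."""
    n11 = sum(1 for r in rows if r[a] and r[b])
    n10 = sum(1 for r in rows if r[a] and not r[b])
    n01 = sum(1 for r in rows if not r[a] and r[b])
    n00 = sum(1 for r in rows if not r[a] and not r[b])
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Return 0.0 when a parameter never varies, since phi is undefined there.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Toy annotations with hypothetical parameter names, one dict per sentence.
TOY_ROWS = [
    {"animate_np": 1, "stative_pred": 1},
    {"animate_np": 1, "stative_pred": 1},
    {"animate_np": 0, "stative_pred": 0},
    {"animate_np": 1, "stative_pred": 0},
]

print(proportion(TOY_ROWS, "animate_np"))
print(round(phi(TOY_ROWS, "animate_np", "stative_pred"), 3))
```

With 47 parameters, the same `phi` function could be applied to every pair of columns to build a correlation table.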
[ PAPER]       [ DATA]
Xavier is designed to extract linguistic information from treebanks automatically. Written in ANSI C++, Xavier provides a research environment for those who need a general, multi-purpose extraction tool.
SMT baselines (Sejong Bilingual Corpora)
These results, drawn from the Sejong Korean-English and Korean-Japanese Bilingual Corpora, were built using the GIZA++ and Moses toolkits (factorless 5-gram language model).
[ DOWNLOAD]     [ PAPER] with Sejong Korean-Japanese (POS-tagged) Bilingual Corpus
These results were built using the module (iterations: 1,000; 5-gram language model).
Several Downloadables
Semantic Hierarchy of Korean Adjectives
I built this in 2006 on the basis of the Yonsei Korean Dictionary. For more information, see my paper.
Toy Parser in Java
A rule-based Korean syntactic parser based on six major algorithms (Earley, Tomita, RTN, CKY, Chart, and Left-corner).
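Of the six algorithms listed, CKY is the most compact to show. The following is a minimal Python sketch of a CKY recognizer, not the toy parser's own code. It assumes a grammar in Chomsky normal form, and the tiny English grammar (`LEXICAL`, `BINARY`) and the function name `cky_recognize` are hypothetical, chosen only for illustration.

```python
def cky_recognize(tokens, lexical, binary, start="S"):
    """Return True if `tokens` can be derived from `start` under a CNF grammar."""
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that derive tokens[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Width-1 spans come straight from the lexicon.
    for i, token in enumerate(tokens):
        chart[i][i + 1] = set(lexical.get(token, ()))
    # Wider spans combine two adjacent, already filled sub-spans.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]


# Hypothetical toy grammar: word -> categories, (left, right) -> parents.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
BINARY = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

print(cky_recognize("the dog chased the cat".split(), LEXICAL, BINARY))
```

A full parser would also store back-pointers in each chart cell so that the parse trees can be read off afterwards; the sketch only answers whether a parse exists.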
Distribution Table of Korean Verbs and Adjectives
This was extracted from the Sejong POS-tagged Corpora (12.5 million).
Online Language Resources
The Sejong Electronic Dictionary - Adjectives
Here, you can consult various features of Korean adjectives provided by the Sejong dictionary. This page displays the XML data through XSLT. (EUC-KR)
This document aims to introduce WordNet to beginners (e.g., undergraduate students). I have therefore tried to make this manual as readable as possible.
LOGON Bootable USB
This can be useful if you are a Windows user but want to set up your own Linux environment for LKB. You can make your USB storage Linux-bootable and run LKB from it.
Korean [ DOWNLOAD]       English [ DOWNLOAD]
Starting LOGON
With the LOGON bootable USB above, you can run LOGON, test several grammars, or write your own grammar.
Korean [ DOWNLOAD]
LATEST UPDATE: October 20, 2015
© 2009 Sanghoun Song