◎ Experimental Syntax
One of the main interests in my recent work is developing a rigorous research methodology for acceptability judgment testing.
Acceptability judgment testing refers to conducting an experiment that measures how native speakers judge a set of sentences in a human language. Native speakers' acceptability judgments have played an important part in the study of syntax in that they provide empirical evidence for the theory of generative grammar. Along these lines, acceptability judgment testing has recently attracted the attention of many linguists. Linguists now appear to have reached a consensus that extensive measurement of native speakers' intuitions is crucial for evaluating a syntactic argument.
For more information, see the following papers and data.
[▶ FAQ (2014)]
[▶ Different Ways (2015)] (not published yet)
HuskeyStemmer is a rule-based stemmer for English words. The software is implemented in Java, and all the source code and the dictionary file are
freely distributed. A standalone version (jar) is also included. HuskeyStemmer reads its input from STDIN and writes its output to STDOUT.
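As a rough illustration of what a rule-based stemmer working as a stream filter looks like, here is a minimal Python sketch. It is not HuskeyStemmer's code: the suffix rules, the exception dictionary, the minimum stem length, and the names `stem` and `run_filter` are all assumptions made for this example.

```python
import io

# Hypothetical suffix rules, tried in order. These are NOT the rules or
# the dictionary shipped with HuskeyStemmer.
SUFFIX_RULES = [
    ("ies", "y"),  # "studies" -> "study"
    ("ing", ""),   # "testing" -> "test"
    ("ed", ""),    # "judged"  -> "judg" (deliberately crude)
    ("s", ""),     # "papers"  -> "paper"
]

# Hypothetical exception dictionary for irregular forms.
EXCEPTIONS = {"went": "go", "children": "child"}

# Assumed threshold: a rule is skipped if it would leave a shorter stem.
MIN_STEM_LENGTH = 3


def stem(word):
    """Return a lowercased stem using the dictionary first, then the rules."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    for suffix, replacement in SUFFIX_RULES:
        if lower.endswith(suffix):
            candidate = lower[: len(lower) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return lower


def run_filter(source, sink):
    """Read whitespace-separated tokens line by line and write their stems."""
    for line in source:
        sink.write(" ".join(stem(token) for token in line.split()) + "\n")


# Demo on in-memory streams.
demo_out = io.StringIO()
run_filter(io.StringIO("Studies papers\n"), demo_out)
print(demo_out.getvalue(), end="")  # prints: study paper
```

Calling `run_filter(sys.stdin, sys.stdout)` instead (after `import sys`) would turn the sketch into a command-line filter in the STDIN/STDOUT style described above.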
◎ Korean "noun+verb" Idiomatic Compounds
State-of-the-art work in computational
linguistics pays attention to lexical semantics, because it has the potential
to improve language processing systems in terms of both coverage and accuracy.
In particular, the use of multiword expressions is regarded as an important
component for improving the performance of language applications. Handling these expressions
is particularly crucial in multilingual processing, such as machine translation.
Among the variety of multiword expressions, the present study investigates “noun+verb”
idiomatic compounds in Korean. These compounds are made up of a verb plus the
verb’s syntactic object, and what the combination of the two words conveys is not
equivalent to the sum of the meanings of the parts. For this reason, the meaning
has to be independently registered in the dictionary as lexical information. This paper
presents how such compounds can be acquired in a fully automatic way.
The results are obtained by exploiting a syntax-annotated corpus (i.e. a treebank) and three wordnets.
◎ Korean Past Morphemes in Bitexts
In human language,
a single linguistic form can often
be used to convey different meanings. One of the most representative forms
involving such a mismatch in Korean is the verbal inflectional morpheme
(e/a)ss. This morpheme expresses the past tense by default, but in quite
a few cases it does not necessarily denote an event that happened in the
past. This corpus study probes the mismatch that the Korean past tense morpheme
(e/a)ss exhibits from the perspective of comparative corpus linguistics. To
obtain data-driven findings, the present study explores a
bilingual parallel corpus in which a sentence in one language is aligned to the
corresponding sentence in the other language. The parallel data used in the current work
is the Sejong English-Korean Bilingual Corpus. Exploring the
parallel data, this corpus study provides a quantitative analysis of (i) which
linguistic form in English the past tense morpheme in Korean corresponds to
and (ii) which verbs are more frequently associated with the mismatch.
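The two counts in (i) and (ii) can be pictured with a small Python sketch. It rests on assumptions: the records are hand-made toy data rather than Sejong corpus entries, the English form labels and the `PAST_FORMS` set are hypothetical, and it assumes verb-level alignment is already available, whereas the study only describes sentence-level alignment.

```python
from collections import Counter

# Toy, hand-made records (hypothetical; NOT drawn from the Sejong corpus).
# Each record: (Korean verb stem, has_past_morpheme, aligned English form).
ALIGNED_VERBS = [
    ("ka", True, "simple_past"),
    ("mek", True, "simple_past"),
    ("ip", True, "present_progressive"),
    ("talm", True, "simple_present"),
    ("ip", True, "simple_present"),
    ("ka", False, "simple_present"),
]

# Assumption: only these English labels count as a past-tense match.
PAST_FORMS = {"simple_past"}


def english_form_distribution(records):
    """(i) Count the English forms aligned with verbs marked by (e/a)ss."""
    return Counter(form for _, has_past, form in records if has_past)


def mismatch_counts_by_verb(records):
    """(ii) Count, per verb, the (e/a)ss tokens aligned to a non-past form."""
    return Counter(
        verb
        for verb, has_past, form in records
        if has_past and form not in PAST_FORMS
    )


print(english_form_distribution(ALIGNED_VERBS))
print(mismatch_counts_by_verb(ALIGNED_VERBS).most_common())
```

On real data, the labels would come from whatever tagging scheme the corpus provides, and the mismatch counts would normally be normalized by each verb's overall frequency.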
◎ Multiple Case Marking System in Korean
Exploiting the Sejong
Spoken Corpus, we extracted 1,021 sentences in which the nominative marker
‘-i/ka’ or the accusative marker ‘-ul/lul’ occurs twice or more. These sentences
were annotated with respect to 47 linguistic parameters, which previous studies
assume to interact with multiple case-marking constructions. These parameters
are divided into five subgroups: namely, (i) distribution, (ii) semantic relation,
(iii) nominal category, (iv) predication, and (v) discourse. The constructed data
are analyzed numerically, and their content characteristics are also examined. The
numerical analysis looks into the proportion of each parameter and the correlation
between pairs of parameters. The content analysis focuses on how multiple
case-marking constructions are realized in naturally occurring conversations. The
whole dataset constructed in this study will be made readily available so that
other linguists can use it for their own research purposes.
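To make the numerical analysis concrete, the following Python sketch computes a parameter's proportion and the correlation between two parameters on a toy annotation table. Everything in it is an assumption: the parameter names are invented, the annotations are treated as binary, and the phi coefficient is chosen here only for illustration, since the description above does not say which correlation measure was used.

```python
import math

# Toy annotation table (hypothetical parameter names and values).
# 1 means the parameter holds for the sentence, 0 means it does not.
SENTENCES = [
    {"adjacent_markers": 1, "possessive_relation": 1},
    {"adjacent_markers": 1, "possessive_relation": 1},
    {"adjacent_markers": 1, "possessive_relation": 0},
    {"adjacent_markers": 0, "possessive_relation": 0},
]


def proportion(rows, param):
    """Share of sentences for which the binary parameter is 1."""
    return sum(row[param] for row in rows) / len(rows)


def phi_coefficient(rows, a, b):
    """Phi correlation between two binary parameters (2x2 table)."""
    n11 = sum(1 for r in rows if r[a] == 1 and r[b] == 1)
    n10 = sum(1 for r in rows if r[a] == 1 and r[b] == 0)
    n01 = sum(1 for r in rows if r[a] == 0 and r[b] == 1)
    n00 = sum(1 for r in rows if r[a] == 0 and r[b] == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when one parameter never varies in the sample.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


print(proportion(SENTENCES, "adjacent_markers"))
print(phi_coefficient(SENTENCES, "adjacent_markers", "possessive_relation"))
```

With 47 parameters, the same function could be applied to every pair; parameters with more than two values would need a different measure.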
Xavier is designed for automatically extracting linguistic information from treebanks. Written in ANSI C++, Xavier offers a research environment for those who need a general, multi-purpose extraction tool.
◎ SMT baselines (Sejong Bilingual Corpora)
These results, obtained from the Sejong Korean-English and Korean-Japanese Bilingual Corpora, were built using the GIZA++ and Moses toolkits (factorless, 5-gram language model).
◎ malign.py with Sejong Korean-Japanese (POS-tagged) Bilingual Corpus
These results were built using the malign.py module (iterations: 1,000; 5-gram language model).
◎ Several Downloadables
§ Semantic Hierarchy of Korean Adjectives
I built this in 2006 on the basis of the Yonsei Korean Dictionary. For more information, see my paper.
§ Toy Parser in Java
A rule-based Korean syntactic parser based on six major algorithms (Earley, Tomita, RTN, CKY, Chart, and Left-corner).
§ Distribution Table of Korean Verbs and Adjectives
This was extracted from the Sejong POS-tagged Corpora (12.5 million).
◎ Online Language Resources
§ The Sejong Electronic Dictionary - Adjectives
Here you can consult various features of Korean adjectives that the Sejong dictionary provides. This page renders the XML data through XSLT. (EUC-KR)
This document aims to introduce WordNet to beginners (e.g. undergraduate students). I have therefore tried to make this manual as readable as possible.
§ LOGON Bootable USB
This can be useful if you are a Windows user but want to set up your own Linux environment for LKB. You can make your USB storage Linux-bootable and run LKB from it.
Korean [▶ DOWNLOAD]
English [▶ DOWNLOAD]
§ Starting LOGON
With the above LOGON bootable USB, you can run LOGON, test several grammars, or write your own grammar.
Korean [▶ DOWNLOAD]