Research

Semantic Relation Representations of Chinese Noun Compounds in Transformer-based Language Models (in progress)

As a member of the project Modeling Generalized Event Knowledge for Noun Compound Interpretation and Prediction with Vector Spaces and Transformers (GRF, PolyU15612222)
A Chinese noun-noun compound dataset with semantic relation annotation
Investigation of semantic relation encodings in Transformer-based language models

OCR, preprocessing, and bilingual alignment have been completed for Red Sorghum, a total of 1.7 million Xibe tokens.
The remaining materials are still being processed.

Universal Dependencies Treebank for Xibe (XDT)
- We created the first syntactically annotated treebank for written Xibe, a member of the southern group of Tungusic languages. Sentences were collected from General Introduction to Xibe Grammar by Setuken (锡伯语语法通论, 佘土肯，2009), Cabcal News (ᠴᠠᠯᠴᠠᠯ ᠰᡝᠷᡣᡞᠨ), and Xibe textbook ᠨᡞᠶᠠᠮᠠᡢᡤᠠ ᡤᡞᠰᡠᠨ (3-6).
- The treebank contains 1,202 trees in total. We will release the first part of 810 trees in Universal Dependencies v2.9.
- This project is advised by Prof. Sandra Kübler and Prof. Francis Tyers.
- For more details about the treebank, please refer to our published paper.
Cross-lingual Dependency Parsing
- Human annotation is both time-consuming and labor-intensive. We are therefore looking for methods to obtain more parsed trees using the current small Xibe treebank together with UD treebanks for high-resource languages.

This project is a collaboration between Indiana University and Renmin University of China. We aim to compare stylistic and syntactic features of original Chinese and translated Chinese using machine learning methods. As a first step toward a better understanding of translationese, we are creating a constituency treebank that covers multiple genres.
I am a member of the treebank group (Hai Hu, Yanting Li, Yina Ma, Zuoyu Tian, Yiwen Zhang).
This project is advised by Prof. Charles Chien-Jer Lin and Prof. Sandra Kübler.

This project is part of joint work with Hai Hu, Yanting Li, Yina Ma, Zuoyu Tian, and Yiwen Zhang.
Inspired by the English HANS dataset, we observed biases in the Original Chinese NLI dataset (OCNLI), summarized them as surface syntactic heuristic rules, and automatically generated more than 2k premise-hypothesis pairs. We then tested various monolingual and multilingual pre-trained language models on the resulting dataset.
For more details, please refer to the paper.

This project aims to investigate the tonal system of Heze Chinese and its patterns of Tone 4 sandhi.
We collected speech data produced by native Heze Chinese speakers and extracted tones using the ProsodyPro script.
Collaborators: Zuoyu Tian, Trey Jagiella
The paper won the Household Best Paper Award of the Department of Linguistics and was presented as a poster at the 24th Annual Mid-Continental Phonetics and Phonology Conference (poster).

This is a shared task in the VarDial 2019 campaign aiming to distinguish between Mainland Chinese and Taiwan-styled Chinese.
Our IUCL system ranked 1st and 2nd on the two tracks, respectively.
Collaborators: Hai Hu, Wen Li, Zuoyu Tian, Yiwen Zhang, Liang Zou
For more details, please refer to our system paper.

This project is sponsored by the China Patent Information Center (CPIC) and is a collaboration between CPIC and the Institute of Chinese Information Processing at Beijing Normal University.
My work focused on predicate identification in Chinese sentences, including the identification and syntax-based reordering of verb phrases and v+n compound nouns. The work was advised by Prof. Yaohong Jin.
For more details, please refer to the paper.