Research

Large Language Model-based Studies on Chinese Noun Compound Interpretation (in progress)

  • TBA

Dependency Treebank and Parsing for the Xibe Language (in progress)

  • Universal Dependencies Treebank for Xibe (XDT)
    • We created the very first syntactically annotated treebank for written Xibe, a language that belongs to the southern group of Tungusic languages. Sentences were collected from General Introduction to Xibe Grammar by Setuken (锡伯语语法通论, 佘土肯, 2009), Cabcal News (ᠴᠠᠯᠴᠠᠯ ᠰᡝᠷᡣᡞᠨ), and the Xibe textbook ᠨᡞᠶᠠᠮᠠᡢᡤᠠ ᡤᡞᠰᡠᠨ (3-6).
    • The treebank contains 1,202 trees in total. We will release the first part of 810 trees in Universal Dependencies v2.9.
    • This project is advised by Prof. Sandra Kübler and Prof. Francis Tyers.
    • For more details about the treebank, please refer to our published paper.
  • Cross-lingual Dependency Parsing
    • Human annotation is both time-consuming and labor-intensive. We therefore look for methods to obtain more parsed trees using the current small Xibe treebank together with UD treebanks of high-resource languages.

Constructing a Multi-genre Treebank of Translated and Non-translated Chinese (in progress)

  • This project is a collaboration between Indiana University and Renmin University of China. We aim to compare stylistic and syntactic features of original Chinese and translated Chinese using machine learning methods. To get a better understanding of translationese, we first create a constituency treebank covering multiple genres.
  • I am a member of the treebank group (Hai Hu, Yanting Li, Yina Ma, Zuoyu Tian, Yiwen Zhang).
  • This project is advised by Prof. Charles Chien-Jer Lin and Prof. Sandra Kübler.

Heuristics Analysis for Chinese NLI Systems (Chinese HANS)

  • This project is joint work with Hai Hu, Yanting Li, Yina Ma, Zuoyu Tian, and Yiwen Zhang.
  • Inspired by the English HANS, we observed biases in the Original Chinese NLI dataset (OCNLI), summarized them as surface syntactic heuristic rules, and automatically generated more than 2k premise-hypothesis pairs. We then tested various monolingual and multilingual pre-trained language models on the resulting dataset.
  • For more details, please refer to the paper.

Tone 4 Sandhi in Heze Chinese

  • This project investigates the tonal system of Heze Chinese and its patterns of Tone 4 sandhi.
  • We collected speech data produced by native speakers of Heze Chinese and extracted tones with the ProsodyPro script.
  • Collaborators: Zuoyu Tian, Trey Jagiella
  • The paper won the Household Best Paper Award of the Department of Linguistics and was presented as a poster at the 24th Annual Mid-Continental Phonetics and Phonology Conference (poster).

Similar Language Classification

  • This is a shared task in the VarDial 2019 campaign that aims to distinguish between Mainland Chinese and Taiwan-style Chinese.
  • Our IUCL system ranked 1st and 2nd on the two tracks, respectively.
  • Collaborators: Hai Hu, Wen Li, Zuoyu Tian, Yiwen Zhang, Liang Zou
  • For more details, please refer to our system paper.

Chinese-English Rule-based Machine Translation System for Patent Texts

  • This project is sponsored by the China Patent Information Center (CPIC) and is a collaboration between CPIC and the Institute of Chinese Information Processing of Beijing Normal University.
  • My work focused on predicate identification in Chinese sentences, including the identification and syntax-based reordering of verb phrases and v+n compound nouns. The work was advised by Prof. Yaohong Jin.
  • For more details, please refer to the paper.