The talks are scheduled as follows:
1. 刘远超
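Abstract: Keywords, like text summaries, are an important means of improving the efficiency of information access. Research on automatic keyword extraction is also valuable for text classification, text clustering, and information retrieval. Keyphrases carry more information than single keywords and better reflect the topic of the source text. However, keyphrase indexing currently lacks unified guiding rules, and phrases obtained through n-gram or chunk extraction do not follow the same composition patterns as keyphrases. Exploiting the strengths of rough set theory in data generalization and knowledge reduction, this work mines the composition patterns of keyphrases and derives general composition rules for Chinese keyphrases. The mined rules can be used for automatic keyword extraction and can also guide manual keyword indexing. The keyword extraction system combines the keyphrase composition rules with an assessment of keyword importance to extract keyphrases from Chinese text automatically, and achieves fairly good results.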
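The combination of composition rules and importance scoring described in this abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the part-of-speech patterns in ALLOWED_PATTERNS are hypothetical stand-ins for the rules mined with rough set theory, the importance measure is a plain sum of word frequencies, and the function name extract_keyphrases is invented here. It is not the speaker's system.

```python
from collections import Counter

# Hypothetical part-of-speech patterns standing in for mined composition rules.
ALLOWED_PATTERNS = {("n",), ("n", "n"), ("a", "n"), ("v", "n")}


def extract_keyphrases(tagged_tokens, top_k=3, max_len=2):
    """Rank candidate phrases from one document.

    tagged_tokens: list of (word, pos) pairs for a segmented, tagged document.
    """
    word_freq = Counter(word for word, _ in tagged_tokens)
    scores = {}
    for n in range(1, max_len + 1):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            pattern = tuple(pos for _, pos in window)
            # Step 1: keep only candidates whose tag sequence matches a rule.
            if pattern not in ALLOWED_PATTERNS:
                continue
            phrase = " ".join(word for word, _ in window)
            # Step 2: importance is the summed frequency of the member words
            # (an assumed, deliberately simple measure).
            scores[phrase] = sum(word_freq[word] for word, _ in window)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


tokens = [("rough", "a"), ("set", "n"), ("theory", "n"),
          ("reduces", "v"), ("rough", "a"), ("set", "n")]
top_phrases = extract_keyphrases(tokens, top_k=2)
```

The toy input uses English tokens joined by spaces; for Chinese text the words would come from a segmenter and would normally be joined without a separator.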
2. Talk by 刘桃
Title: Domain-Specific Term Extraction and Its Application in Text Classification
Abstract: A statistical method is proposed for extracting domain-specific terms from domain-comparative corpora. It takes into account both the distribution of a word across domains and its distribution within a domain. A normalization step is added to the extraction process to cope with unbalanced corpora. The method therefore characterizes the attributes of domain-specific terms more precisely and more effectively than previous approaches. The extracted domain-specific terms are then used as the feature space for text classification. Experiments show that this yields better performance than traditional feature selection methods.
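The abstract does not give the scoring formula, so the following Python sketch only illustrates the general idea under stated assumptions. The across-domain component is taken to be a word's share of relative frequency in one domain versus all domains, the within-domain component is the fraction of that domain's documents containing the word, and normalization is done by using relative frequencies rather than raw counts so that a larger corpus does not dominate. The function name domain_term_scores and the product form of the score are hypothetical.

```python
from collections import Counter


def domain_term_scores(corpora):
    """Score words per domain.

    corpora: dict mapping a domain name to a list of documents,
             where each document is a list of tokens.
    """
    rel_freq = {}  # domain -> word -> relative frequency in that domain
    doc_frac = {}  # domain -> word -> fraction of documents containing it
    for domain, docs in corpora.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        # Normalization: relative frequency makes unbalanced corpora comparable.
        rel_freq[domain] = {w: c / total for w, c in counts.items()}
        df = Counter(w for doc in docs for w in set(doc))
        doc_frac[domain] = {w: c / len(docs) for w, c in df.items()}

    scores = {}
    for domain in corpora:
        scores[domain] = {}
        for word, rf in rel_freq[domain].items():
            # Across-domain concentration: 1.0 if the word occurs only here.
            across = sum(rel_freq[d].get(word, 0.0) for d in corpora)
            # Within-domain spread: high if the word occurs in most documents.
            scores[domain][word] = (rf / across) * doc_frac[domain][word]
    return scores


corpora = {
    "sports": [["goal", "team", "the"], ["goal", "the"]],
    "finance": [["stock", "the"], ["stock", "bank", "the"],
                ["bank", "the"], ["the", "market"]],
}
scores = domain_term_scores(corpora)
```

In this toy data "goal" occurs only in the sports documents and in all of them, so it scores highest there, while "the" is spread across both domains and scores lower. The top-scoring words per domain could then serve as classification features.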
3. 陈燕敏
Title: Research on Multi-Document Summarization
Abstract: Introduces a method for generating a summary of a set of documents.