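Text Mining Techniques (04812052)

Instructor:	Jianwu Yang (杨建武)	yangjw@pku.edu.cn
Teaching assistant:	Yihong Hong (洪毅虹)	hongyihong@pku.edu.cn
Class time:	Thursdays, periods 10-12 (18:40-21:30)
Location:	Teaching Building No. 2 (二教), Room 316
Course homepage:	http://www.icst.pku.edu.cn/course/Mining/12-13Spring/index.html

With the development of information technology, the volume of information is growing explosively, and 80% of it is unstructured. How to analyze and process this unstructured information effectively has become a research focus in both academia and industry. This course is an integrative, research-oriented technical course that draws on data mining, information retrieval, and natural language processing. It helps students of our institute follow the research frontier and practice applying foundational knowledge in combination, which is valuable for later academic research or applied technology development.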

Assessment and Assignments

Assessment guidelines: click to download

Hashtag annotation: click to enter

Assignment submission: click to submit

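Course Schedule and Lecture Notes

2013-02-28	Chapter 1	Introduction
2013-03-07	Chapter 2	Text Feature Extraction
2013-03-14	Chapter 3	Text Retrieval (part 1)
2013-03-21	Chapter 3	Text Retrieval (part 2)
2013-03-28	Chapter 4	Automatic Text Classification
2013-04-04	(no class)	(Qingming Festival)
2013-04-11	Chapter 5	Automatic Text Clustering
2013-04-18	Chapter 6	Topic Detection and Tracking
2013-04-25	Chapter 7	Text Filtering
2013-04-25	Chapter 8	Association Analysis
2013-05-02	(no class)	(course project work)
2013-05-09	Chapter 9	Automatic Document Summarization
2013-05-09	Chapter 10	Information Extraction
2013-05-16	Chapter 11	Question Answering
2013-05-23	Chapter 12	Text Sentiment Analysis
2013-05-30	Chapter 13	Ontology
2013-06-06	Chapter 14	Mining Semi-structured Text
2013-06-06	Chapter 15	Text Mining Tools and Applications
2013-06-13	Project reports and review	Course project presentations, review, and Q&A (review slides)
2013-06-20	Exam	Time: 18:30-20:30; Location: Teaching Building No. 2, Room 306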

Chapter Descriptions

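Chapter 1: Introduction (2 class hours)
  Objective: introduce the concept of text mining and the main content of this course
  Topics:
    - The concept of text mining and why it matters as a research area
    - The main areas of text mining
    - Overall plan of the course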

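Chapter 2: Text Feature Extraction (4 class hours)
  Objective: explain how to convert a document into a set of mathematical features
  Topics:
    - Word segmentation methods
    - Phrase recognition and named entity recognition
    - Vector space model and probabilistic models
    - Measuring similarity between documents
    - Latent semantic analysis (LSA)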
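As a rough illustration of the vector space model and document similarity topics listed for Chapter 2, the sketch below represents each document as a sparse term-frequency vector and compares documents by cosine similarity. It is not taken from the course materials: the function names `tf_vector` and `cosine_similarity` and the sample sentences are hypothetical, and whitespace splitting stands in for a real word segmenter (Chinese text would need proper segmentation first).

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Map a list of tokens to a sparse term-frequency vector (term -> count)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, tokenized by whitespace for simplicity.
doc1 = tf_vector("text mining finds patterns in text".split())
doc2 = tf_vector("data mining finds patterns in data".split())
doc3 = tf_vector("the weather is sunny today".split())

sim_12 = cosine_similarity(doc1, doc2)  # documents sharing several terms
sim_13 = cosine_similarity(doc1, doc3)  # documents sharing no terms
```

Raw term counts are used here to keep the example short; a weighting scheme such as TF-IDF could be substituted without changing the similarity function.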

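Chapter 3: Text Retrieval (6 class hours)
  Objective: introduce the main types and techniques of text retrieval
  Topics:
    - Information retrieval models
    - Evaluation of retrieval quality
    - Inverted indexes and full-text retrieval
    - Overview of full-text retrieval systems
    - High-dimensional indexing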
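To make the inverted index topic of Chapter 3 concrete, here is a minimal sketch that maps each term to the set of documents containing it and answers a conjunctive (AND) query by intersecting posting sets. The names `build_inverted_index` and `search_and` and the sample documents are hypothetical, and the sketch ignores everything a production system would add (segmentation, ranking, positional postings, compression).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> set of doc ids from a dict of doc id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries for unseen terms in the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy collection.
docs = {
    1: "text retrieval with inverted index",
    2: "full text search systems",
    3: "high dimensional index",
}
index = build_inverted_index(docs)
```

For example, `search_and(index, "text index")` returns only the documents that contain both terms.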

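Chapter 4: Automatic Text Classification (3 class hours)
  Objective: introduce automatic text classification techniques
  Topics:
    - Overview of automatic classification algorithms
    - KNN text classification
    - SVM text classification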
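The KNN text classification topic of Chapter 4 can be sketched as follows: score every labeled training document against the query by cosine similarity over term-frequency vectors, then take a majority vote among the k most similar. This is only an illustrative sketch under simple assumptions; `knn_classify`, the toy training set, and the choice of k=3 are hypothetical, and the training list is assumed to be non-empty.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(train, text, k=3):
    """Label `text` by majority vote among its k most similar training docs.

    `train` is a non-empty list of (document text, label) pairs.
    """
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label)
         for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two classes.
train = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("market investors buy shares", "finance"),
    ("team wins football match", "sports"),
    ("player scores in final match", "sports"),
    ("coach praises football team", "sports"),
]
```

The sketch recomputes every training vector per query, which is fine for a toy set; a real system would precompute the vectors and typically use an index to find neighbors.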

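Chapter 5: Automatic Text Clustering (3 class hours)
  Objective: introduce automatic text clustering techniques
  Topics:
    - Overview of clustering techniques
    - Partition-based K-Means text clustering and its variants
    - Hierarchical, density-based, and grid-based clustering
    - Suffix-tree-based text clustering algorithms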

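Chapter 6: Topic Detection and Tracking (3 class hours)
  Objective: introduce topic detection and tracking techniques
  Topics:
    - Concepts of topic detection and tracking (TDT)
    - Main algorithms for topic detection
    - Main algorithms for topic tracking
    - TDT evaluation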

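Chapter 7: Text Filtering (3 class hours)
  Objective: introduce text filtering techniques
  Topics:
    - Concepts of information filtering
    - Text filtering methods
    - Email filtering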

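Chapter 8: Association Analysis (1 class hour)
  Objective: introduce association analysis techniques
  Topics:
    - Main algorithms for association analysis
    - Applications of association analysis in text analysis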

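Chapter 9: Automatic Document Summarization (2 class hours)
  Objective: introduce text summarization techniques
  Topics:
    - The concept of document summarization
    - Evaluation of document summaries
    - Basic methods
    - Multi-document summarization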

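Chapter 10: Information Extraction (3 class hours)
  Objective: introduce information extraction techniques
  Topics:
    - Introduction to information extraction
    - Basic methods of information extraction
    - HMM & CRF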

Chapter 11: Question Answering (QA) (3 class hours)
  Objective: introduce question answering (QA) techniques
  Topics:
    - Concepts and history of question answering systems
    - QA@TREC
    - Pattern-based Approach
    - Knowledge-Based Approach
    - Web-based Approach

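Chapter 12: Text Sentiment Analysis (3 class hours)
  Objective: introduce text sentiment analysis techniques
  Topics:
    - Concepts of affective computing
    - Sentiment orientation of words and phrases
    - Sentiment orientation of documents and sentences
    - Opinion mining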

Chapter 13: Ontology (2 class hours)
  Objective: introduce the concepts of ontology and the Semantic Web
  Topics:
    - The concept of ontology
    - Semantic Web and OWL
    - WordNet
    - HowNet (知网)

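Chapter 14: Mining Semi-structured Text (1.5 class hours)
  Objective: introduce selected methods for mining semi-structured text
  Topics:
    - The concept of semi-structured data
    - Describing and exploiting structural information
    - XML text mining methods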

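Chapter 15: Text Mining Tools and Applications (1.5 class hours)
  Objective: introduce current text mining tools and typical application scenarios
  Topics:
    - Text mining in enterprise competitive intelligence systems
    - Text mining in enterprise customer relationship management
    - Text mining in online public-opinion early-warning systems
    - IBM Text Miner
    - SAS Text Miner
    - Autonomy IDOL Server
    - 方正智思 (Founder) Chinese text mining platform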

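Course Project Presentations (3 class hours)
  Objective: in-class presentation and discussion of course projects
  Topics:
    - Each student presents their group's course project and their own main contribution
  Main references:
    - (none)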

References

Chapter 1: Introduction
  - M. Hearst. Untangling Text Data Mining. Proc. of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, 1999. (URL; .mht)
  - Anna Stavrianou, Periklis Andritsos, Nicolas Nicoloyannis. Overview and Semantic Issues of Text Mining. SIGMOD Record, 36(3): 23-34, September 2007. (.pdf)
  - Wikipedia: http://en.wikipedia.org/wiki/Text_mining (.mht)

Chapter 2: Text Feature Extraction
  - 孙茂松, 邹嘉彦. 汉语自动分词研究评述. 《当代语言学》, 2001(1): 22-32. (.pdf)
  - M. Hearst. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1): 33-64, March 1997. (.pdf)
  - J.G. Conrad and M.H. Utt. A system for discovering relationships by feature extraction from text databases. In SIGIR-94 conference proceedings. (.pdf)
  - G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5): 513-523, 1988.
  - S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407, 1990. (.pdf)
  - svdpack: http://www.netlib.org/svdpack/
  - Thomas Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42: 177-196, 2001. (.pdf)
  - David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993-1022, 2003. (.pdf)

Chapter 3: Text Retrieval
  - Smart retrieval system: ftp://ftp.cs.cornell.edu/pub/smart/
  - Okapi system: http://www.soi.city.ac.uk/~andym/OKAPI-PACK/index.html
  - Lemur Toolkit: http://www.lemurproject.org/
  - Lucene: http://jakarta.apache.org/lucene/
  - G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  - Christian Digout. Metric Techniques for High-Dimensional Indexing. Technical Report TR 04-19, September 2004. (.pdf)

Chapter 4: Automatic Text Classification
  - F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47, 2002. (.pdf)
  - M. Rogati and Y. Yang. High-performing feature selection for text classification. ACM CIKM 2002. (.pdf)
  - Tie-Yan Liu, Yiming Yang, Hao Wan, et al. Support Vector Machines Classification with Very Large Scale Taxonomy. SIGKDD Explorations, Special Issue on Text Mining and Natural Language Processing, 7(1): 36-43, 2005. (.pdf)
  - 苏金树, 张博锋, 徐昕. 基于机器学习的文本分类技术研究进展. 软件学报, 17(9): 1848-1859, 2006. (.pdf)
  - 基于统计学习理论的支持向量机算法研究. http://www.youngfan.com/nn/SVM.pdf (.pdf)
  - V. Vapnik (author), 张学工 (translator). 统计学习理论的本质 (The Nature of Statistical Learning Theory). 清华大学出版社, June 2004.
  - SVMlight: http://svmlight.joachims.org/
  - SVMTorch: http://bengio.abracadoudou.com/projects/SVMTorch.html

Chapter 5: Automatic Text Clustering
  - Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Higher Education Press, 2001. (Chinese translation by 范明, 孟小峰, et al.: 数据挖掘概念与技术. China Machine Press, August 2001.) (Chapter 8: Cluster Analysis)
  - Daniel Fasulo. An analysis of recent work on clustering algorithms. Technical Report UW-CSE-01-03-02, University of Washington, 1999. (.pdf)
  - O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration. Proceedings of the 21st International ACM SIGIR Conference (SIGIR'98), 1998. (.pdf)
  - Hua-Jun Zeng, et al. Learning to Cluster Web Search Results. SIGIR'04, 2004. (.pdf)

Chapter 6: Topic Detection and Tracking
  - TDT evaluation: http://www.nist.gov/speech/tests/tdt/index.htm
  - J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proc. of SIGIR Conference on Research and Development in Information Retrieval, 1998. (.pdf)
  - Y. Yang, T. Pierce, and J. G. Carbonell. A study on retrospective and on-line event detection. In Proc. of SIGIR Conference on Research and Development in Information Retrieval, 1998. (.pdf)
  - Y. Yang, J. Zhang, et al. Topic-conditioned novelty detection. In Proc. of the SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. (.pdf)
  - Jian Zhang, Zoubin Ghahramani, and Yiming Yang. A Probabilistic Model for Online Document Clustering with Application to Novelty Detection. In Proceedings of NIPS 2004, Vancouver, Canada, 2004. (.pdf)
  - Zhiwei Li, Bin Wang, Mingjing Li, et al. A Probabilistic Model for Retrospective News Event Detection. In Proc. of the SIGIR Conference on Research and Development in Information Retrieval, 2005. (.pdf)
  - D. Frey, R. Gupta, V. Khandelwal, V. Lavrenko, A. Leuski, and J. Allan. Monitoring the News: a TDT demonstration system. Demonstration appearing in Proceedings of the Human Language Technology Conference (HLT), 351-355, 2001. (.pdf)

Chapter 7: Text Filtering
  - TREC evaluation: http://trec.nist.gov/
  - 黄萱菁, 夏迎炬, 吴立德. 基于向量空间模型的文本过滤系统. 软件学报, 2003, 14(3): 435-442. (.pdf)
  - Andrej Bratko, Gordon V. Cormack, Bogdan Filipic, Thomas R. Lynam, and Blaz Zupan. Spam Filtering Using Statistical Data Compression Models. Journal of Machine Learning Research, Dec. 2006, pp. 2673-2698. (.pdf)

Chapter 8: Association Analysis
  - Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Higher Education Press, 2001. (Chinese translation by 范明, 孟小峰, et al.: 数据挖掘概念与技术. China Machine Press, August 2001.) (Chapter 6)
  - 钱铁云, 王元珍, 冯小年. 结合类频率的关联中文文本分类. 中文信息学报, 2004, 18(6): 30-36. (.pdf)
  - 宋擒豹, 沈钧毅. 基于关联规则的Web文档聚类算法. 软件学报, 2002, 13(03): 417-423. (.pdf)

Chapter 9: Automatic Document Summarization
  - http://duc.nist.gov/
  - http://www.summarization.com/
  - http://www.summarization.com/mead/
  - J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of SIGIR'1995, 68-73, 1995. (.pdf)
  - T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of SIGIR'2001, 26-34, 2001. (.pdf)
  - D. R. Radev, H. Y. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40: 919-938, 2004. (.pdf)
  - D. Radev, J. Otterbacher, A. Winkel, S. Blair-Goldensohn. NewsInEssence: Summarizing Online News Topics. Communications of the ACM, 48(10): 95-98, 2005. (.pdf)
  - Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations. In Proceedings of SIGIR'2007. (.pdf)
  - Xiaojun Wan and Jianwu Yang. Multi-Document Summarization Using Cluster-Based Link Analysis. In Proceedings of SIGIR'2008. (.pdf)

Chapter 10: Information Extraction
  - http://www-nlpir.nist.gov/related_projects/muc/
  - http://www.ai.sri.com/~appelt/ie-tutorial/
  - Kushmerick, N., Weld, D., and Doorenbos, B. Wrapper Induction for Information Extraction. PhD thesis 99. (.pdf)
  - Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning Hidden Markov Model Structure for Information Extraction. AAAI-99. (.pdf)
  - Fabio Ciravegna. Adaptive Information Extraction from Text by Rule Induction and Generalisation. In IJCAI 2001. (.pdf)
  - http://www.inference.phy.cam.ac.uk/hmw26/crf/#software

Chapter 11: Question Answering (QA)
  - http://trec.nist.gov/data/qamain.html
  - Burger, J., Cardie, C., Chaudhri, V., et al. Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A). TREC 2001. (.pdf)
  - Sanda M. Harabagiu, Dan I. Moldovan, et al. Answering Complex, List and Context Questions with LCC's Question-Answering Server. TREC 2001. (.pdf)
  - Dumais, S., Banko, M., Brill, E., Lin, J., and Ng, A. Web Question Answering: Is More Always Better? In SIGIR, pages 291-298, Aug. 2002. (.pdf)
  - Ellen M. Voorhees. Overview of the TREC 2005 question answering track. In Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005). (.pdf)

Chapter 12: Text Sentiment Analysis
  - Bing Liu. From Web Content Mining to Natural Language Processing (4. Opinion Mining and Summarization). ACL-2007 Tutorial. (.pdf)
  - Peter D. Turney and Michael L. Littman. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21(4): 315-346, Oct. 2003. (.pdf)
  - Bing Liu et al. Opinion Observer: Analyzing and Comparing Opinions on the Web. In Proceedings of WWW'2005, 342-351, 2005. (.pdf)
  - Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2): 1-135, 2008. (.pdf)

Chapter 13: Ontology
  - OWL: http://www.w3.org/TR/owl-features/
  - WordNet: http://wordnet.princeton.edu/
  - HowNet (知网): http://www.keenage.com
  - M. Hearst. Automated discovery of WordNet relations. In Fellbaum, C., editor, WordNet: An electronic lexical database. MIT Press, 1998. (.pdf)
  - 董振东, 董强. 《知网》 (HowNet). (.pdf)

Chapter 14: Mining Semi-structured Text
  - http://www.w3.org/XML/
  - http://www.inex.otago.ac.nz
  - http://xmlmining.lip6.fr/
  - Shlomo Geva, Jaap Kamps, Andrew Trotman (eds.). INEX 2008 Workshop Pre-Proceedings. 2008. (.pdf)
  - Alsayed Algergawy, Richi Nayak, and Marco Mesiti. XML data clustering: An overview. ACM Computing Surveys (CSUR), 43(4), 2011. (.pdf)