Text Retrieval and Mining # 11-1
Information Extraction
Lecture by Young Hwan CHO, Ph. D.
Youngcho@gmail.com
Page 2
Plan for Today
Page 3
What is Information Extraction?
Items of Information: Definitions (Percentile Reliability)
Entities: an object of interest such as a person or organization (90)
Attributes: a property of an entity such as its name, alias, descriptor, or type (80)
Facts: a relationship held between two or more entities (70)
Events: an activity or occurrence of interest such as a terrorist act or an airline crash (60)
Page 4
IE from the Web: The Big Picture
Page 5
Components of Information Extraction
Page 6
Examples : Corpus
Page 7
Examples : Entity
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999
Page 8
Examples : Attributes
NAME: Fletcher Maddox, Maddox
DESCRIPTOR: former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY: PERSON

NAME: Oliver
DESCRIPTOR: His son, Chief Scientist
CATEGORY: PERSON

NAME: Ambrose
DESCRIPTOR: Oliver's brother, the CFO of L.J.G.
CATEGORY: PERSON

NAME: UCSD Business School
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: La Jolla Genomatics, L.J.G.
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: Geninfo
DESCRIPTOR: its product
CATEGORY: ARTIFACT

NAME: La Jolla
DESCRIPTOR: the Maddox family's hometown
CATEGORY: LOCATION

NAME: CA
DESCRIPTOR:
CATEGORY: LOCATION
Page 9
Examples : Facts
PERSON | Employee_of | ORGANIZATION
Fletcher Maddox | Employee_of | UCSD Business School
Fletcher Maddox | Employee_of | La Jolla Genomatics
Oliver | Employee_of | La Jolla Genomatics
Ambrose | Employee_of | La Jolla Genomatics

ARTIFACT | Product_of | ORGANIZATION
Geninfo | Product_of | La Jolla Genomatics

LOCATION | Location_of | ORGANIZATION
La Jolla | Location_of | La Jolla Genomatics
CA | Location_of | La Jolla Genomatics
Page 10
Examples : Events
COMPANY: La Jolla Genomatics
PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
DATE:
CAPITAL:

COMPANY: La Jolla Genomatics
PRODUCT: Geninfo
DATE: June 1999
COST:
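
The four levels shown on the example slides (entities, attributes, facts, events) can be pictured as simple data structures. The following is only a sketch: the class names `Entity`, `Fact`, and `Event` and the helper `employees_of` are hypothetical names chosen for illustration, not part of any system described in the lecture. Empty slots on the slides are represented as `None`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An object of interest plus its attributes (hypothetical layout)."""
    name: str
    category: str              # PERSON, ORGANIZATION, LOCATION, ARTIFACT
    aliases: tuple = ()        # other NAME values for the same entity
    descriptors: tuple = ()    # DESCRIPTOR values


@dataclass(frozen=True)
class Fact:
    """A relationship held between two entities."""
    subject: Entity
    relation: str              # e.g. Employee_of, Product_of, Location_of
    target: Entity


@dataclass
class Event:
    """A template whose slots are filled with entities or values."""
    slots: dict


ucsd = Entity("UCSD Business School", "ORGANIZATION")
ljg = Entity("La Jolla Genomatics", "ORGANIZATION", aliases=("L.J.G.",))
fletcher = Entity(
    "Fletcher Maddox", "PERSON", aliases=("Maddox",),
    descriptors=("former Dean of the UCSD Business School",
                 "his father", "the firm's CEO"),
)
oliver = Entity("Oliver", "PERSON", descriptors=("His son", "Chief Scientist"))
ambrose = Entity("Ambrose", "PERSON",
                 descriptors=("Oliver's brother", "the CFO of L.J.G."))
geninfo = Entity("Geninfo", "ARTIFACT", descriptors=("its product",))

facts = [
    Fact(fletcher, "Employee_of", ucsd),
    Fact(fletcher, "Employee_of", ljg),
    Fact(oliver, "Employee_of", ljg),
    Fact(ambrose, "Employee_of", ljg),
    Fact(geninfo, "Product_of", ljg),
]

# The second event template from the slide; COST is left unfilled there.
release = Event({"COMPANY": ljg, "PRODUCT": geninfo,
                 "DATE": "June 1999", "COST": None})


def employees_of(org, facts):
    """Return the names of all entities linked to `org` by Employee_of."""
    return [f.subject.name for f in facts
            if f.relation == "Employee_of" and f.target == org]
```

Calling `employees_of(ljg, facts)` walks the fact list and collects the three people the slide links to La Jolla Genomatics, which shows how facts build on entities and events build on both.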
Page 11
Unstructured Data -> Structured/Semi-Structured Data
Page 12
Source Styles
Page 13
Segmentation
Page 14
Clustering + Classification
Page 15
Association
Page 16
Global vs. Local Extractions
Page 17
Information Extraction in Real
Page 18
Extracting Corporate Information
Data automatically extracted from marketsoft.com
Source web page. Color highlights indicate type of information (e.g., red = name).
E.g., information need: Who is the CEO of MarketSoft?
Source: Whizbang! Labs / Andrew McCallum
Page 19
Product information
Page 20
Product information
Page 21
Canonicalization: Product information
Page 22
Wrappers
Page 23
Amazon Book Description
….
</td></tr>
</table>
<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90
height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small">
<span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b>
(20%)</font><br>
</span>
<p> <br>…
Page 24
Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:
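
A hand-written wrapper that fills this template from the page excerpt might look like the sketch below. It assumes the page keeps exactly the markup shown in the excerpt; the `PATTERNS` table and the function `extract_book` are hypothetical names for illustration, and the regular expressions are tied to this one page layout rather than being a general Amazon parser.

```python
import re

# Trimmed copy of the page excerpt shown on the previous slide.
PAGE = '''<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
'''

# One pattern per template slot; each captures the slot value in group 1.
PATTERNS = {
    "Title": r'<b class="sans">(.*?)</b>',
    "Author": r'by <a href="[^"]*">\s*(.*?)</a>',
    "List-Price": r'<span class=listprice>(\$[\d.]+)</span>',
    "Price": r'Our Price: <font color=#990000>(\$[\d.]+)</font>',
}


def extract_book(html):
    """Fill the book template; a slot is None when its pattern fails."""
    template = {}
    for slot, pattern in PATTERNS.items():
        match = re.search(pattern, html, flags=re.S)
        template[slot] = match.group(1).strip() if match else None
    return template


book = extract_book(PAGE)
```

Because every pattern depends on exact tags and attribute values, a small change in the page layout breaks the wrapper, which is the motivation for learning wrappers automatically.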
Page 25
Wrappers: Simple Extraction Patterns
Page 26
Wrapper induction
Highly regular source documents
Relatively simple extraction patterns
Efficient learning algorithm
Page 27
Wrapper induction: Delimiter-based extraction
Use <B>, </B>, <I>, </I> for extraction
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
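
Delimiter-based extraction on this page can be sketched as a loop that repeatedly scans for a left delimiter and then for the matching right delimiter. The function name `execute_lr` and the list-of-pairs representation of the wrapper are assumptions made for this illustration.

```python
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>\n"
)


def execute_lr(page, delimiters):
    """Extract tuples using a list of (left, right) delimiter pairs.

    For each attribute in turn, skip forward to the left delimiter,
    then take everything up to the right delimiter. Stop when a
    delimiter can no longer be found.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows          # incomplete tuple: discard it
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


countries = execute_lr(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

With the four delimiters from the slide, the loop yields one (country, code) pair per line and stops at `</BODY>`, where no further `<B>` is found.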
Page 28
Learning LR wrappers
labeled pages -> wrapper (l1, r1, …, lK, rK)
Example: Find 4 strings (<B>, </B>, <I>, </I>) for (l1, r1, l2, r2)
Labeled page (shown four times on the slide):
<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
Page 29
LR: Finding r1
<HTML><TITLE>Some
Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
r1 can be any common prefix of the text that follows each instance of the first attribute (the country name), e.g. </B>
Page 30
LR: Finding l1, l2 and r2
<HTML><TITLE>Some
Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
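r2 can be any common prefix of the text that follows each instance of the second attribute (the code), e.g. </I>
l2 can be any common suffix of the text that precedes each code, e.g. <I>
l1 can be any common suffix of the text that precedes each country name, e.g. <B>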
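
The prefix and suffix reasoning above can be sketched as a small learner. This is a simplified illustration, not the exact algorithm from the wrapper induction literature: it handles a single labeled page, and the validity checks are assumptions chosen for the sketch (a right delimiter must not occur inside a labeled value; a left delimiter must first occur at the very end of the text preceding the value). All function names are hypothetical.

```python
import os

PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>\n"
)
LABELS = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]


def contexts(page, rows):
    """Collect the text before and after every labeled value, per attribute."""
    k_count = len(rows[0])
    before = [[] for _ in range(k_count)]
    after = [[] for _ in range(k_count)]
    pos, prev = 0, None
    for row in rows:
        for k, value in enumerate(row):
            start = page.index(value, pos)
            gap = page[pos:start]
            before[k].append(gap)
            if prev is not None:
                after[prev].append(gap)
            pos, prev = start + len(value), k
    after[prev].append(page[pos:])      # tail after the last value
    return before, after


def common_suffix(strings):
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def right_candidates(gaps, values):
    """Prefixes of the common following text that never occur in a value."""
    prefix = os.path.commonprefix(gaps)
    cands = [prefix[:i] for i in range(1, len(prefix) + 1)]
    return [c for c in cands if not any(c in v for v in values)]


def left_candidates(gaps):
    """Suffixes of the common preceding text, found first at the gap's end."""
    suffix = common_suffix(gaps)
    cands = [suffix[i:] for i in range(len(suffix) - 1, -1, -1)]
    return [c for c in cands
            if all(g.find(c) == len(g) - len(c) for g in gaps)]


def learn_lr(page, rows):
    """Pick the shortest valid (left, right) pair for each attribute.

    Raises IndexError if some attribute has no valid candidate.
    """
    before, after = contexts(page, rows)
    wrapper = []
    for k in range(len(rows[0])):
        values = [row[k] for row in rows]
        wrapper.append((left_candidates(before[k])[0],
                        right_candidates(after[k], values)[0]))
    return wrapper


wrapper = learn_lr(PAGE, LABELS)
```

The delimiters named on the slides, `<B>`, `</B>`, `<I>`, and `</I>`, all appear among the candidate lists; because this sketch prefers the shortest valid candidate, it settles on shorter strings that delimit the same values on this page.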
Page 31
Wrapper Generator: Overall Flow and Domain Knowledge Representation
Page 32
Wrapper Generator: Preprocessing - Logical Line Generation
Page 33
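Wrapper Generator: Semantic Analysis of Logical Lines Using Domain Knowledge
For each OBJECT in the domain knowledge, search the logical lines for its pattern and record the FORMAT that matches.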
Page 34
XML Rule Generation
Page 35
Natural Language Processing-based IE
Page 36
Finite state automata transductions
[Figure: finite-state automaton with states 0-4; arcs labeled PN, ’s, Art, ADJ, N, P]
Example: John’s interesting book with a nice cover
Pattern matching:
PN ’s (ADJ)* N P Art (ADJ)* N
{PN ’s | Art}(ADJ)* N (P Art (ADJ)* N)*
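
The second pattern can be tried out as a regular expression over a sequence of part-of-speech tags. The sketch below assumes the input has already been tagged by hand with the slide's tag names (PN, 's, Art, ADJ, N, P); no tagger is included, and `NP_PATTERN` and `matches_np` are hypothetical names. The braces on the slide are treated as grouping.

```python
import re

# {PN 's | Art}(ADJ)* N (P Art (ADJ)* N)* written over space-separated tags.
NP_PATTERN = re.compile(r"(?:PN 's|Art)(?: ADJ)* N(?: P Art(?: ADJ)* N)*")


def matches_np(tags):
    """Return True if the whole tag sequence is accepted by the pattern."""
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None


# Hand-assigned tags for "John's interesting book with a nice cover".
example = ["PN", "'s", "ADJ", "N", "P", "Art", "ADJ", "N"]
accepted = matches_np(example)
```

Using `fullmatch` on the joined tag string mirrors running the automaton from its start state to an accepting state over the entire input, so a sequence such as `["ADJ", "N"]`, which lacks the leading possessive or article, is rejected.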
Page 37
Rule-based Extraction Examples
Page 38
Three generations of IE systems
Page 39
Evaluating IE Accuracy
Page 40
MUC: the genesis of IE
Page 41
MUC Information Extraction: State of the Art c. 1997
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Page 42
Basic IE References
Dun & Bradstreet is the oldest, largest, and most established seller of business information in the world. They maintain a database of all 11M US companies, and they do it very inefficiently: by phone calls.
We are extracting basic company identification information, such as name, address, phone, fax, and email, from over 10M domain names.
Again, the original page is on the left, with markup showing where WB extracted the database fields, which are shown on the right.
Again, formatting and position on the page are very indicative here. The relative position of entities says something about how they go together: which person with which title, etc.