IRS Module 4
Text Processing
CHAPTER 4
Syllabus
Text and Multimedia languages and properties: Metadata, Markup Languages, Multimedia; Text Operations: Document Preprocessing, Document Clustering.
Self-learning Topics: Digital Library: Greenstone.
The document designates a single unit of information.
A document is a piece of text in digital or other form.
A document can be any physical item (a file, an email) or a fully formed
[Figure: a document combines text, structure and other media, described by its syntax and semantics]
Fig. 4.1.1: Characteristics of a document
The document syntax expresses structure, presentation style, semantics or external actions.
One or more of these elements may be given together or be implicit in the document's content.
A structural element (such as a section) can have a fixed formatting style.
The syntax can be implicit in the content, expressed in a simple declarative language, or expressed in a programming language.
Fig. 4.1.1 gives all the characteristics of a document.
Syntax languages may be proprietary and specific but open and generic
languages are more flexible.
Text can be written in natural language, which is difficult for computers to process.
[Figure: sample book page annotated with presentation styles]
Chapter 6 (large font)
Document and Query Properties and Languages (bold font)
with Gonzalo Navarro
6.1 Introduction
Text is the main form of communicating knowledge. Starting with hieroglyphs, the first written surfaces (stone, wood, animal skin, papyrus and rice paper) and paper, text has been created everywhere, in many forms and languages. We use the term document to denote a single unit of information, typically text in digital form, but it can also include other media. In practice, a document is loosely defined. It can be a complete logical unit, like a research article, a book or a manual. It can also be part of a larger text, such as a paragraph or a sequence of paragraphs (also called a passage of text), an entry of a dictionary, a judge's opinion on a case, the description of an automobile part, etc. Furthermore, with respect to the physical representation, a document can be any physical unit, for example a file, an email, or a World Wide Web (or just Web) page.
4.1.1 Metadata
ambiguity.
Examples of Markup Languages
o SGML: Standard Generalized Markup Language
o HTML: HyperText Markup Language
o XML: eXtensible Markup Language
SGML
<!ATTLIST email
    id         ID     #REQUIRED
    date_sent  DATE   #REQUIRED
    status     (secret | public)  public>
<!ATTLIST ref
    id         IDREF  #REQUIRED>
<!ATTLIST (image | audio)
    id         ID     #REQUIRED>
Fig. 4.1.3: DTD for structuring electronic mails
(New Syllabus w.e.f. academic year 22-23) (M7-87) Tech-Neo Publications
Information Retrieval System (MU-Sem.7-IT) (Text Processing) Pg. no. (4-7)
Fig. 4.1.4 shows an example of the use of the previous DTD.
Hypertext
Multimedia
The HTML tags follow all the SGML conventions and also include
formatting directives.
HTML pages can contain other media embedded in them, such as images
or audio.
HTML also provides fields for metadata, which can be utilized for
various applications and purposes.
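The metadata fields mentioned above can be read programmatically. The following is a minimal sketch, assuming the metadata is stored in <meta name=... content=...> tags and in the <title> element; the class name MetaExtractor, the helper extract_metadata and the sample page are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags and the page title."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            # Store the metadata field under its lower-cased name.
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return (title, dict of meta fields) for an HTML string."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.title.strip(), parser.meta


# Hypothetical sample page, used only for illustration.
SAMPLE = """<html><head>
<title>HTML Example</title>
<meta name="keywords" content="flowers, garden">
<meta name="description" content="Just an example">
</head><body><p>Look at this page</p></body></html>"""

title, meta = extract_metadata(SAMPLE)
print(title, meta)
```

An indexer could store such fields alongside the full text, for instance to restrict a search to the keywords or the description of a page.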
<hr>
<p>
<img align=left src="flower.jpg" height=10 width=10>
Look at beautiful <b>flower</b>
</body>
</html>
4.1.3 Multimedia
Text
Image Formats
to specific applications.
Audio
Movies
o multimedia presentations
o entertainment and educational titles
HyTime
Hypermedia/Time-based Structuring Language
o Multimedia document markup standard
o An SGML architecture that specifies a document's generic
hypermedia structure
(2) Elimination of stopwords
The main objective is filtering out words with very low discrimination
values for retrieval purposes.
Stopwords are the words
o that are too common among the documents
o that occur in 80% of the documents
o For example, articles, prepositions, conjunctions, etc.
Stopword elimination significantly shrinks the size of the indexing
structure but may decrease recall.
Problem: Search for "to be or not to be"
o The elimination process might leave only the term 'be', which makes it
difficult to recognize the documents that contain that phrase.
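The two ideas above (filtering against a stopword list, and treating terms that occur in most documents as stopwords) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the STOPWORDS set is a small hand-picked assumption, the 0.8 threshold mirrors the 80% figure quoted above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "to", "or", "not", "of", "in", "and", "is", "at", "on"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]


def frequent_terms(docs, threshold=0.8):
    """Terms that occur in at least `threshold` of the documents."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {t for t, n in df.items() if n / len(docs) >= threshold}


# The problematic query from the text: only 'be' survives the filter.
query_terms = remove_stopwords(tokenize("to be or not to be"))
print(query_terms)
```

With this list the query is reduced to the term 'be' alone, which shows why some systems index the full text instead of dropping stopwords.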
(3) Stemming
The major goal is to get rid of affixes (prefixes and suffixes) and make it
possible to retrieve documents that contain syntactic variants of the
query terms.
o picnicking --> picnic
o stresses --> stress
o king --> k
Stems help to enhance retrieval efficiency.
Stemming helps to reduce the size of the indexing structure.
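Suffix removal can be illustrated with a toy rule-based stemmer. This is only a sketch under stated assumptions: the rule list RULES and the minimum stem length MIN_STEM are hypothetical choices made for the example, and this is not the Porter algorithm or any other published stemmer.

```python
# (suffix, replacement) pairs, tried in order; longer suffixes come first.
RULES = [
    ("sses", "ss"),
    ("ions", ""),
    ("ing", ""),
    ("ion", ""),
    ("ed", ""),
    ("ss", "ss"),  # protects words such as "stress" from the "s" rule
    ("s", ""),
]
MIN_STEM = 3  # refuse stems shorter than this, so "king" is not cut to "k"


def stem(word):
    """Strip the first matching suffix whose removal leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM:
                return candidate
    return word


for w in ["connect", "connecting", "connected", "connections", "stresses", "king"]:
    print(w, "-->", stem(w))
```

The minimum-length check is one simple way to avoid over-stemming a word such as "king"; real stemmers use more elaborate conditions on the remaining stem.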
[Figure: text operations applied to a document: accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing; structure recognition; text + structure]
Example
If V(s) = {connect, connecting, connected} then s = connect
For a given query q:
o D_l: the local document set, the set of documents retrieved for a
given query q
o V_l: the local vocabulary, the set of all distinct words in the local
document set
o S_l: the set of all distinct stems derived from the set V_l
Strategies for building local clusters
(1) Association clusters
(2) Metric clusters
(3) Scalar clusters
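For association clusters, the correlation between the stems s_u and s_v is
c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}    ...(4.2.1)
where f_{s_u,j} is the frequency of the stem s_u in the document d_j of the local document set D_l.
For metric clusters, the correlation is
c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}    ...(4.2.4)
where r(k_i, k_j) is the distance (in words) between the keywords k_i and k_j in the same document.
The correlation factor c_{u,v} quantifies the absolute inverse distances.
o The association matrix s is unnormalized.
If we adopt
s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}    ...(4.2.6)
then the association matrix is said to be normalized.
These correlations are used to build local metric clusters.
For scalar clusters, the correlation is the cosine of the angle between the correlation vectors of the two stems:
s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}    ...(4.2.7)
where \vec{s}_u is the vector of correlation values of the stem s_u.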
Let S_u(n) be a function which returns the set of the n largest values s_{u,v}, v ≠ u. Then S_u(n) defines a scalar cluster around the stem s_u.
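The cluster definitions can be made concrete with a small computation. The sketch below assumes that the local documents are already given as lists of stems, uses the unnormalized association correlation (s_{u,v} = c_{u,v}), and returns the neighbouring stems themselves rather than the correlation values; all function names and the three sample documents are hypothetical. Metric clusters are not covered here because they also need word positions.

```python
from collections import Counter
from math import sqrt


def stem_frequencies(docs):
    """Map each stem to its per-document frequency vector (f_{s,j})."""
    freq = {}
    for j, doc in enumerate(docs):
        for s, n in Counter(doc).items():
            freq.setdefault(s, [0] * len(docs))[j] = n
    return freq


def association_matrix(freq):
    """Unnormalized association correlations: s_{u,v} = c_{u,v}."""
    stems = sorted(freq)
    return {
        u: {v: sum(a * b for a, b in zip(freq[u], freq[v])) for v in stems}
        for u in stems
    }


def scalar_similarity(matrix, u, v):
    """Cosine of the angle between the correlation vectors of s_u and s_v."""
    vec_u = list(matrix[u].values())
    vec_v = list(matrix[v].values())
    dot = sum(a * b for a, b in zip(vec_u, vec_v))
    norm = sqrt(sum(a * a for a in vec_u)) * sqrt(sum(b * b for b in vec_v))
    return dot / norm if norm else 0.0


def cluster(scores_for_u, u, n):
    """S_u(n): the n stems v != u with the largest s_{u,v}."""
    others = [(v, s) for v, s in scores_for_u.items() if v != u]
    others.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by name
    return [v for v, _ in others[:n]]


# Hypothetical local document set, already reduced to stems.
D_l = [
    ["information", "retrieval", "system"],
    ["information", "retrieval", "index"],
    ["index", "structure"],
]
matrix = association_matrix(stem_frequencies(D_l))
print(cluster(matrix["information"], "information", 2))
print(scalar_similarity(matrix, "information", "retrieval"))
```

In this sample, "information" and "retrieval" occur in exactly the same documents, so their correlation vectors coincide and their scalar similarity is (up to rounding) 1.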
Interactive Search Formulation
Stems that belong to clusters associated to the query stems (or terms) can
be used to expand the original query.
A stem s_u which belongs to a cluster (of size n) associated to another
stem s_v is said to be a neighbour of s_v.
Fig. 4.2.2 represents a stem s_u as a neighbour of the stem s_v within a
neighbourhood S_v(n).
[Fig. 4.2.2: stems plotted as points (x), with the neighbourhood S_v(n) drawn around the stem s_v]
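Query expansion with neighbour stems can be sketched as follows. The example assumes that the neighbourhoods S_v(n) have already been computed and are available as a dictionary from each stem to its ranked neighbours; the dictionary NEIGHBOURS and the function expand_query are hypothetical names introduced for illustration.

```python
def expand_query(query_stems, neighbours, n=2):
    """Add up to n neighbour stems of each query stem to the query."""
    expanded = list(query_stems)
    for s in query_stems:
        for v in neighbours.get(s, [])[:n]:
            if v not in expanded:  # avoid adding the same stem twice
                expanded.append(v)
    return expanded


# Hypothetical, precomputed neighbourhoods S_v(n), best neighbour first.
NEIGHBOURS = {
    "retrieval": ["information", "index"],
    "cluster": ["group"],
}

expanded = expand_query(["retrieval", "cluster"], NEIGHBOURS, n=1)
print(expanded)
```

In an interactive setting the candidate neighbours would be shown to the user, who decides which of them to add to the original query.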
Ans.
Document-oriented XML retrieval
(1) Document vs. data-centric XML retrieval
(recall)
(2) Focused retrieval
Ans.:
Clustering is the process of partitioning a set of data (or objects) into a set
of meaningful sub-classes, called clusters. It helps users understand the natural
grouping or structure in a data set. It is used either as a stand-alone tool to get
insight into data distribution or as a preprocessing step for other algorithms.
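As an illustration of partitioning objects into clusters, the sketch below groups documents with a simple single-pass procedure. This is one possible technique chosen for brevity, not a method prescribed by this chapter: it assumes documents are given as lists of terms, uses the Jaccard coefficient as the similarity measure, and compares each document only with the first member of every existing cluster. The threshold value and the sample documents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient between two term collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document to the most similar cluster, or start a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best, best_sim = None, threshold
        for c in clusters:
            sim = jaccard(doc, docs[c[0]])  # compare with the cluster's first member
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Hypothetical documents, already tokenized.
docs = [
    ["text", "retrieval", "index"],
    ["text", "retrieval", "query"],
    ["image", "audio", "video"],
    ["audio", "video", "format"],
]
clusters = single_pass_cluster(docs)
print(clusters)
```

The result depends on the order in which documents are processed and on the threshold, which is a known limitation of single-pass approaches.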
Chapter Ends..