APPLICATION OF FUZZY LOGIC TO DOCUMENT ARCHIVING
This work is concerned with the development of an information retrieval system, that is, a system that searches a large collection of documents for those relevant to a user’s demand (query). In the end, the retrieved documents are indexed according to the query. Different retrieval models have been researched over time, including the vector space, Boolean and probabilistic models.
This research work, “A fuzzy logic system for archiving purposes”, applies the concept of fuzzy logic introduced by Professor Lotfi A. Zadeh to enhance the precision of information retrieval in archives. Most information retrieval systems rely on retrieval methods such as weight matching or probability; this work instead uses membership functions and fuzzy set theory.
This method was selected because archives usually contain large numbers of documents, and the user might not have a precise idea of what he or she wants to retrieve. A tool was therefore designed that allows the user to input only part of the name of the file or document being sought.
The results were then compared with those of other search and matching systems: the Lucene App, which was developed in Java and does not use fuzzy logic; the Rubens App, which uses fuzzy logic; and Doc Fetcher, a tool obtained from the internet for searching files and documents. The model proposed in this work proved far more effective than these systems, some of which produced congested results or none at all.
TABLE OF CONTENTS
1.1 PROBLEM DEFINITION
1.2 AIMS & OBJECTIVES
1.3 RESEARCH METHODOLOGY & DESIGN
1.4 SCOPE OF STUDY
1.5 SIGNIFICANCE OF STUDY
1.6 DEFINITION OF TERMS
2.0 LITERATURE REVIEW
2.2 BACKGROUND OF FUZZY LOGIC
2.3 EARLIER MODELS AND PREVIOUS PROPOSITIONS
2.4 EASTERN vs. WESTERN PERSPECTIVE
3.0 RESEARCH METHODOLOGY
3.2 OTHER INFORMATION RETRIEVAL MODELS
3.3 ANALYSIS OF THE PROPOSED MODEL
3.4 MATHEMATICAL REPRESENTATION OF THE MODEL
4.0 DISCUSSION AND FINDINGS
4.1 FINDING THE BEST MEMBERSHIP FUNCTION
4.2 INDEXING DOCUMENTS ACCORDING TO USER QUERY
4.3 FINDING RELEVANCE LEVEL OF DOCUMENTS
4.4 SELECTING THE BEST DEFUZZIFICATION METHOD
4.5 LANGUAGE CHOICE FOR THE EXPERIMENT
4.6 SYSTEM PROCESS AND CONFIGURATION
5.0 SUMMARY AND CONCLUSION
5.1 FUTURE WORK
Logic, in its literal sense, is the ability of a system to make rational decisions, and can be regarded as the theory of reasoning in decision making. Mathematically, classical logic yields one of two results: TRUE or FALSE, 0 or 1, ON or OFF, or any other equivalent representation. This concept is referred to as Boolean logic.
Unfortunately, Boolean logic has its limitations. Because its truth values are limited to the set {0, 1}, a condition can only be either true or false; in this sense Boolean logic is too precise. For example, Boolean logic cannot differentiate between something that is “good” and something that is “very good”. Fuzzy logic removes this limitation.
Fuzzy logic is a branch of logical systems and artificial intelligence. Although it has been studied since the 1920s as infinite-valued logic, notably by Łukasiewicz and Tarski, the concept was fully developed in 1965 by Lotfi A. Zadeh in his seminal work on “fuzzy set theory”. Fuzzy logic is a kind of logic that allows for imprecise or ambiguous answers to questions, forming the basis of computer programming designed to mimic human intelligence (Microsoft Encarta Encyclopedia, 2009). Unlike Boolean logic, fuzzy logic extends the set of truth values to the interval [0.0, 1.0] and applies a membership function to each of the elements contained in a set.
From the above, it can be seen that fuzzy logic is more complex and less rigid than Boolean logic, giving a wider range of results for a condition. Rather than merely producing true or false, fuzzy logic can produce very true, true, false or very false. This concept is known as the degree of truth, where 0.0 represents absolute falseness and 1.0 represents absolute truth.
Before we go deeper into fuzzy logic, we should not neglect a concept known as defuzzification. Defuzzification is the process of producing a quantifiable result in fuzzy logic, given fuzzy sets and corresponding membership degrees. It is typically needed in fuzzy control systems. These will have a number of rules that transform a number of variables into a fuzzy result, that is, the result is described in terms of membership in fuzzy sets. For example, rules designed to decide how much pressure to apply might result in "Decrease Pressure (15%), Maintain Pressure (34%), and Increase Pressure (72%)". Defuzzification is interpreting the membership degrees of the fuzzy sets into a specific decision or real value.
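As a minimal sketch of the pressure example above, the following Python code defuzzifies the three membership degrees in two common ways: a weighted average and a maximum-membership choice. The representative crisp values assigned to each fuzzy set (-10, 0 and +10 percent change in pressure) and all function names are assumptions made for illustration; they are not taken from this work.

```python
def weighted_average_defuzzify(memberships, representative_values):
    """Average the representative crisp values, weighted by membership degree."""
    total = sum(memberships.values())
    if total == 0:
        raise ValueError("all membership degrees are zero")
    weighted = sum(degree * representative_values[name]
                   for name, degree in memberships.items())
    return weighted / total


def max_membership_defuzzify(memberships):
    """Pick the fuzzy set with the highest membership degree."""
    return max(memberships, key=memberships.get)


# Membership degrees from the pressure example in the text.
memberships = {"decrease": 0.15, "maintain": 0.34, "increase": 0.72}
# Hypothetical crisp values (percent change in pressure) for each fuzzy set.
representative_values = {"decrease": -10.0, "maintain": 0.0, "increase": 10.0}

crisp_change = weighted_average_defuzzify(memberships, representative_values)
decision = max_membership_defuzzify(memberships)
print(round(crisp_change, 2), decision)
```

Under these assumed values the weighted average yields a moderate increase in pressure, while the maximum-membership method simply selects the "increase" set. Which method suits a given system is a design choice.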
Fuzzy set theory defines fuzzy operations on fuzzy sets. It mirrors human decision making by using levels of possibility across a number of uncertain (fuzzy) categories. Fuzzy logic therefore uses IF-THEN rules of the form:
IF variable IS property THEN action
The AND, OR and NOT operators of Boolean logic are also used in fuzzy logic, where they are usually implemented as MINIMUM, MAXIMUM and COMPLEMENT respectively. They are also referred to as the Zadeh operators. These operators are defined as:
- AND: If $X_a$ is the degree of membership of one measurable variable in set a, and $X_b$ is the degree of membership of another measurable variable in set b, then the fuzzy AND will be:
$X_{a \text{ and } b} = X_a \wedge X_b = \min(X_a, X_b)$
- OR: With $X_a$ and $X_b$ as defined for AND, the fuzzy OR will be:
$X_{a \text{ or } b} = X_a \vee X_b = \max(X_a, X_b)$
- NOT: For the membership degree $X_a$ in set a, the fuzzy NOT will be:
$X_{\text{not } a} = \neg X_a = 1 - X_a$
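The three Zadeh operators can be sketched directly in Python. The function names and the range check are illustrative choices, not part of the original definitions.

```python
def _check(degree):
    """Reject values outside the unit interval [0, 1]."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError(f"membership degree out of range: {degree}")
    return degree


def fuzzy_and(x_a, x_b):
    """Zadeh AND: the minimum of the two membership degrees."""
    return min(_check(x_a), _check(x_b))


def fuzzy_or(x_a, x_b):
    """Zadeh OR: the maximum of the two membership degrees."""
    return max(_check(x_a), _check(x_b))


def fuzzy_not(x_a):
    """Zadeh NOT: the complement of the membership degree."""
    return 1.0 - _check(x_a)


print(fuzzy_and(0.7, 0.4), fuzzy_or(0.7, 0.4), fuzzy_not(0.7))
```

When the inputs are restricted to the crisp values 0 and 1, these functions behave exactly like the Boolean AND, OR and NOT.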
Fuzzy logic has been applied in many areas, including medicine, engineering equipment, databases and archives. The application of fuzzy logic to archives falls under information retrieval systems.
Archiving is the process of compressing large files or data for long-term storage. Archived data usually consists of compressed files with extensions such as .zip or .rar. Archives mostly contain very old files that are not needed for daily processing but only for reference purposes. An archive is a collection of records containing primary source documents accumulated over the lifetime of an individual or organization.
Archiving has many advantages, such as improved performance, freed storage space and reduced maintenance costs. However, organizations cannot archive as they please: data may have to remain in the database for a certain period of time before it is archived, in order to meet legal and government requirements. The benefits of archiving include the following:
- An efficient data archiving process can be far more cost effective than using the traditional method of simply adding more storage (disks) and servers.
- Data archives can be used to retrieve information at a later stage if a misdemeanor or criminal act is suspected. This has become particularly important in recent years because of incidents of criminal activity, such as drug dealing using companies’ computer resources, and even issues around terrorist activities.
- Data archiving systems can compress the information thereby reducing the storage requirements of an organization.
- Data or content archiving systems may automatically ensure that documents or records are not duplicated. Again, the replication of the same information can be a massive overhead on an organization’s resources.
- Mitigation of breaching regulations. Implementing a data archiving system minimizes the risk of being in breach of key codes of practice and other legislation.
The archived data can be made available upon request. To make archived data available, it traditionally has to be reloaded into the online database. With NetWeaver 2004s, however, a new method of archiving called NearLine Storage came into existence. NearLine Storage acts as an intermediate solution between traditional archiving and an online database: it allows access to the archived data without reloading it into the online database. There are two types of archives:
- On-line Archiving: a system in which the archive is physically attached to the organization’s network at all times. Its benefits are that it is efficient, allows fast access to archived material, and allows the archiving process to be automated.
- Off-line Archiving: a system in which an IT manager has to archive information from a computer network and then physically move that information to a separate system for retention. The drawbacks are the time and labor required to complete this task and the fact that, if someone needs to access archived data, the whole procedure has to be repeated in reverse.
1.1 PROBLEM DEFINITION
As stated earlier, fuzzy logic for archiving purposes falls under information retrieval systems. In general, we are faced with the problem of selecting documentary information from storage in response to search questions (G. Klir et al, 1995). Since archiving is a very compact way of storing data that reduces the problem of disk and space management, we shall be concerned with the storage, representation, organization and access of information items. The following points elaborate on the problems likely to be encountered when applying fuzzy logic to archives:
- Although memory wastage is not much of a concern in an archiving system, archival storage capacity is always a concern, since archived data is generally immutable and cannot be deleted until the retention period expires. This requires careful capacity management to ensure that the archive does not run out of space.
- Archives can literally contain hundreds of gigabytes of unique data, making the location of files tedious and time consuming. Therefore, a powerful indexing and search capability is required.
- Data duplication can be a serious obstacle in archives: redundant data may remain in the archive for longer than expected and can lead to data inconsistency.
- The retrieved documents have to be ranked in order of their significance with respect to the user query.
- Inability to clarify the degree of usefulness of a document in an archive.
1.2 AIMS & OBJECTIVES
Given the difficulties encountered in maintaining archives, and the inability to classify documents properly by their level of significance and membership, the aims of this research are:
- Soften the matching mechanism to partial matching: compute the degree of relevance of each document to the user query on the basis of the membership values of the query terms in the document representations.
- Represent data properly, so that it is clear which data belongs to which set (archive), and use fuzzy logic operations to indicate whether an item is a member, a partial member or not a member.
- Archives should also be implemented with well-defined data retention and deletion policies in place. Archived data must often be available for retrieval over years -- even decades -- so retention is important to meet compliance and legal obligations. Retention periods can vary by file type and may be set in metadata during the file archiving process and generally cannot be changed until deletion.
- Because archives are large, operations on them naturally slow down. A proper index and search mechanism will therefore be implemented to speed up file search and retrieval.
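The partial-matching aim above can be illustrated with a small Python sketch. It assumes that each document is represented as a mapping from index terms to membership values in [0, 1]; the file names, the membership values and the function names are hypothetical.

```python
def membership_label(degree):
    """Describe a membership degree as member, partial member or not a member."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError(f"membership degree out of range: {degree}")
    if degree == 0.0:
        return "not a member"
    if degree == 1.0:
        return "member"
    return "partial member"


def relevance(document_terms, query_term):
    """Relevance to a one-term query: the term's membership value, 0.0 if absent."""
    return document_terms.get(query_term, 0.0)


# Hypothetical archive: file name -> {index term: membership value}.
archive = {
    "minutes_2009.txt": {"budget": 1.0, "staff": 0.3},
    "report_2010.txt": {"budget": 0.6},
    "memo_2011.txt": {"staff": 0.8},
}

for name, terms in archive.items():
    degree = relevance(terms, "budget")
    print(name, degree, membership_label(degree))
```

A crisp matcher would only report whether "budget" occurs in a document; here each document also carries a degree that can later be used for ranking.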
1.3 RESEARCH METHODOLOGY & DESIGN
There are three basic groups of retrieval models, which are:
- Set theoretic
o Standard Boolean model, e.g. OPACs (Online Public Access Catalogs)
o Fuzzy Logic model, e.g. Inquiry Assistant at Bielefeld University (www.ub.uni-bielefeld.de/databases/rechercheassistent/)
- Algebraic
o Vector Space model, e.g. SMART (Salton et al, 1971)
- Probabilistic (Van Rijsbergen et al, 1979)
o Probability theory-based model, e.g. OKAPI (Robertson and Sparck Jones et al, 1976)
The methodology to be used is the set theoretic model that will implement fuzzy logic and will have the following components:
- User Interface for query and result: Allows the user to input a query and view the result set.
- Query interpreter: Processes the query in a manner similar to the documents.
- Indexer module: Creates the index, which enables faster searching.
- Matching mechanism: Determines if a document is relevant or not.
- Documents and document representations: The actual pieces of information and their logical view.
Fig 1.0: A Representation of Information Retrieval Methodology
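To make the roles of the query interpreter and the indexer module more concrete, here is a minimal sketch of an inverted index in Python. The whitespace tokenizer, the sample documents and all function names are assumptions for illustration; the actual system may process text quite differently.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; used for documents and queries alike."""
    return text.lower().split()


def build_index(documents):
    """Indexer: map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def candidates(index, query):
    """Return the ids of documents that contain at least one query term."""
    found = set()
    for term in tokenize(query):
        found |= index.get(term, set())
    return found


# Hypothetical document collection.
documents = {
    "d1": "annual budget report",
    "d2": "staff meeting minutes",
    "d3": "budget meeting notes",
}
index = build_index(documents)
print(sorted(candidates(index, "Budget notes")))
```

The index is built once, so a query only touches the entries for its own terms instead of scanning every document. The matching mechanism would then score only the candidate documents.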
An Information Retrieval model is a quadruple $\langle D, Q, F, R \rangle$ where
- $D$ is a set of representations for the documents in the collection.
- $Q$ is a set of representations for the user information needs (queries).
- $F$ is a framework for modeling document representations, queries, and their relationships.
- $R: Q \times D \rightarrow \mathbb{R}$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$.
Traditional Fuzzy Document Representation (Salton and McGill et al, 1989)
A function F is defined in the following way: $F: D \times T \rightarrow [0, 1]$, where D is the set of documents and T is the set of index terms.
$F(d, t)$ changes from a crisp set value (either 0 or 1) to a continuous membership value in the range [0, 1].
Index term weight: the degree of “aboutness” of a document with respect to a term, expressed by the value $F(d, t)$, also interpreted as the significance of the term in representing the document content.
$F(d, t) = tf_{dt} \times idf_t$
$tf_{dt}$: normalized frequency of term t in document d:
$tf_{dt} = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{number of occurrences of the most frequent term in document } d}$
$idf_t$: inverse document frequency of term t:
$idf_t = \log\left(\frac{\text{total number of documents in the collection}}{\text{number of documents containing term } t}\right)$
The index term weight therefore increases:
- with the number of occurrences of the term within a document
- with the rarity of the term across the whole collection
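The weighting formula above can be sketched in Python as follows. Two points are assumptions rather than part of the source: the natural logarithm is used because the base is not specified, and the optional `normalize` step divides by log(N), the largest possible idf value, so that the result stays within [0, 1] as the definition of F requires. The sample documents and function names are hypothetical.

```python
import math
from collections import Counter


def term_weights(documents):
    """Return {doc_id: {term: F(d, t)}} with F(d, t) = tf_dt * idf_t."""
    counts = {doc_id: Counter(terms) for doc_id, terms in documents.items()}
    n_docs = len(documents)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # each document counts once per term
    weights = {}
    for doc_id, term_counts in counts.items():
        if not term_counts:  # an empty document has no term weights
            weights[doc_id] = {}
            continue
        max_count = max(term_counts.values())
        weights[doc_id] = {
            term: (n / max_count) * math.log(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
    return weights


def normalize(weights, n_docs):
    """Assumption: divide by log(N), the maximum idf, to keep weights in [0, 1]."""
    max_idf = math.log(n_docs) if n_docs > 1 else 1.0
    return {doc_id: {t: w / max_idf for t, w in tw.items()}
            for doc_id, tw in weights.items()}


# Hypothetical, already tokenized documents.
documents = {
    "d1": ["fuzzy", "logic", "fuzzy"],
    "d2": ["boolean", "logic"],
    "d3": ["archive", "index"],
}
weights = term_weights(documents)
membership = normalize(weights, len(documents))
print(membership["d1"])
```

In this example "fuzzy" is both the most frequent term in d1 and unique to it, so it receives the highest possible weight, while "logic" is less frequent in d1 and shared with d2, so its weight is much lower.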
1.4 SCOPE OF STUDY
This project covers only file arrangement, search optimization, the assignment of membership functions to elements, and other fuzzy operations on archives. It is therefore limited to the manipulation of archives using fuzzy operations and to determining the membership level and significance of documents in an archive. It does not cover how the archives are originally created.
1.5 SIGNIFICANCE OF STUDY
The purpose of this research is to improve file retrieval in archives and to eliminate the “too precise” results produced by Boolean logic. The system should instead produce graded results and offer more flexibility. Using fuzzy logic is intended to eliminate the following shortcomings:
- Oversimplified representation of the information items (documents).
- No formal means for qualifying the role and degree of the terms in characterizing document contents.
- Matching mechanism only based on the evaluation of the presence of a given search term in the document representation.
- No way of establishing the degree of usefulness of each single document.
- Problems with Boolean Operators:
o Disjunctive (OR) queries lead to information overload by too many results.
o Conjunctive (AND) queries lead to reduced, and commonly zero, results.
o Conjunctive queries imply reduction in Recall.
- Query language gives users only a crisp way of specifying their information needs: term is either definitely significant or completely useless.
- No discriminating power in ordinary logic.
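The contrast between crisp and fuzzy query evaluation described in the list above can be sketched as follows. The index term weights are hypothetical values of F(d, t), and the function names are illustrative; the fuzzy variant uses the minimum for AND and the maximum for OR.

```python
# Hypothetical index term weights F(d, t) in [0, 1].
weights = {
    "d1": {"fuzzy": 0.9, "archive": 0.2},
    "d2": {"fuzzy": 0.4, "archive": 0.7},
    "d3": {"archive": 0.1},
}


def boolean_and(weights, terms):
    """Crisp AND: the unordered set of documents containing every term."""
    return {d for d, w in weights.items() if all(t in w for t in terms)}


def boolean_or(weights, terms):
    """Crisp OR: the unordered set of documents containing any term."""
    return {d for d, w in weights.items() if any(t in w for t in terms)}


def fuzzy_rank(weights, terms, combine):
    """Score each document with combine (min for AND, max for OR) and rank it.

    Documents with a score of zero are left out of the result.
    """
    scored = [(d, combine(w.get(t, 0.0) for t in terms))
              for d, w in weights.items()]
    relevant = [(d, s) for d, s in scored if s > 0.0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


query = ["fuzzy", "archive"]
print(boolean_or(weights, query))
print(fuzzy_rank(weights, query, max))
print(fuzzy_rank(weights, query, min))
```

The Boolean functions return sets with no order, so every matching document looks equally useful. The fuzzy versions return the same candidates with a degree attached, which gives the discriminating power that ordinary logic lacks.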
1.6 DEFINITION OF TERMS
- Logic: the ability of a system to make a rational decision, which can be regarded as the theory of reasoning in decision making.
- Fuzzy logic: a branch of logic that applies degree of truth to ordinary Boolean logic by assigning membership functions to elements of a set.
- Degree of truth: a value between 0.0 (absolute falseness) and 1.0 (absolute truth) that expresses the significance assigned to a member of a set.
- Membership function: a function that specifies the level of membership of an element in a set, even where that membership is imprecise.
- Artificial Intelligence: a branch of computer science that develops programs to allow machines to perform functions normally requiring human intelligence.
- Defuzzification: is the process of producing a quantifiable result in fuzzy logic, given fuzzy sets and corresponding membership degrees.
- Archive: is a collection of documents e.g. letters, photographs in a long term storage for future reference.
- Query: is a formal representation of the request of the information needed from a storage system e.g. in databases, archives etc.
- Information retrieval: is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information (G. Salton et al, Fuzzy Information Retrieval, 1968).
- Index: is a data structure that improves the speed of data retrieval operations.
- Search: the operation of locating required data using specific search queries or conditions; the ability to retrieve results based on a specified criterion.
- Redundancy: the duplication of data, which can make the data inconsistent and thereby unreliable.
- Set: is a collection of well-defined and distinct objects, considered as an object on its own.
- Fuzzy sets: are functions that map a value, which might be a member of a set, to a number between zero and one, indicating its actual degree of membership.
- GUI: means a graphical user interface that allows users to interact with a program, application or system containing graphics (combination of images, text, videos, flash etc.).
- Model: A model is a representation or an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact.
- Optimization: is used to enhance the effectiveness and performance of search activities in information retrieval.
- Ranking: assigning priority to search results.
- Index term weight: the degree of “aboutness” of a document with respect to a term, expressed by the value F(d,t), also interpreted as the significance of the term in representing the document content.