revised the wiki page of Similarity Queries

commit: 6fddf00d78c34f146d1c65dfd2fd58ed5d17c89b
author: Chen Li <chenli@gmail.com> Wed May 29 23:30:08 2013 -0700
committer: Chen Li <chenli@gmail.com> Wed May 29 23:30:08 2013 -0700
tree: ab299cd7799a579054108d436d67521a62718df7
parent: eedcc40a7ebdfbd58744c22e3897d0a4588e3a69
diff --git a/asterix-doc/src/site/markdown/AsterixSimilarityQueries.md b/asterix-doc/src/site/markdown/AsterixSimilarityQueries.md
index dde478c..f9a7edd 100644
--- a/asterix-doc/src/site/markdown/AsterixSimilarityQueries.md
+++ b/asterix-doc/src/site/markdown/AsterixSimilarityQueries.md

@@ -21,8 +21,8 @@
 instance, in our
 [TinySocial](AdmAql101.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
 example, the `friend-ids` of a Facebook user form a set
-of friends, and we can define a similarity between two sets of
-friends. We can also convert a string to a set of grams of a length q
+of friends, and we can define a similarity between the sets of
+friends of two users. We can also convert a string to a set of grams of length "n"
 (called "n-grams") and define the Jaccard similarity between the two
 gram sets of the two strings. Formally, the "n-grams" of a string are
 its substrings of length "n". For instance, the 3-grams of the string
@@ -60,7 +60,7 @@
 
 
 AsterixDB allows a user to use a similarity operator `~=` to express a
-similarity condition by defining the similarity function and threshold
+condition by defining the similarity function and threshold
 using "set" statements earlier. For instance, the above query can be
 equivalently written as:
 
@@ -75,15 +75,15 @@
 
 
 In this query, we first declare Jaccard as the similarity function
-using `simfunction` and specify the threshold `0.6f` using
+using `simfunction` and then specify the threshold `0.6f` using
 `simthreshold`.
 
 ## Similarity Join Queries ## 
 
-AsterixDB supports fuzzy joins between two data sets. The following
+AsterixDB supports fuzzy joins between two sets. The following
 [query](AdmAql101.html#Query_5_-_Fuzzy_Join)
 finds, for each Facebook user, all Twitter users with names
-"similar" to their name based on the edit distance.
+similar to their name based on the edit distance.
 
         use dataverse TinySocial;
         
@@ -103,11 +103,14 @@
                                 }
         };
 
-## Using Indexes to Support Queries ## 
+## Using Indexes to Support Similarity Queries ## 
 
-AsterixDB uses a gram-based inverted index (called "ngram") and
-efficient algorithms to support similarity queries.  For a set of
-strings, we generate n-grams for each string, and build an inverted
+AsterixDB uses two types of indexes to support similarity queries, namely
+"ngram index" and "keyword index".
+
+### NGram Index ###
+
+An "ngram index" is constructed on a set of strings.  We generate n-grams for each string, and build an inverted
 list for each n-gram that includes the ids of the strings with this
 gram.  A similarity query can be answered efficiently by accessing the
 inverted lists of the grams in the query and counting the number of
@@ -125,21 +128,29 @@
         
         create index fbUserIdx on FacebookUsers(name) type ngram(3);
 
-
 The number "3" in "ngram(3)" is the length "n" in the grams. This
 index can be used to optimize similarity queries on this attribute
-using [edit distance](AsterixDataTypesAndFunctions.html#edit-distance), or [Jaccard](AsterixDataTypesAndFunctions.html#similarity-jaccard) queries on this attribute where the
+using 
+[edit-distance](AsterixDataTypesAndFunctions.html#edit-distance), 
+[edit-distance-check](AsterixDataTypesAndFunctions.html#edit-distance-check), 
+or [Jaccard](AsterixDataTypesAndFunctions.html#similarity-jaccard) queries on this attribute where the
 similarity is defined on sets of 3-grams.  This index can also be used
 to optimize queries with the "[contains()](AsterixDataTypesAndFunctions.html#contains)" predicate (i.e., substring
 matching) since it can also be solved by counting on the inverted
 list of the grams in the query string.
 
-AsterixDB also has an improved version of n-gram index called
-"Partitioned N-Gram Index".  Its main idea is to partition the data
-into groups, and build an n-gram index for the records in each group.
-This partitioned index can be used to further improve query
-performance.  The following is an example to declare such an index.
+### Keyword Index ###
+
+A "keyword index" is also constructed on a set of strings.  Instead of 
+generating grams as in an ngram index, we generate tokens (e.g., words) from strings
+and for each token, construct an inverted list that includes the ids of the
+records with this token.  The following example shows how to create a keyword index:
 
         use dataverse TinySocial;
-        
-        create index fbUserFuzzyIdx on FacebookUsers(name) type fuzzy ngram(3);
+
+        create index fbUserIdx on FacebookUsers(name) type keyword;
+
+The keyword index can be used to optimize queries with token-based similarity predicates, including
+[similarity-jaccard](AsterixDataTypesAndFunctions.html#similarity-jaccard) and
+[similarity-jaccard-check](AsterixDataTypesAndFunctions.html#similarity-jaccard-check).
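
Note on the revised text: the ngram-index paragraph describes answering a similarity query by reading the inverted lists of the query's grams and counting matches. The following is a minimal Python sketch of that idea, not AsterixDB's actual implementation. All function names (`ngrams`, `build_ngram_index`, `jaccard_search`) and the sample strings are hypothetical, and it assumes set semantics for grams with no padding at string boundaries.

```python
from collections import defaultdict


def ngrams(s, n=3):
    """Return the set of length-n substrings of s (no padding; an assumption)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_ngram_index(strings, n=3):
    """Map each gram to the set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in ngrams(s, n):
            index[g].add(sid)
    return index


def jaccard_search(index, strings, query, threshold, n=3):
    """Find strings whose gram-set Jaccard similarity to query is >= threshold."""
    q = ngrams(query, n)
    # Count, per string id, how many query grams appear in its inverted lists.
    counts = defaultdict(int)
    for g in q:
        for sid in index.get(g, ()):
            counts[sid] += 1
    results = []
    for sid, shared in counts.items():
        grams = ngrams(strings[sid], n)
        # |A ∩ B| = shared, so |A ∪ B| = |A| + |B| - shared.
        sim = shared / (len(q) + len(grams) - shared)
        if sim >= threshold:
            results.append((strings[sid], sim))
    return results


names = ["database", "databases", "datum"]
idx = build_ngram_index(names, n=3)
matches = jaccard_search(idx, names, "database", threshold=0.6, n=3)
```

A keyword index as described in the diff has the same shape under these assumptions: replacing `ngrams` with a tokenizer such as `set(s.split())` yields inverted lists keyed by word instead of by gram.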
+