blob: bacd1221b304730507581a00603f360d3d3db1e3 [file] [log] [blame]
Yingyi Bu08953b22016-03-25 15:23:26 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2016-03-25
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20160325" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
Yingyi Bu08953b22016-03-25 15:23:26 -070022
Yingyi Bu08953b22016-03-25 15:23:26 -070023
24 </head>
25 <body class="topBarDisabled">
26
27
28
29
30 <div class="container-fluid">
31 <div id="banner">
32 <div class="pull-left">
33 <a href="http://asterixdb.apache.org/" id="bannerLeft">
34 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
35 </a>
36 </div>
37 <div class="pull-right"> </div>
38 <div class="clear"><hr/></div>
39 </div>
40
41 <div id="breadcrumbs">
42 <ul class="breadcrumb">
43
44
45 <li id="publishDate">Last Published: 2016-03-25</li>
46
47
48
49 <li id="projectVersion" class="pull-right">Version: 0.8.8-incubating</li>
50
51 <li class="divider pull-right">|</li>
52
53 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
54 Documentation Home</a>
55 </li>
56
57 </ul>
58 </div>
59
60
61 <div class="row-fluid">
62 <div id="leftColumn" class="span3">
63 <div class="well sidebar-nav">
64
65
66 <ul class="nav nav-list">
67 <li class="nav-header">Documentation</li>
68
69 <li>
70
71 <a href="../install.html" title="Installing and Managing AsterixDB using Managix">
72 <i class="none"></i>
73 Installing and Managing AsterixDB using Managix</a>
74 </li>
75
76 <li>
77
78 <a href="../yarn.html" title="Deploying AsterixDB using YARN">
79 <i class="none"></i>
80 Deploying AsterixDB using YARN</a>
81 </li>
82
83 <li>
84
85 <a href="../aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
86 <i class="none"></i>
87 AsterixDB 101: An ADM and AQL Primer</a>
88 </li>
89
90 <li>
91
92 <a href="../aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
93 <i class="none"></i>
94 AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
95 </li>
96
97 <li>
98
Yingyi Bu08953b22016-03-25 15:23:26 -070099 <a href="../aql/datamodel.html" title="Asterix Data Model (ADM)">
100 <i class="none"></i>
101 Asterix Data Model (ADM)</a>
102 </li>
103
104 <li>
105
106 <a href="../aql/manual.html" title="Asterix Query Language (AQL)">
107 <i class="none"></i>
108 Asterix Query Language (AQL)</a>
109 </li>
110
111 <li>
112
113 <a href="../aql/functions.html" title="AQL Functions">
114 <i class="none"></i>
115 AQL Functions</a>
116 </li>
117
118 <li>
119
120 <a href="../aql/allens.html" title="AQL Allen's Relations Functions">
121 <i class="none"></i>
122 AQL Allen's Relations Functions</a>
123 </li>
124
125 <li class="active">
126
127 <a href="#"><i class="none"></i>AQL Support of Similarity Queries</a>
128 </li>
129
130 <li>
131
132 <a href="../aql/externaldata.html" title="Accessing External Data">
133 <i class="none"></i>
134 Accessing External Data</a>
135 </li>
136
137 <li>
138
139 <a href="../feeds/tutorial.html" title="Support for Data Ingestion in AsterixDB">
140 <i class="none"></i>
141 Support for Data Ingestion in AsterixDB</a>
142 </li>
143
144 <li>
145
146 <a href="../udf.html" title="Support for User Defined Functions in AsterixDB">
147 <i class="none"></i>
148 Support for User Defined Functions in AsterixDB</a>
149 </li>
150
151 <li>
152
153 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
154 <i class="none"></i>
155 Filter-Based LSM Index Acceleration</a>
156 </li>
157
158 <li>
159
160 <a href="../api.html" title="HTTP API to AsterixDB">
161 <i class="none"></i>
162 HTTP API to AsterixDB</a>
163 </li>
164 </ul>
165
166
167
168 <hr class="divider" />
169
170 <div id="poweredBy">
171 <div class="clear"></div>
172 <div class="clear"></div>
173 <div class="clear"></div>
174 <a href="https://code.google.com/p/hyracks/" title="Hyracks" class="builtBy">
175 <img class="builtBy" alt="Hyracks" src="../images/hyrax_ts.png" />
176 </a>
177 </div>
178 </div>
179 </div>
180
181
182 <div id="bodyColumn" class="span9" >
183
184 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
185 ! or more contributor license agreements. See the NOTICE file
186 ! distributed with this work for additional information
187 ! regarding copyright ownership. The ASF licenses this file
188 ! to you under the Apache License, Version 2.0 (the
189 ! "License"); you may not use this file except in compliance
190 ! with the License. You may obtain a copy of the License at
191 !
192 ! http://www.apache.org/licenses/LICENSE-2.0
193 !
194 ! Unless required by applicable law or agreed to in writing,
195 ! software distributed under the License is distributed on an
196 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
197 ! KIND, either express or implied. See the License for the
198 ! specific language governing permissions and limitations
199 ! under the License.
200 ! --><h1>AsterixDB Support of Similarity Queries</h1>
201<div class="section">
202<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
203
204<ul>
205
206<li><a href="#Motivation">Motivation</a></li>
207
208<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
209
210<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
211
212<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
213
214<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
215</ul></div>
216<div class="section">
217<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
218<p>Similarity queries are widely used in applications where users need to find records that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
219<div class="section">
220<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
221<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
222<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
223<div class="section">
224<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
225<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
226
227<div class="source">
228<div class="source">
229<pre> use dataverse TinySocial;
230
231 for $user in dataset('FacebookUsers')
232 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
233 where $ed &lt;= 2
234 return $user
235</pre></div></div>
236<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
237
238<div class="source">
239<div class="source">
240<pre> use dataverse TinySocial;
241
242 for $user in dataset('FacebookUsers')
243 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
244 where $sim &gt;= 0.6f
245 return $user
246</pre></div></div>
247<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
248
249<div class="source">
250<div class="source">
251<pre> use dataverse TinySocial;
252
253 set simfunction &quot;jaccard&quot;;
254 set simthreshold &quot;0.6f&quot;;
255
256 for $user in dataset('FacebookUsers')
257 where $user.friend-ids ~= [1,5,9,10]
258 return $user
259</pre></div></div>
260<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
261<div class="section">
262<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
263<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
264
265<div class="source">
266<div class="source">
267<pre> use dataverse TinySocial;
268
269 set simfunction &quot;edit-distance&quot;;
270 set simthreshold &quot;3&quot;;
271
272 for $fbu in dataset FacebookUsers
273 return {
274 &quot;id&quot;: $fbu.id,
275 &quot;name&quot;: $fbu.name,
276 &quot;similar-users&quot;: for $t in dataset TweetMessages
277 let $tu := $t.user
278 where $tu.name ~= $fbu.name
279 return {
280 &quot;twitter-screenname&quot;: $tu.screen-name,
281 &quot;twitter-name&quot;: $tu.name
282 }
283 };
284</pre></div></div></div>
285<div class="section">
286<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
287<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
288<div class="section">
289<h3><a name="NGram_Index"></a>NGram Index</h3>
290<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
291<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
292
293<div class="source">
294<div class="source">
295<pre> use dataverse TinySocial;
296
297 create index fbUserIdx on FacebookUsers(name) type ngram(3);
298</pre></div></div>
299<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
300<div class="section">
301<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
302
303<div class="source">
304<div class="source">
305<pre> use dataverse TinySocial;
306
307 for $user in dataset('FacebookUsers')
308 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
309 where $ed &lt;= 2
310 return $user
311</pre></div></div></div>
312<div class="section">
313<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
314
315<div class="source">
316<div class="source">
317<pre> use dataverse TinySocial;
318
319 for $user in dataset('FacebookUsers')
320 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
321 where $ed[0]
322 return $ed[1]
323</pre></div></div></div>
324<div class="section">
325<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
326
327<div class="source">
328<div class="source">
329<pre> use dataverse TinySocial;
330
331 for $user in dataset('FacebookUsers')
332 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
333 where $sim &gt;= 0.6f
334 return $user
335</pre></div></div></div>
336<div class="section">
337<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
338
339<div class="source">
340<div class="source">
341<pre> use dataverse TinySocial;
342
343 for $user in dataset('FacebookUsers')
344 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
345 where $sim[0]
346 return $user
347</pre></div></div></div>
348<div class="section">
349<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
350
351<div class="source">
352<div class="source">
353<pre> use dataverse TinySocial;
354
355 for $i in dataset('FacebookMessages')
356 where contains($i.message, &quot;phone&quot;)
357 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
358</pre></div></div></div></div>
359<div class="section">
360<h3><a name="Keyword_Index"></a>Keyword Index</h3>
361<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the records with this token. The following two examples show how to create keyword index on two different types:</p>
362<div class="section">
363<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
364
365<div class="source">
366<div class="source">
367<pre> use dataverse TinySocial;
368
369 drop index FacebookMessages.fbMessageIdx if exists;
370 create index fbMessageIdx on FacebookMessages(message) type keyword;
371
372 for $o in dataset('FacebookMessages')
373 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
374 where $jacc[0]
375 return $o
376</pre></div></div></div>
377<div class="section">
378<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
379
380<div class="source">
381<div class="source">
382<pre> use dataverse TinySocial;
383
384 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
385
386 for $c in dataset('FacebookUsers')
387 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
388 where $jacc[0]
389 return $c
390</pre></div></div>
391<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
392 </div>
393 </div>
394 </div>
395
396 <hr/>
397
398 <footer>
399 <div class="container-fluid">
400 <div class="row span12">Copyright &copy; 2016
401 <a href="http://www.apache.org/">The Apache Software Foundation</a>.
402 All Rights Reserved.
403
404 </div>
405
406 <?xml version="1.0" encoding="UTF-8"?>
407<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
408 feather logo, and the Apache AsterixDB project logo are either
409 registered trademarks or trademarks of The Apache Software
410 Foundation in the United States and other countries.
411 All other marks mentioned may be trademarks or registered
412 trademarks of their respective owners.</div>
413
414
415 </div>
416 </footer>
417 </body>
418</html>