blob: e6acadac31a9878ed68b4891db4ba8eed2631b3b [file] [log] [blame]
Ian Maxon444ca1b2017-08-25 11:41:41 -07001<!DOCTYPE html>
2<!--
Ian Maxon7a4bed92017-09-15 02:01:18 +02003 | Generated by Apache Maven Doxia at 2017-09-14
Ian Maxon444ca1b2017-08-25 11:41:41 -07004 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
Ian Maxon7a4bed92017-09-15 02:01:18 +020010 <meta name="Date-Revision-yyyymmdd" content="20170914" />
Ian Maxon444ca1b2017-08-25 11:41:41 -070011 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
Ian Maxon444ca1b2017-08-25 11:41:41 -070022
Ian Maxon444ca1b2017-08-25 11:41:41 -070023
24 </head>
25 <body class="topBarDisabled">
26
27
28
29
30 <div class="container-fluid">
31 <div id="banner">
32 <div class="pull-left">
33 <a href=".././" id="bannerLeft">
34 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
35 </a>
36 </div>
37 <div class="pull-right"> </div>
38 <div class="clear"><hr/></div>
39 </div>
40
41 <div id="breadcrumbs">
42 <ul class="breadcrumb">
43
44
Ian Maxon7a4bed92017-09-15 02:01:18 +020045 <li id="publishDate">Last Published: 2017-09-14</li>
Ian Maxon444ca1b2017-08-25 11:41:41 -070046
47
48
Ian Maxon7a4bed92017-09-15 02:01:18 +020049 <li id="projectVersion" class="pull-right">Version: 0.9.2</li>
Ian Maxon444ca1b2017-08-25 11:41:41 -070050
51 <li class="divider pull-right">|</li>
52
53 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
54 Documentation Home</a>
55 </li>
56
57 </ul>
58 </div>
59
60
61 <div class="row-fluid">
62 <div id="leftColumn" class="span3">
63 <div class="well sidebar-nav">
64
65
66 <ul class="nav nav-list">
67 <li class="nav-header">Get Started - Installation</li>
68
69 <li>
70
71 <a href="../ncservice.html" title="Option 1: using NCService">
72 <i class="none"></i>
73 Option 1: using NCService</a>
74 </li>
75
76 <li>
77
78 <a href="../ansible.html" title="Option 2: using Ansible">
79 <i class="none"></i>
80 Option 2: using Ansible</a>
81 </li>
82
83 <li>
84
85 <a href="../aws.html" title="Option 3: using Amazon Web Services">
86 <i class="none"></i>
87 Option 3: using Amazon Web Services</a>
88 </li>
89
90 <li>
91
92 <a href="../yarn.html" title="Option 4: using YARN">
93 <i class="none"></i>
94 Option 4: using YARN</a>
95 </li>
96
97 <li>
98
99 <a href="../install.html" title="Option 5: using Managix (deprecated)">
100 <i class="none"></i>
101 Option 5: using Managix (deprecated)</a>
102 </li>
103 <li class="nav-header">AsterixDB Primer</li>
104
105 <li>
106
107 <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
108 <i class="none"></i>
109 Option 1: using SQL++</a>
110 </li>
111
112 <li>
113
114 <a href="../aql/primer.html" title="Option 2: using AQL">
115 <i class="none"></i>
116 Option 2: using AQL</a>
117 </li>
118 <li class="nav-header">Data Model</li>
119
120 <li>
121
122 <a href="../datamodel.html" title="The Asterix Data Model">
123 <i class="none"></i>
124 The Asterix Data Model</a>
125 </li>
126 <li class="nav-header">Queries - SQL++</li>
127
128 <li>
129
130 <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
131 <i class="none"></i>
132 The SQL++ Query Language</a>
133 </li>
134
135 <li>
136
137 <a href="../sqlpp/builtins.html" title="Builtin Functions">
138 <i class="none"></i>
139 Builtin Functions</a>
140 </li>
141 <li class="nav-header">Queries - AQL</li>
142
143 <li>
144
145 <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
146 <i class="none"></i>
147 The Asterix Query Language (AQL)</a>
148 </li>
149
150 <li>
151
152 <a href="../aql/builtins.html" title="Builtin Functions">
153 <i class="none"></i>
154 Builtin Functions</a>
155 </li>
156 <li class="nav-header">API/SDK</li>
157
158 <li>
159
160 <a href="../api.html" title="HTTP API">
161 <i class="none"></i>
162 HTTP API</a>
163 </li>
164
165 <li>
166
167 <a href="../csv.html" title="CSV Output">
168 <i class="none"></i>
169 CSV Output</a>
170 </li>
171 <li class="nav-header">Advanced Features</li>
172
173 <li>
174
175 <a href="../aql/fulltext.html" title="Support of Full-text Queries">
176 <i class="none"></i>
177 Support of Full-text Queries</a>
178 </li>
179
180 <li>
181
182 <a href="../aql/externaldata.html" title="Accessing External Data">
183 <i class="none"></i>
184 Accessing External Data</a>
185 </li>
186
187 <li>
188
189 <a href="../feeds/tutorial.html" title="Support for Data Ingestion">
190 <i class="none"></i>
191 Support for Data Ingestion</a>
192 </li>
193
194 <li>
195
196 <a href="../udf.html" title="User Defined Functions">
197 <i class="none"></i>
198 User Defined Functions</a>
199 </li>
200
201 <li>
202
203 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
204 <i class="none"></i>
205 Filter-Based LSM Index Acceleration</a>
206 </li>
207
208 <li class="active">
209
210 <a href="#"><i class="none"></i>Support of Similarity Queries</a>
211 </li>
212 </ul>
213
214
215
216 <hr class="divider" />
217
218 <div id="poweredBy">
219 <div class="clear"></div>
220 <div class="clear"></div>
221 <div class="clear"></div>
222 <a href=".././" title="AsterixDB" class="builtBy">
223 <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
224 </a>
225 </div>
226 </div>
227 </div>
228
229
230 <div id="bodyColumn" class="span9" >
231
232 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
233 ! or more contributor license agreements. See the NOTICE file
234 ! distributed with this work for additional information
235 ! regarding copyright ownership. The ASF licenses this file
236 ! to you under the Apache License, Version 2.0 (the
237 ! "License"); you may not use this file except in compliance
238 ! with the License. You may obtain a copy of the License at
239 !
240 ! http://www.apache.org/licenses/LICENSE-2.0
241 !
242 ! Unless required by applicable law or agreed to in writing,
243 ! software distributed under the License is distributed on an
244 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
245 ! KIND, either express or implied. See the License for the
246 ! specific language governing permissions and limitations
247 ! under the License.
248 ! --><h1>AsterixDB Support of Similarity Queries</h1>
249<div class="section">
250<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
251
252<ul>
253
254<li><a href="#Motivation">Motivation</a></li>
255
256<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
257
258<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
259
260<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
261
262<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
263</ul></div>
264<div class="section">
265<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
266<p>Similarity queries are widely used in applications where users need to find objects that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
267<div class="section">
268<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
269<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
270<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
271<div class="section">
272<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
273<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
274
275<div class="source">
276<div class="source">
277<pre> use dataverse TinySocial;
278
279 for $user in dataset('FacebookUsers')
280 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
281 where $ed &lt;= 2
282 return $user
283</pre></div></div>
284<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
285
286<div class="source">
287<div class="source">
288<pre> use dataverse TinySocial;
289
290 for $user in dataset('FacebookUsers')
291 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
292 where $sim &gt;= 0.6f
293 return $user
294</pre></div></div>
295<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
296
297<div class="source">
298<div class="source">
299<pre> use dataverse TinySocial;
300
301 set simfunction &quot;jaccard&quot;;
302 set simthreshold &quot;0.6f&quot;;
303
304 for $user in dataset('FacebookUsers')
305 where $user.friend-ids ~= [1,5,9,10]
306 return $user
307</pre></div></div>
308<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
309<div class="section">
310<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
311<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
312
313<div class="source">
314<div class="source">
315<pre> use dataverse TinySocial;
316
317 set simfunction &quot;edit-distance&quot;;
318 set simthreshold &quot;3&quot;;
319
320 for $fbu in dataset FacebookUsers
321 return {
322 &quot;id&quot;: $fbu.id,
323 &quot;name&quot;: $fbu.name,
324 &quot;similar-users&quot;: for $t in dataset TweetMessages
325 let $tu := $t.user
326 where $tu.name ~= $fbu.name
327 return {
328 &quot;twitter-screenname&quot;: $tu.screen-name,
329 &quot;twitter-name&quot;: $tu.name
330 }
331 };
332</pre></div></div></div>
333<div class="section">
334<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
335<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
336<div class="section">
337<h3><a name="NGram_Index"></a>NGram Index</h3>
338<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
339<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
340
341<div class="source">
342<div class="source">
343<pre> use dataverse TinySocial;
344
345 create index fbUserIdx on FacebookUsers(name) type ngram(3);
346</pre></div></div>
347<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
348<div class="section">
349<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
350
351<div class="source">
352<div class="source">
353<pre> use dataverse TinySocial;
354
355 for $user in dataset('FacebookUsers')
356 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
357 where $ed &lt;= 2
358 return $user
359</pre></div></div></div>
360<div class="section">
361<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
362
363<div class="source">
364<div class="source">
365<pre> use dataverse TinySocial;
366
367 for $user in dataset('FacebookUsers')
368 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
369 where $ed[0]
370 return $ed[1]
371</pre></div></div></div>
372<div class="section">
373<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
374
375<div class="source">
376<div class="source">
377<pre> use dataverse TinySocial;
378
379 for $user in dataset('FacebookUsers')
380 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
381 where $sim &gt;= 0.6f
382 return $user
383</pre></div></div></div>
384<div class="section">
385<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
386
387<div class="source">
388<div class="source">
389<pre> use dataverse TinySocial;
390
391 for $user in dataset('FacebookUsers')
392 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
393 where $sim[0]
394 return $user
395</pre></div></div></div>
396<div class="section">
397<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
398
399<div class="source">
400<div class="source">
401<pre> use dataverse TinySocial;
402
403 for $i in dataset('FacebookMessages')
404 where contains($i.message, &quot;phone&quot;)
405 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
406</pre></div></div></div></div>
407<div class="section">
408<h3><a name="Keyword_Index"></a>Keyword Index</h3>
409<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the objects with this token. The following two examples show how to create keyword index on two different types:</p>
410<div class="section">
411<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
412
413<div class="source">
414<div class="source">
415<pre> use dataverse TinySocial;
416
417 drop index FacebookMessages.fbMessageIdx if exists;
418 create index fbMessageIdx on FacebookMessages(message) type keyword;
419
420 for $o in dataset('FacebookMessages')
421 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like ccast&quot;), 0.2f)
422 where $jacc[0]
423 return $o
424</pre></div></div></div>
425<div class="section">
426<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
427
428<div class="source">
429<div class="source">
430<pre> use dataverse TinySocial;
431
432 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
433
434 for $c in dataset('FacebookUsers')
435 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
436 where $jacc[0]
437 return $c
438</pre></div></div>
439<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
440 </div>
441 </div>
442 </div>
443
444 <hr/>
445
446 <footer>
447 <div class="container-fluid">
448 <div class="row span12">Copyright &copy; 2017
449 <a href="https://www.apache.org/">The Apache Software Foundation</a>.
450 All Rights Reserved.
451
452 </div>
453
454 <?xml version="1.0" encoding="UTF-8"?>
455<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
456 feather logo, and the Apache AsterixDB project logo are either
457 registered trademarks or trademarks of The Apache Software
458 Foundation in the United States and other countries.
459 All other marks mentioned may be trademarks or registered
460 trademarks of their respective owners.</div>
461
462
463 </div>
464 </footer>
465 </body>
466</html>