blob: f70aed30f7270448a406b3254973fb785bba0722 [file] [log] [blame]
Ian Maxon100cb802017-04-24 18:48:07 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2017-04-24
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20170424" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
22<script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
23 (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
24 m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
25 })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
26
27 ga('create', 'UA-41536543-1', 'uci.edu');
28 ga('send', 'pageview');</script>
29
30 </head>
31 <body class="topBarDisabled">
32
33
34
35
36 <div class="container-fluid">
37 <div id="banner">
38 <div class="pull-left">
39 <a href=".././" id="bannerLeft">
40 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
41 </a>
42 </div>
43 <div class="pull-right"> </div>
44 <div class="clear"><hr/></div>
45 </div>
46
47 <div id="breadcrumbs">
48 <ul class="breadcrumb">
49
50
51 <li id="publishDate">Last Published: 2017-04-24</li>
52
53
54
55 <li id="projectVersion" class="pull-right">Version: 0.9.1</li>
56
57 <li class="divider pull-right">|</li>
58
59 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
60 Documentation Home</a>
61 </li>
62
63 </ul>
64 </div>
65
66
67 <div class="row-fluid">
68 <div id="leftColumn" class="span3">
69 <div class="well sidebar-nav">
70
71
72 <ul class="nav nav-list">
73 <li class="nav-header">Get Started - Installation</li>
74
75 <li>
76
77 <a href="../ncservice.html" title="Option 1: using NCService">
78 <i class="none"></i>
79 Option 1: using NCService</a>
80 </li>
81
82 <li>
83
84 <a href="../ansible.html" title="Option 2: using Ansible">
85 <i class="none"></i>
86 Option 2: using Ansible</a>
87 </li>
88
89 <li>
90
91 <a href="../aws.html" title="Option 3: using Amazon Web Services">
92 <i class="none"></i>
93 Option 3: using Amazon Web Services</a>
94 </li>
95
96 <li>
97
98 <a href="../yarn.html" title="Option 4: using YARN">
99 <i class="none"></i>
100 Option 4: using YARN</a>
101 </li>
102
103 <li>
104
105 <a href="../install.html" title="Option 5: using Managix (deprecated)">
106 <i class="none"></i>
107 Option 5: using Managix (deprecated)</a>
108 </li>
109 <li class="nav-header">AsterixDB Primer</li>
110
111 <li>
112
113 <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
114 <i class="none"></i>
115 Option 1: using SQL++</a>
116 </li>
117
118 <li>
119
120 <a href="../aql/primer.html" title="Option 2: using AQL">
121 <i class="none"></i>
122 Option 2: using AQL</a>
123 </li>
124 <li class="nav-header">Data Model</li>
125
126 <li>
127
128 <a href="../datamodel.html" title="The Asterix Data Model">
129 <i class="none"></i>
130 The Asterix Data Model</a>
131 </li>
132 <li class="nav-header">Queries - SQL++</li>
133
134 <li>
135
136 <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
137 <i class="none"></i>
138 The SQL++ Query Language</a>
139 </li>
140
141 <li>
142
143 <a href="../sqlpp/builtins.html" title="Builtin Functions">
144 <i class="none"></i>
145 Builtin Functions</a>
146 </li>
147 <li class="nav-header">Queries - AQL</li>
148
149 <li>
150
151 <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
152 <i class="none"></i>
153 The Asterix Query Language (AQL)</a>
154 </li>
155
156 <li>
157
158 <a href="../aql/builtins.html" title="Builtin Functions">
159 <i class="none"></i>
160 Builtin Functions</a>
161 </li>
162 <li class="nav-header">API/SDK</li>
163
164 <li>
165
166 <a href="../api.html" title="HTTP API">
167 <i class="none"></i>
168 HTTP API</a>
169 </li>
170
171 <li>
172
173 <a href="../csv.html" title="CSV Output">
174 <i class="none"></i>
175 CSV Output</a>
176 </li>
177 <li class="nav-header">Advanced Features</li>
178
179 <li>
180
181 <a href="../aql/fulltext.html" title="Support of Full-text Queries">
182 <i class="none"></i>
183 Support of Full-text Queries</a>
184 </li>
185
186 <li>
187
188 <a href="../aql/externaldata.html" title="Accessing External Data">
189 <i class="none"></i>
190 Accessing External Data</a>
191 </li>
192
193 <li>
194
195 <a href="../feeds/tutorial.html" title="Support for Data Ingestion">
196 <i class="none"></i>
197 Support for Data Ingestion</a>
198 </li>
199
200 <li>
201
202 <a href="../udf.html" title="User Defined Functions">
203 <i class="none"></i>
204 User Defined Functions</a>
205 </li>
206
207 <li>
208
209 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
210 <i class="none"></i>
211 Filter-Based LSM Index Acceleration</a>
212 </li>
213
214 <li class="active">
215
216 <a href="#"><i class="none"></i>Support of Similarity Queries</a>
217 </li>
218 </ul>
219
220
221
222 <hr class="divider" />
223
224 <div id="poweredBy">
225 <div class="clear"></div>
226 <div class="clear"></div>
227 <div class="clear"></div>
228 <a href=".././" title="AsterixDB" class="builtBy">
229 <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
230 </a>
231 </div>
232 </div>
233 </div>
234
235
236 <div id="bodyColumn" class="span9" >
237
238 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
239 ! or more contributor license agreements. See the NOTICE file
240 ! distributed with this work for additional information
241 ! regarding copyright ownership. The ASF licenses this file
242 ! to you under the Apache License, Version 2.0 (the
243 ! "License"); you may not use this file except in compliance
244 ! with the License. You may obtain a copy of the License at
245 !
246 ! http://www.apache.org/licenses/LICENSE-2.0
247 !
248 ! Unless required by applicable law or agreed to in writing,
249 ! software distributed under the License is distributed on an
250 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
251 ! KIND, either express or implied. See the License for the
252 ! specific language governing permissions and limitations
253 ! under the License.
254 ! --><h1>AsterixDB Support of Similarity Queries</h1>
255<div class="section">
256<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
257
258<ul>
259
260<li><a href="#Motivation">Motivation</a></li>
261
262<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
263
264<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
265
266<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
267
268<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
269</ul></div>
270<div class="section">
271<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
272<p>Similarity queries are widely used in applications where users need to find objects that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
273<div class="section">
274<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
275<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
276<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
277<div class="section">
278<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
279<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
280
281<div class="source">
282<div class="source">
283<pre> use dataverse TinySocial;
284
285 for $user in dataset('FacebookUsers')
286 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
287 where $ed &lt;= 2
288 return $user
289</pre></div></div>
290<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
291
292<div class="source">
293<div class="source">
294<pre> use dataverse TinySocial;
295
296 for $user in dataset('FacebookUsers')
297 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
298 where $sim &gt;= 0.6f
299 return $user
300</pre></div></div>
301<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
302
303<div class="source">
304<div class="source">
305<pre> use dataverse TinySocial;
306
307 set simfunction &quot;jaccard&quot;;
308 set simthreshold &quot;0.6f&quot;;
309
310 for $user in dataset('FacebookUsers')
311 where $user.friend-ids ~= [1,5,9,10]
312 return $user
313</pre></div></div>
314<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
315<div class="section">
316<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
317<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
318
319<div class="source">
320<div class="source">
321<pre> use dataverse TinySocial;
322
323 set simfunction &quot;edit-distance&quot;;
324 set simthreshold &quot;3&quot;;
325
326 for $fbu in dataset FacebookUsers
327 return {
328 &quot;id&quot;: $fbu.id,
329 &quot;name&quot;: $fbu.name,
330 &quot;similar-users&quot;: for $t in dataset TweetMessages
331 let $tu := $t.user
332 where $tu.name ~= $fbu.name
333 return {
334 &quot;twitter-screenname&quot;: $tu.screen-name,
335 &quot;twitter-name&quot;: $tu.name
336 }
337 };
338</pre></div></div></div>
339<div class="section">
340<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
341<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
342<div class="section">
343<h3><a name="NGram_Index"></a>NGram Index</h3>
344<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
345<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
346
347<div class="source">
348<div class="source">
349<pre> use dataverse TinySocial;
350
351 create index fbUserIdx on FacebookUsers(name) type ngram(3);
352</pre></div></div>
353<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
354<div class="section">
355<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
356
357<div class="source">
358<div class="source">
359<pre> use dataverse TinySocial;
360
361 for $user in dataset('FacebookUsers')
362 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
363 where $ed &lt;= 2
364 return $user
365</pre></div></div></div>
366<div class="section">
367<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
368
369<div class="source">
370<div class="source">
371<pre> use dataverse TinySocial;
372
373 for $user in dataset('FacebookUsers')
374 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
375 where $ed[0]
376 return $ed[1]
377</pre></div></div></div>
378<div class="section">
379<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
380
381<div class="source">
382<div class="source">
383<pre> use dataverse TinySocial;
384
385 for $user in dataset('FacebookUsers')
386 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
387 where $sim &gt;= 0.6f
388 return $user
389</pre></div></div></div>
390<div class="section">
391<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
392
393<div class="source">
394<div class="source">
395<pre> use dataverse TinySocial;
396
397 for $user in dataset('FacebookUsers')
398 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
399 where $sim[0]
400 return $user
401</pre></div></div></div>
402<div class="section">
403<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
404
405<div class="source">
406<div class="source">
407<pre> use dataverse TinySocial;
408
409 for $i in dataset('FacebookMessages')
410 where contains($i.message, &quot;phone&quot;)
411 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
412</pre></div></div></div></div>
413<div class="section">
414<h3><a name="Keyword_Index"></a>Keyword Index</h3>
415<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the objects with this token. The following two examples show how to create keyword index on two different types:</p>
416<div class="section">
417<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
418
419<div class="source">
420<div class="source">
421<pre> use dataverse TinySocial;
422
423 drop index FacebookMessages.fbMessageIdx if exists;
424 create index fbMessageIdx on FacebookMessages(message) type keyword;
425
426 for $o in dataset('FacebookMessages')
427 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like ccast&quot;), 0.2f)
428 where $jacc[0]
429 return $o
430</pre></div></div></div>
431<div class="section">
432<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
433
434<div class="source">
435<div class="source">
436<pre> use dataverse TinySocial;
437
438 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
439
440 for $c in dataset('FacebookUsers')
441 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
442 where $jacc[0]
443 return $c
444</pre></div></div>
445<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
446 </div>
447 </div>
448 </div>
449
450 <hr/>
451
452 <footer>
453 <div class="container-fluid">
454 <div class="row span12">Copyright &copy; 2017
455 <a href="https://www.apache.org/">The Apache Software Foundation</a>.
456 All Rights Reserved.
457
458 </div>
459
460 <?xml version="1.0" encoding="UTF-8"?>
461<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
462 feather logo, and the Apache AsterixDB project logo are either
463 registered trademarks or trademarks of The Apache Software
464 Foundation in the United States and other countries.
465 All other marks mentioned may be trademarks or registered
466 trademarks of their respective owners.</div>
467
468
469 </div>
470 </footer>
471 </body>
472</html>