blob: 6e15d35a888c9c4358389f7308c692af9a08345f [file] [log] [blame]
Ian Maxonbf2c56b2017-01-24 14:14:49 -08001<!DOCTYPE html>
2<!--
Ian Maxond5b11d82017-01-25 10:48:05 -08003 | Generated by Apache Maven Doxia at 2017-01-25
Ian Maxonbf2c56b2017-01-24 14:14:49 -08004 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
Ian Maxond5b11d82017-01-25 10:48:05 -080010 <meta name="Date-Revision-yyyymmdd" content="20170125" />
Ian Maxonbf2c56b2017-01-24 14:14:49 -080011 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
Ian Maxonbf2c56b2017-01-24 14:14:49 -080022
Ian Maxonbf2c56b2017-01-24 14:14:49 -080023
24 </head>
25 <body class="topBarDisabled">
26
27
28
29
30 <div class="container-fluid">
31 <div id="banner">
32 <div class="pull-left">
33 <a href=".././" id="bannerLeft">
34 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
35 </a>
36 </div>
37 <div class="pull-right"> </div>
38 <div class="clear"><hr/></div>
39 </div>
40
41 <div id="breadcrumbs">
42 <ul class="breadcrumb">
43
44
Ian Maxond5b11d82017-01-25 10:48:05 -080045 <li id="publishDate">Last Published: 2017-01-25</li>
Ian Maxonbf2c56b2017-01-24 14:14:49 -080046
47
48
49 <li id="projectVersion" class="pull-right">Version: 0.9.0</li>
50
51 <li class="divider pull-right">|</li>
52
53 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
54 Documentation Home</a>
55 </li>
56
57 </ul>
58 </div>
59
60
61 <div class="row-fluid">
62 <div id="leftColumn" class="span3">
63 <div class="well sidebar-nav">
64
65
66 <ul class="nav nav-list">
67 <li class="nav-header">Get Started - Installation</li>
68
69 <li>
70
71 <a href="../ncservice.html" title="Option 1: using NCService">
72 <i class="none"></i>
73 Option 1: using NCService</a>
74 </li>
75
76 <li>
77
78 <a href="../install.html" title="Option 2: using Managix">
79 <i class="none"></i>
80 Option 2: using Managix</a>
81 </li>
82
83 <li>
84
85 <a href="../yarn.html" title="Option 3: using YARN">
86 <i class="none"></i>
87 Option 3: using YARN</a>
88 </li>
89 <li class="nav-header">AsterixDB Primer</li>
90
91 <li>
92
93 <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
94 <i class="none"></i>
95 Option 1: using SQL++</a>
96 </li>
97
98 <li>
99
100 <a href="../aql/primer.html" title="Option 2: using AQL">
101 <i class="none"></i>
102 Option 2: using AQL</a>
103 </li>
104 <li class="nav-header">Data Model</li>
105
106 <li>
107
108 <a href="../datamodel.html" title="The Asterix Data Model">
109 <i class="none"></i>
110 The Asterix Data Model</a>
111 </li>
112 <li class="nav-header">Queries - SQL++</li>
113
114 <li>
115
116 <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
117 <i class="none"></i>
118 The SQL++ Query Language</a>
119 </li>
120
121 <li>
122
123 <a href="../sqlpp/builtins.html" title="Builtin Functions">
124 <i class="none"></i>
125 Builtin Functions</a>
126 </li>
127 <li class="nav-header">Queries - AQL</li>
128
129 <li>
130
131 <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
132 <i class="none"></i>
133 The Asterix Query Language (AQL)</a>
134 </li>
135
136 <li>
137
138 <a href="../aql/builtins.html" title="Builtin Functions">
139 <i class="none"></i>
140 Builtin Functions</a>
141 </li>
142 <li class="nav-header">Advanced Features</li>
143
144 <li class="active">
145
146 <a href="#"><i class="none"></i>Support of Similarity Queries</a>
147 </li>
148
149 <li>
150
151 <a href="../aql/fulltext.html" title="Support of Full-text Queries">
152 <i class="none"></i>
153 Support of Full-text Queries</a>
154 </li>
155
156 <li>
157
158 <a href="../aql/externaldata.html" title="Accessing External Data">
159 <i class="none"></i>
160 Accessing External Data</a>
161 </li>
162
163 <li>
164
165 <a href="../feeds/tutorial.html" title="Support for Data Ingestion">
166 <i class="none"></i>
167 Support for Data Ingestion</a>
168 </li>
169
170 <li>
171
172 <a href="../udf.html" title="User Defined Functions">
173 <i class="none"></i>
174 User Defined Functions</a>
175 </li>
176
177 <li>
178
179 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
180 <i class="none"></i>
181 Filter-Based LSM Index Acceleration</a>
182 </li>
183 <li class="nav-header">API/SDK</li>
184
185 <li>
186
187 <a href="../api.html" title="HTTP API">
188 <i class="none"></i>
189 HTTP API</a>
190 </li>
191 </ul>
192
193
194
195 <hr class="divider" />
196
197 <div id="poweredBy">
198 <div class="clear"></div>
199 <div class="clear"></div>
200 <div class="clear"></div>
201 <a href=".././" title="AsterixDB" class="builtBy">
202 <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
203 </a>
204 </div>
205 </div>
206 </div>
207
208
209 <div id="bodyColumn" class="span9" >
210
211 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
212 ! or more contributor license agreements. See the NOTICE file
213 ! distributed with this work for additional information
214 ! regarding copyright ownership. The ASF licenses this file
215 ! to you under the Apache License, Version 2.0 (the
216 ! "License"); you may not use this file except in compliance
217 ! with the License. You may obtain a copy of the License at
218 !
219 ! http://www.apache.org/licenses/LICENSE-2.0
220 !
221 ! Unless required by applicable law or agreed to in writing,
222 ! software distributed under the License is distributed on an
223 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
224 ! KIND, either express or implied. See the License for the
225 ! specific language governing permissions and limitations
226 ! under the License.
227 ! --><h1>AsterixDB Support of Similarity Queries</h1>
228<div class="section">
229<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
230
231<ul>
232
233<li><a href="#Motivation">Motivation</a></li>
234
235<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
236
237<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
238
239<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
240
241<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
242</ul></div>
243<div class="section">
244<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
245<p>Similarity queries are widely used in applications where users need to find objects that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
246<div class="section">
247<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
248<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
249<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
250<div class="section">
251<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
252<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
253
254<div class="source">
255<div class="source">
256<pre> use dataverse TinySocial;
257
258 for $user in dataset('FacebookUsers')
259 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
260 where $ed &lt;= 2
261 return $user
262</pre></div></div>
263<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
264
265<div class="source">
266<div class="source">
267<pre> use dataverse TinySocial;
268
269 for $user in dataset('FacebookUsers')
270 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
271 where $sim &gt;= 0.6f
272 return $user
273</pre></div></div>
274<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
275
276<div class="source">
277<div class="source">
278<pre> use dataverse TinySocial;
279
280 set simfunction &quot;jaccard&quot;;
281 set simthreshold &quot;0.6f&quot;;
282
283 for $user in dataset('FacebookUsers')
284 where $user.friend-ids ~= [1,5,9,10]
285 return $user
286</pre></div></div>
287<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
288<div class="section">
289<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
290<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
291
292<div class="source">
293<div class="source">
294<pre> use dataverse TinySocial;
295
296 set simfunction &quot;edit-distance&quot;;
297 set simthreshold &quot;3&quot;;
298
299 for $fbu in dataset FacebookUsers
300 return {
301 &quot;id&quot;: $fbu.id,
302 &quot;name&quot;: $fbu.name,
303 &quot;similar-users&quot;: for $t in dataset TweetMessages
304 let $tu := $t.user
305 where $tu.name ~= $fbu.name
306 return {
307 &quot;twitter-screenname&quot;: $tu.screen-name,
308 &quot;twitter-name&quot;: $tu.name
309 }
310 };
311</pre></div></div></div>
312<div class="section">
313<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
314<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
315<div class="section">
316<h3><a name="NGram_Index"></a>NGram Index</h3>
317<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
318<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
319
320<div class="source">
321<div class="source">
322<pre> use dataverse TinySocial;
323
324 create index fbUserIdx on FacebookUsers(name) type ngram(3);
325</pre></div></div>
326<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
327<div class="section">
328<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
329
330<div class="source">
331<div class="source">
332<pre> use dataverse TinySocial;
333
334 for $user in dataset('FacebookUsers')
335 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
336 where $ed &lt;= 2
337 return $user
338</pre></div></div></div>
339<div class="section">
340<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
341
342<div class="source">
343<div class="source">
344<pre> use dataverse TinySocial;
345
346 for $user in dataset('FacebookUsers')
347 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
348 where $ed[0]
349 return $ed[1]
350</pre></div></div></div>
351<div class="section">
352<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
353
354<div class="source">
355<div class="source">
356<pre> use dataverse TinySocial;
357
358 for $user in dataset('FacebookUsers')
359 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
360 where $sim &gt;= 0.6f
361 return $user
362</pre></div></div></div>
363<div class="section">
364<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
365
366<div class="source">
367<div class="source">
368<pre> use dataverse TinySocial;
369
370 for $user in dataset('FacebookUsers')
371 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
372 where $sim[0]
373 return $user
374</pre></div></div></div>
375<div class="section">
376<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
377
378<div class="source">
379<div class="source">
380<pre> use dataverse TinySocial;
381
382 for $i in dataset('FacebookMessages')
383 where contains($i.message, &quot;phone&quot;)
384 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
385</pre></div></div></div></div>
386<div class="section">
387<h3><a name="Keyword_Index"></a>Keyword Index</h3>
388<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the objects with this token. The following two examples show how to create keyword index on two different types:</p>
389<div class="section">
390<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
391
392<div class="source">
393<div class="source">
394<pre> use dataverse TinySocial;
395
396 drop index FacebookMessages.fbMessageIdx if exists;
397 create index fbMessageIdx on FacebookMessages(message) type keyword;
398
399 for $o in dataset('FacebookMessages')
400 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
401 where $jacc[0]
402 return $o
403</pre></div></div></div>
404<div class="section">
405<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
406
407<div class="source">
408<div class="source">
409<pre> use dataverse TinySocial;
410
411 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
412
413 for $c in dataset('FacebookUsers')
414 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
415 where $jacc[0]
416 return $c
417</pre></div></div>
418<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
419 </div>
420 </div>
421 </div>
422
423 <hr/>
424
425 <footer>
426 <div class="container-fluid">
427 <div class="row span12">Copyright &copy; 2017
428 <a href="https://www.apache.org/">The Apache Software Foundation</a>.
429 All Rights Reserved.
430
431 </div>
432
433 <?xml version="1.0" encoding="UTF-8"?>
434<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
435 feather logo, and the Apache AsterixDB project logo are either
436 registered trademarks or trademarks of The Apache Software
437 Foundation in the United States and other countries.
438 All other marks mentioned may be trademarks or registered
439 trademarks of their respective owners.</div>
440
441
442 </div>
443 </footer>
444 </body>
445</html>