blob: 70396f73c0e925480ac19aacbb69e7891d262d02 [file] [log] [blame]
Yingyi Bu08953b22016-03-25 15:23:26 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2016-03-25
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20160325" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
22<script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
23 (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
24 m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
25 })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
26
27 ga('create', 'UA-41536543-1', 'uci.edu');
28 ga('send', 'pageview');</script>
29
30 </head>
31 <body class="topBarDisabled">
32
33
34
35
36 <div class="container-fluid">
37 <div id="banner">
38 <div class="pull-left">
39 <a href="http://asterixdb.apache.org/" id="bannerLeft">
40 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
41 </a>
42 </div>
43 <div class="pull-right"> </div>
44 <div class="clear"><hr/></div>
45 </div>
46
47 <div id="breadcrumbs">
48 <ul class="breadcrumb">
49
50
51 <li id="publishDate">Last Published: 2016-03-25</li>
52
53
54
55 <li id="projectVersion" class="pull-right">Version: 0.8.8-incubating</li>
56
57 <li class="divider pull-right">|</li>
58
59 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
60 Documentation Home</a>
61 </li>
62
63 </ul>
64 </div>
65
66
67 <div class="row-fluid">
68 <div id="leftColumn" class="span3">
69 <div class="well sidebar-nav">
70
71
72 <ul class="nav nav-list">
73 <li class="nav-header">Documentation</li>
74
75 <li>
76
77 <a href="../install.html" title="Installing and Managing AsterixDB using Managix">
78 <i class="none"></i>
79 Installing and Managing AsterixDB using Managix</a>
80 </li>
81
82 <li>
83
84 <a href="../yarn.html" title="Deploying AsterixDB using YARN">
85 <i class="none"></i>
86 Deploying AsterixDB using YARN</a>
87 </li>
88
89 <li>
90
91 <a href="../aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
92 <i class="none"></i>
93 AsterixDB 101: An ADM and AQL Primer</a>
94 </li>
95
96 <li>
97
98 <a href="../aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
99 <i class="none"></i>
100 AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
101 </li>
102
103 <li>
104
105 <a href="../aql/js-sdk.html" title="AsterixDB Javascript SDK">
106 <i class="none"></i>
107 AsterixDB Javascript SDK</a>
108 </li>
109
110 <li>
111
112 <a href="../aql/datamodel.html" title="Asterix Data Model (ADM)">
113 <i class="none"></i>
114 Asterix Data Model (ADM)</a>
115 </li>
116
117 <li>
118
119 <a href="../aql/manual.html" title="Asterix Query Language (AQL)">
120 <i class="none"></i>
121 Asterix Query Language (AQL)</a>
122 </li>
123
124 <li>
125
126 <a href="../aql/functions.html" title="AQL Functions">
127 <i class="none"></i>
128 AQL Functions</a>
129 </li>
130
131 <li>
132
133 <a href="../aql/allens.html" title="AQL Allen's Relations Functions">
134 <i class="none"></i>
135 AQL Allen's Relations Functions</a>
136 </li>
137
138 <li class="active">
139
140 <a href="#"><i class="none"></i>AQL Support of Similarity Queries</a>
141 </li>
142
143 <li>
144
145 <a href="../aql/externaldata.html" title="Accessing External Data">
146 <i class="none"></i>
147 Accessing External Data</a>
148 </li>
149
150 <li>
151
152 <a href="../feeds/tutorial.html" title="Support for Data Ingestion in AsterixDB">
153 <i class="none"></i>
154 Support for Data Ingestion in AsterixDB</a>
155 </li>
156
157 <li>
158
159 <a href="../udf.html" title="Support for User Defined Functions in AsterixDB">
160 <i class="none"></i>
161 Support for User Defined Functions in AsterixDB</a>
162 </li>
163
164 <li>
165
166 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
167 <i class="none"></i>
168 Filter-Based LSM Index Acceleration</a>
169 </li>
170
171 <li>
172
173 <a href="../api.html" title="HTTP API to AsterixDB">
174 <i class="none"></i>
175 HTTP API to AsterixDB</a>
176 </li>
177 </ul>
178
179
180
181 <hr class="divider" />
182
183 <div id="poweredBy">
184 <div class="clear"></div>
185 <div class="clear"></div>
186 <div class="clear"></div>
187 <a href="https://code.google.com/p/hyracks/" title="Hyracks" class="builtBy">
188 <img class="builtBy" alt="Hyracks" src="../images/hyrax_ts.png" />
189 </a>
190 </div>
191 </div>
192 </div>
193
194
195 <div id="bodyColumn" class="span9" >
196
197 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
198 ! or more contributor license agreements. See the NOTICE file
199 ! distributed with this work for additional information
200 ! regarding copyright ownership. The ASF licenses this file
201 ! to you under the Apache License, Version 2.0 (the
202 ! "License"); you may not use this file except in compliance
203 ! with the License. You may obtain a copy of the License at
204 !
205 ! http://www.apache.org/licenses/LICENSE-2.0
206 !
207 ! Unless required by applicable law or agreed to in writing,
208 ! software distributed under the License is distributed on an
209 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
210 ! KIND, either express or implied. See the License for the
211 ! specific language governing permissions and limitations
212 ! under the License.
213 ! --><h1>AsterixDB Support of Similarity Queries</h1>
214<div class="section">
215<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
216
217<ul>
218
219<li><a href="#Motivation">Motivation</a></li>
220
221<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
222
223<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
224
225<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
226
227<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
228</ul></div>
229<div class="section">
230<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
231<p>Similarity queries are widely used in applications where users need to find records that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
232<div class="section">
233<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
234<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
235<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
236<div class="section">
237<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
238<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
239
240<div class="source">
241<div class="source">
242<pre> use dataverse TinySocial;
243
244 for $user in dataset('FacebookUsers')
245 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
246 where $ed &lt;= 2
247 return $user
248</pre></div></div>
249<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
250
251<div class="source">
252<div class="source">
253<pre> use dataverse TinySocial;
254
255 for $user in dataset('FacebookUsers')
256 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
257 where $sim &gt;= 0.6f
258 return $user
259</pre></div></div>
260<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
261
262<div class="source">
263<div class="source">
264<pre> use dataverse TinySocial;
265
266 set simfunction &quot;jaccard&quot;;
267 set simthreshold &quot;0.6f&quot;;
268
269 for $user in dataset('FacebookUsers')
270 where $user.friend-ids ~= [1,5,9,10]
271 return $user
272</pre></div></div>
273<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
274<div class="section">
275<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
276<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
277
278<div class="source">
279<div class="source">
280<pre> use dataverse TinySocial;
281
282 set simfunction &quot;edit-distance&quot;;
283 set simthreshold &quot;3&quot;;
284
285 for $fbu in dataset FacebookUsers
286 return {
287 &quot;id&quot;: $fbu.id,
288 &quot;name&quot;: $fbu.name,
289 &quot;similar-users&quot;: for $t in dataset TweetMessages
290 let $tu := $t.user
291 where $tu.name ~= $fbu.name
292 return {
293 &quot;twitter-screenname&quot;: $tu.screen-name,
294 &quot;twitter-name&quot;: $tu.name
295 }
296 };
297</pre></div></div></div>
298<div class="section">
299<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
300<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
301<div class="section">
302<h3><a name="NGram_Index"></a>NGram Index</h3>
303<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
304<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
305
306<div class="source">
307<div class="source">
308<pre> use dataverse TinySocial;
309
310 create index fbUserIdx on FacebookUsers(name) type ngram(3);
311</pre></div></div>
312<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
313<div class="section">
314<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
315
316<div class="source">
317<div class="source">
318<pre> use dataverse TinySocial;
319
320 for $user in dataset('FacebookUsers')
321 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
322 where $ed &lt;= 2
323 return $user
324</pre></div></div></div>
325<div class="section">
326<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
327
328<div class="source">
329<div class="source">
330<pre> use dataverse TinySocial;
331
332 for $user in dataset('FacebookUsers')
333 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
334 where $ed[0]
335 return $ed[1]
336</pre></div></div></div>
337<div class="section">
338<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
339
340<div class="source">
341<div class="source">
342<pre> use dataverse TinySocial;
343
344 for $user in dataset('FacebookUsers')
345 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
346 where $sim &gt;= 0.6f
347 return $user
348</pre></div></div></div>
349<div class="section">
350<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
351
352<div class="source">
353<div class="source">
354<pre> use dataverse TinySocial;
355
356 for $user in dataset('FacebookUsers')
357 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
358 where $sim[0]
359 return $user
360</pre></div></div></div>
361<div class="section">
362<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
363
364<div class="source">
365<div class="source">
366<pre> use dataverse TinySocial;
367
368 for $i in dataset('FacebookMessages')
369 where contains($i.message, &quot;phone&quot;)
370 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
371</pre></div></div></div></div>
372<div class="section">
373<h3><a name="Keyword_Index"></a>Keyword Index</h3>
374<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the records with this token. The following two examples show how to create keyword index on two different types:</p>
375<div class="section">
376<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
377
378<div class="source">
379<div class="source">
380<pre> use dataverse TinySocial;
381
382 drop index FacebookMessages.fbMessageIdx if exists;
383 create index fbMessageIdx on FacebookMessages(message) type keyword;
384
385 for $o in dataset('FacebookMessages')
386 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
387 where $jacc[0]
388 return $o
389</pre></div></div></div>
390<div class="section">
391<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
392
393<div class="source">
394<div class="source">
395<pre> use dataverse TinySocial;
396
397 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
398
399 for $c in dataset('FacebookUsers')
400 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
401 where $jacc[0]
402 return $c
403</pre></div></div>
404<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
405 </div>
406 </div>
407 </div>
408
409 <hr/>
410
411 <footer>
412 <div class="container-fluid">
413 <div class="row span12">Copyright &copy; 2016
414 <a href="http://www.apache.org/">The Apache Software Foundation</a>.
415 All Rights Reserved.
416
417 </div>
418
419 <?xml version="1.0" encoding="UTF-8"?>
420<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
421 feather logo, and the Apache AsterixDB project logo are either
422 registered trademarks or trademarks of The Apache Software
423 Foundation in the United States and other countries.
424 All other marks mentioned may be trademarks or registered
425 trademarks of their respective owners.</div>
426
427
428 </div>
429 </footer>
430 </body>
431</html>