blob: 749d8ac15c09ac544811af0511b58aa0fa45b234 [file] [log] [blame]
Yingyi Bu08953b22016-03-25 15:23:26 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2016-03-25
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20160325" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
22<script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
23 (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
24 m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
25 })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
26
27 ga('create', 'UA-41536543-1', 'uci.edu');
28 ga('send', 'pageview');</script>
29
30 </head>
31 <body class="topBarDisabled">
32
33
34
35
36 <div class="container-fluid">
37 <div id="banner">
38 <div class="pull-left">
39 <a href="http://asterixdb.apache.org/" id="bannerLeft">
40 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
41 </a>
42 </div>
43 <div class="pull-right"> </div>
44 <div class="clear"><hr/></div>
45 </div>
46
47 <div id="breadcrumbs">
48 <ul class="breadcrumb">
49
50
51 <li id="publishDate">Last Published: 2016-03-25</li>
52
53
54
55 <li id="projectVersion" class="pull-right">Version: 0.8.8-incubating</li>
56
57 <li class="divider pull-right">|</li>
58
59 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
60 Documentation Home</a>
61 </li>
62
63 </ul>
64 </div>
65
66
67 <div class="row-fluid">
68 <div id="leftColumn" class="span3">
69 <div class="well sidebar-nav">
70
71
72 <ul class="nav nav-list">
73 <li class="nav-header">Documentation</li>
74
75 <li>
76
77 <a href="../install.html" title="Installing and Managing AsterixDB using Managix">
78 <i class="none"></i>
79 Installing and Managing AsterixDB using Managix</a>
80 </li>
81
82 <li>
83
84 <a href="../yarn.html" title="Deploying AsterixDB using YARN">
85 <i class="none"></i>
86 Deploying AsterixDB using YARN</a>
87 </li>
88
89 <li>
90
91 <a href="../aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
92 <i class="none"></i>
93 AsterixDB 101: An ADM and AQL Primer</a>
94 </li>
95
96 <li>
97
98 <a href="../aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
99 <i class="none"></i>
100 AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
101 </li>
102
103 <li>
104
Yingyi Bu08953b22016-03-25 15:23:26 -0700105 <a href="../aql/datamodel.html" title="Asterix Data Model (ADM)">
106 <i class="none"></i>
107 Asterix Data Model (ADM)</a>
108 </li>
109
110 <li>
111
112 <a href="../aql/manual.html" title="Asterix Query Language (AQL)">
113 <i class="none"></i>
114 Asterix Query Language (AQL)</a>
115 </li>
116
117 <li>
118
119 <a href="../aql/functions.html" title="AQL Functions">
120 <i class="none"></i>
121 AQL Functions</a>
122 </li>
123
124 <li>
125
126 <a href="../aql/allens.html" title="AQL Allen's Relations Functions">
127 <i class="none"></i>
128 AQL Allen's Relations Functions</a>
129 </li>
130
131 <li class="active">
132
133 <a href="#"><i class="none"></i>AQL Support of Similarity Queries</a>
134 </li>
135
136 <li>
137
138 <a href="../aql/externaldata.html" title="Accessing External Data">
139 <i class="none"></i>
140 Accessing External Data</a>
141 </li>
142
143 <li>
144
145 <a href="../feeds/tutorial.html" title="Support for Data Ingestion in AsterixDB">
146 <i class="none"></i>
147 Support for Data Ingestion in AsterixDB</a>
148 </li>
149
150 <li>
151
152 <a href="../udf.html" title="Support for User Defined Functions in AsterixDB">
153 <i class="none"></i>
154 Support for User Defined Functions in AsterixDB</a>
155 </li>
156
157 <li>
158
159 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
160 <i class="none"></i>
161 Filter-Based LSM Index Acceleration</a>
162 </li>
163
164 <li>
165
166 <a href="../api.html" title="HTTP API to AsterixDB">
167 <i class="none"></i>
168 HTTP API to AsterixDB</a>
169 </li>
170 </ul>
171
172
173
174 <hr class="divider" />
175
176 <div id="poweredBy">
177 <div class="clear"></div>
178 <div class="clear"></div>
179 <div class="clear"></div>
180 <a href="https://code.google.com/p/hyracks/" title="Hyracks" class="builtBy">
181 <img class="builtBy" alt="Hyracks" src="../images/hyrax_ts.png" />
182 </a>
183 </div>
184 </div>
185 </div>
186
187
188 <div id="bodyColumn" class="span9" >
189
190 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
191 ! or more contributor license agreements. See the NOTICE file
192 ! distributed with this work for additional information
193 ! regarding copyright ownership. The ASF licenses this file
194 ! to you under the Apache License, Version 2.0 (the
195 ! "License"); you may not use this file except in compliance
196 ! with the License. You may obtain a copy of the License at
197 !
198 ! http://www.apache.org/licenses/LICENSE-2.0
199 !
200 ! Unless required by applicable law or agreed to in writing,
201 ! software distributed under the License is distributed on an
202 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
203 ! KIND, either express or implied. See the License for the
204 ! specific language governing permissions and limitations
205 ! under the License.
206 ! --><h1>AsterixDB Support of Similarity Queries</h1>
207<div class="section">
208<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
209
210<ul>
211
212<li><a href="#Motivation">Motivation</a></li>
213
214<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
215
216<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
217
218<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
219
220<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
221</ul></div>
222<div class="section">
223<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
224<p>Similarity queries are widely used in applications where users need to find records that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
225<div class="section">
226<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
227<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
228<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
229<div class="section">
230<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
231<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
232
233<div class="source">
234<div class="source">
235<pre> use dataverse TinySocial;
236
237 for $user in dataset('FacebookUsers')
238 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
239 where $ed &lt;= 2
240 return $user
241</pre></div></div>
242<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
243
244<div class="source">
245<div class="source">
246<pre> use dataverse TinySocial;
247
248 for $user in dataset('FacebookUsers')
249 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
250 where $sim &gt;= 0.6f
251 return $user
252</pre></div></div>
253<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
254
255<div class="source">
256<div class="source">
257<pre> use dataverse TinySocial;
258
259 set simfunction &quot;jaccard&quot;;
260 set simthreshold &quot;0.6f&quot;;
261
262 for $user in dataset('FacebookUsers')
263 where $user.friend-ids ~= [1,5,9,10]
264 return $user
265</pre></div></div>
266<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
267<div class="section">
268<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
269<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
270
271<div class="source">
272<div class="source">
273<pre> use dataverse TinySocial;
274
275 set simfunction &quot;edit-distance&quot;;
276 set simthreshold &quot;3&quot;;
277
278 for $fbu in dataset FacebookUsers
279 return {
280 &quot;id&quot;: $fbu.id,
281 &quot;name&quot;: $fbu.name,
282 &quot;similar-users&quot;: for $t in dataset TweetMessages
283 let $tu := $t.user
284 where $tu.name ~= $fbu.name
285 return {
286 &quot;twitter-screenname&quot;: $tu.screen-name,
287 &quot;twitter-name&quot;: $tu.name
288 }
289 };
290</pre></div></div></div>
291<div class="section">
292<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
293<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
294<div class="section">
295<h3><a name="NGram_Index"></a>NGram Index</h3>
296<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
297<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
298
299<div class="source">
300<div class="source">
301<pre> use dataverse TinySocial;
302
303 create index fbUserIdx on FacebookUsers(name) type ngram(3);
304</pre></div></div>
305<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
306<div class="section">
307<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
308
309<div class="source">
310<div class="source">
311<pre> use dataverse TinySocial;
312
313 for $user in dataset('FacebookUsers')
314 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
315 where $ed &lt;= 2
316 return $user
317</pre></div></div></div>
318<div class="section">
319<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
320
321<div class="source">
322<div class="source">
323<pre> use dataverse TinySocial;
324
325 for $user in dataset('FacebookUsers')
326 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
327 where $ed[0]
328 return $ed[1]
329</pre></div></div></div>
330<div class="section">
331<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
332
333<div class="source">
334<div class="source">
335<pre> use dataverse TinySocial;
336
337 for $user in dataset('FacebookUsers')
338 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
339 where $sim &gt;= 0.6f
340 return $user
341</pre></div></div></div>
342<div class="section">
343<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
344
345<div class="source">
346<div class="source">
347<pre> use dataverse TinySocial;
348
349 for $user in dataset('FacebookUsers')
350 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
351 where $sim[0]
352 return $user
353</pre></div></div></div>
354<div class="section">
355<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
356
357<div class="source">
358<div class="source">
359<pre> use dataverse TinySocial;
360
361 for $i in dataset('FacebookMessages')
362 where contains($i.message, &quot;phone&quot;)
363 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
364</pre></div></div></div></div>
365<div class="section">
366<h3><a name="Keyword_Index"></a>Keyword Index</h3>
367<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the records with this token. The following two examples show how to create keyword index on two different types:</p>
368<div class="section">
369<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
370
371<div class="source">
372<div class="source">
373<pre> use dataverse TinySocial;
374
375 drop index FacebookMessages.fbMessageIdx if exists;
376 create index fbMessageIdx on FacebookMessages(message) type keyword;
377
378 for $o in dataset('FacebookMessages')
379 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
380 where $jacc[0]
381 return $o
382</pre></div></div></div>
383<div class="section">
384<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
385
386<div class="source">
387<div class="source">
388<pre> use dataverse TinySocial;
389
390 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
391
392 for $c in dataset('FacebookUsers')
393 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
394 where $jacc[0]
395 return $c
396</pre></div></div>
397<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
398 </div>
399 </div>
400 </div>
401
402 <hr/>
403
404 <footer>
405 <div class="container-fluid">
406 <div class="row span12">Copyright &copy; 2016
407 <a href="http://www.apache.org/">The Apache Software Foundation</a>.
408 All Rights Reserved.
409
410 </div>
411
412 <?xml version="1.0" encoding="UTF-8"?>
413<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
414 feather logo, and the Apache AsterixDB project logo are either
415 registered trademarks or trademarks of The Apache Software
416 Foundation in the United States and other countries.
417 All other marks mentioned may be trademarks or registered
418 trademarks of their respective owners.</div>
419
420
421 </div>
422 </footer>
423 </body>
424</html>