blob: e36d22fef5e3099f6cb530ed54d57e74b51ce690 [file] [log] [blame]
Ian Maxone2b799e2015-11-24 18:20:03 -08001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2015-11-24
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20151124" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; AsterixDB Support of Similarity Queries</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
Ian Maxone2b799e2015-11-24 18:20:03 -080022
Ian Maxone2b799e2015-11-24 18:20:03 -080023
24 </head>
25 <body class="topBarDisabled">
26
27
28
29
30 <div class="container-fluid">
31 <div id="banner">
32 <div class="pull-left">
33 <a href="http://asterixdb.apache.org/" id="bannerLeft">
34 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
35 </a>
36 </div>
37 <div class="pull-right"> </div>
38 <div class="clear"><hr/></div>
39 </div>
40
41 <div id="breadcrumbs">
42 <ul class="breadcrumb">
43
44
45 <li id="publishDate">Last Published: 2015-11-24</li>
46
47
48
49 <li id="projectVersion" class="pull-right">Version: 0.8.7-incubating</li>
50
51 <li class="divider pull-right">|</li>
52
53 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
54 Documentation Home</a>
55 </li>
56
57 </ul>
58 </div>
59
60
61 <div class="row-fluid">
62 <div id="leftColumn" class="span3">
63 <div class="well sidebar-nav">
64
65
66 <ul class="nav nav-list">
67 <li class="nav-header">Documentation</li>
68
69 <li>
70
71 <a href="../install.html" title="Installing and Managing AsterixDB using Managix">
72 <i class="none"></i>
73 Installing and Managing AsterixDB using Managix</a>
74 </li>
75
76 <li>
77
78 <a href="../yarn.html" title="Deploying AsterixDB using YARN">
79 <i class="none"></i>
80 Deploying AsterixDB using YARN</a>
81 </li>
82
83 <li>
84
85 <a href="../aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
86 <i class="none"></i>
87 AsterixDB 101: An ADM and AQL Primer</a>
88 </li>
89
90 <li>
91
92 <a href="../aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
93 <i class="none"></i>
94 AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
95 </li>
96
97 <li>
98
99 <a href="../aql/js-sdk.html" title="AsterixDB Javascript SDK">
100 <i class="none"></i>
101 AsterixDB Javascript SDK</a>
102 </li>
103
104 <li>
105
106 <a href="../aql/datamodel.html" title="Asterix Data Model (ADM)">
107 <i class="none"></i>
108 Asterix Data Model (ADM)</a>
109 </li>
110
111 <li>
112
113 <a href="../aql/manual.html" title="Asterix Query Language (AQL)">
114 <i class="none"></i>
115 Asterix Query Language (AQL)</a>
116 </li>
117
118 <li>
119
120 <a href="../aql/functions.html" title="AQL Functions">
121 <i class="none"></i>
122 AQL Functions</a>
123 </li>
124
125 <li>
126
127 <a href="../aql/allens.html" title="AQL Allen's Relations Functions">
128 <i class="none"></i>
129 AQL Allen's Relations Functions</a>
130 </li>
131
132 <li class="active">
133
134 <a href="#"><i class="none"></i>AQL Support of Similarity Queries</a>
135 </li>
136
137 <li>
138
139 <a href="../aql/externaldata.html" title="Accessing External Data">
140 <i class="none"></i>
141 Accessing External Data</a>
142 </li>
143
144 <li>
145
146 <a href="../feeds/tutorial.html" title="Support for Data Ingestion in AsterixDB">
147 <i class="none"></i>
148 Support for Data Ingestion in AsterixDB</a>
149 </li>
150
151 <li>
152
153 <a href="../udf.html" title="Support for User Defined Functions in AsterixDB">
154 <i class="none"></i>
155 Support for User Defined Functions in AsterixDB</a>
156 </li>
157
158 <li>
159
160 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
161 <i class="none"></i>
162 Filter-Based LSM Index Acceleration</a>
163 </li>
164
165 <li>
166
167 <a href="../api.html" title="HTTP API to AsterixDB">
168 <i class="none"></i>
169 HTTP API to AsterixDB</a>
170 </li>
171 </ul>
172
173
174
175 <hr class="divider" />
176
177 <div id="poweredBy">
178 <div class="clear"></div>
179 <div class="clear"></div>
180 <div class="clear"></div>
181 <a href="https://code.google.com/p/hyracks/" title="Hyracks" class="builtBy">
182 <img class="builtBy" alt="Hyracks" src="../images/hyrax_ts.png" />
183 </a>
184 </div>
185 </div>
186 </div>
187
188
189 <div id="bodyColumn" class="span9" >
190
191 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
192 ! or more contributor license agreements. See the NOTICE file
193 ! distributed with this work for additional information
194 ! regarding copyright ownership. The ASF licenses this file
195 ! to you under the Apache License, Version 2.0 (the
196 ! "License"); you may not use this file except in compliance
197 ! with the License. You may obtain a copy of the License at
198 !
199 ! http://www.apache.org/licenses/LICENSE-2.0
200 !
201 ! Unless required by applicable law or agreed to in writing,
202 ! software distributed under the License is distributed on an
203 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
204 ! KIND, either express or implied. See the License for the
205 ! specific language governing permissions and limitations
206 ! under the License.
207 ! --><h1>AsterixDB Support of Similarity Queries</h1>
208<div class="section">
209<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
210
211<ul>
212
213<li><a href="#Motivation">Motivation</a></li>
214
215<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
216
217<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
218
219<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
220
221<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
222</ul></div>
223<div class="section">
224<h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
225<p>Similarity queries are widely used in applications where users need to find records that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
226<div class="section">
227<h2><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
228<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
229<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
230<div class="section">
231<h2><a name="Similarity_Selection_Queries_Back_to_TOC"></a><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
232<p>The following query asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
233
234<div class="source">
235<div class="source">
236<pre> use dataverse TinySocial;
237
238 for $user in dataset('FacebookUsers')
239 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
240 where $ed &lt;= 2
241 return $user
242</pre></div></div>
243<p>The following query asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9,10]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
244
245<div class="source">
246<div class="source">
247<pre> use dataverse TinySocial;
248
249 for $user in dataset('FacebookUsers')
250 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
251 where $sim &gt;= 0.6f
252 return $user
253</pre></div></div>
254<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
255
256<div class="source">
257<div class="source">
258<pre> use dataverse TinySocial;
259
260 set simfunction &quot;jaccard&quot;;
261 set simthreshold &quot;0.6f&quot;;
262
263 for $user in dataset('FacebookUsers')
264 where $user.friend-ids ~= [1,5,9,10]
265 return $user
266</pre></div></div>
267<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
268<div class="section">
269<h2><a name="Similarity_Join_Queries_Back_to_TOC"></a><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
270<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
271
272<div class="source">
273<div class="source">
274<pre> use dataverse TinySocial;
275
276 set simfunction &quot;edit-distance&quot;;
277 set simthreshold &quot;3&quot;;
278
279 for $fbu in dataset FacebookUsers
280 return {
281 &quot;id&quot;: $fbu.id,
282 &quot;name&quot;: $fbu.name,
283 &quot;similar-users&quot;: for $t in dataset TweetMessages
284 let $tu := $t.user
285 where $tu.name ~= $fbu.name
286 return {
287 &quot;twitter-screenname&quot;: $tu.screen-name,
288 &quot;twitter-name&quot;: $tu.name
289 }
290 };
291</pre></div></div></div>
292<div class="section">
293<h2><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
294<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
295<div class="section">
296<h3><a name="NGram_Index"></a>NGram Index</h3>
297<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
298<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
299
300<div class="source">
301<div class="source">
302<pre> use dataverse TinySocial;
303
304 create index fbUserIdx on FacebookUsers(name) type ngram(3);
305</pre></div></div>
306<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
307<div class="section">
308<h4><a name="NGram_Index_usage_case_-_edit-distance"></a>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a></h4>
309
310<div class="source">
311<div class="source">
312<pre> use dataverse TinySocial;
313
314 for $user in dataset('FacebookUsers')
315 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
316 where $ed &lt;= 2
317 return $user
318</pre></div></div></div>
319<div class="section">
320<h4><a name="NGram_Index_usage_case_-_edit-distance-check"></a>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a></h4>
321
322<div class="source">
323<div class="source">
324<pre> use dataverse TinySocial;
325
326 for $user in dataset('FacebookUsers')
327 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
328 where $ed[0]
329 return $ed[1]
330</pre></div></div></div>
331<div class="section">
332<h4><a name="NGram_Index_usage_case_-_similarity-jaccard"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a></h4>
333
334<div class="source">
335<div class="source">
336<pre> use dataverse TinySocial;
337
338 for $user in dataset('FacebookUsers')
339 let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
340 where $sim &gt;= 0.6f
341 return $user
342</pre></div></div></div>
343<div class="section">
344<h4><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a></h4>
345
346<div class="source">
347<div class="source">
348<pre> use dataverse TinySocial;
349
350 for $user in dataset('FacebookUsers')
351 let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
352 where $sim[0]
353 return $user
354</pre></div></div></div>
355<div class="section">
356<h4><a name="NGram_Index_usage_case_-_contains"></a>NGram Index usage case - <a href="functions.html#contains">contains()</a></h4>
357
358<div class="source">
359<div class="source">
360<pre> use dataverse TinySocial;
361
362 for $i in dataset('FacebookMessages')
363 where contains($i.message, &quot;phone&quot;)
364 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
365</pre></div></div></div></div>
366<div class="section">
367<h3><a name="Keyword_Index"></a>Keyword Index</h3>
368<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the records with this token. The following two examples show how to create keyword index on two different types:</p>
369<div class="section">
370<h4><a name="Keyword_Index_on_String_Type"></a>Keyword Index on String Type</h4>
371
372<div class="source">
373<div class="source">
374<pre> use dataverse TinySocial;
375
376 drop index FacebookMessages.fbMessageIdx if exists;
377 create index fbMessageIdx on FacebookMessages(message) type keyword;
378
379 for $o in dataset('FacebookMessages')
380 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
381 where $jacc[0]
382 return $o
383</pre></div></div></div>
384<div class="section">
385<h4><a name="Keyword_Index_on_UnorderedList_Type"></a>Keyword Index on UnorderedList Type</h4>
386
387<div class="source">
388<div class="source">
389<pre> use dataverse TinySocial;
390
391 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
392
393 for $c in dataset('FacebookUsers')
394 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
395 where $jacc[0]
396 return $c
397</pre></div></div>
398<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
399 </div>
400 </div>
401 </div>
402
403 <hr/>
404
405 <footer>
406 <div class="container-fluid">
407 <div class="row span12">Copyright &copy; 2015
408 <a href="http://www.apache.org/">The Apache Software Foundation</a>.
409 All Rights Reserved.
410
411 </div>
412
413 <?xml version="1.0" encoding="UTF-8"?>
414<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
415 feather logo, and the Apache AsterixDB project logo are either
416 registered trademarks or trademarks of The Apache Software
417 Foundation in the United States and other countries.
418 All other marks mentioned may be trademarks or registered
419 trademarks of their respective owners.</div>
420
421
422 </div>
423 </footer>
424 </body>
425</html>