blob: 82c27cae5c1576464700383e408ad42c56357a94 [file] [log] [blame]
Ian Maxoned124d82015-05-29 18:44:11 -07001<!DOCTYPE html>
2<!--
Till Westmann0817a3f2015-06-03 21:08:18 -07003 | Generated by Apache Maven Doxia at 2015-05-31
Ian Maxoned124d82015-05-29 18:44:11 -07004 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
Till Westmann0817a3f2015-06-03 21:08:18 -070010 <meta name="Date-Revision-yyyymmdd" content="20150531" />
Ian Maxoned124d82015-05-29 18:44:11 -070011 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB - </title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21 </head>
22 <body class="topBarDisabled">
23
24
25
26
27 <div class="container-fluid">
28 <div id="banner">
29 <div class="pull-left">
30 <a href=".././" id="bannerLeft">
31 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
32 </a>
33 </div>
34 <div class="pull-right"> <a href="http://incubator.apache.org/" id="bannerRight">
35 <img src="../images/egg-logo.png" alt="Apache Software Foundation Incubator"/>
36 </a>
37 </div>
38 <div class="clear"><hr/></div>
39 </div>
40
41 <div id="breadcrumbs">
42 <ul class="breadcrumb">
43
44
Till Westmann0817a3f2015-06-03 21:08:18 -070045 <li id="publishDate">Last Published: 2015-05-31</li>
Ian Maxoned124d82015-05-29 18:44:11 -070046
47
48
49 <li id="projectVersion" class="pull-right">Version: 0.8.7-SNAPSHOT</li>
50
51 <li class="divider pull-right">|</li>
52
53 <li class="pull-right"> <a href="../index.html" title="Home">
54 Home</a>
55 </li>
56
57 </ul>
58 </div>
59
60
61 <div class="row-fluid">
62 <div id="leftColumn" class="span3">
63 <div class="well sidebar-nav">
64
65
66 <ul class="nav nav-list">
67 <li class="nav-header">Apache Software Foundation</li>
68
69 <li>
70
71 <a href="http://www.apache.org/" class="externalLink" title="Home">
72 <i class="none"></i>
73 Home</a>
74 </li>
75
76 <li>
77
78 <a href="http://www.apache.org/foundation/sponsorship.html" class="externalLink" title="Donate">
79 <i class="none"></i>
80 Donate</a>
81 </li>
82
83 <li>
84
85 <a href="http://www.apache.org/foundation/thanks.html" class="externalLink" title="Thanks">
86 <i class="none"></i>
87 Thanks</a>
88 </li>
89
90 <li>
91
92 <a href="http://www.apache.org/security/" class="externalLink" title="Security">
93 <i class="none"></i>
94 Security</a>
95 </li>
96 <li class="nav-header">User Documentation</li>
97
98 <li>
99
100 <a href="../install.html" title="Installing and Managing AsterixDB using Managix">
101 <i class="none"></i>
102 Installing and Managing AsterixDB using Managix</a>
103 </li>
104
105 <li>
106
107 <a href="../aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
108 <i class="none"></i>
109 AsterixDB 101: An ADM and AQL Primer</a>
110 </li>
111
112 <li>
113
114 <a href="../aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
115 <i class="none"></i>
116 AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
117 </li>
118
119 <li>
120
121 <a href="../aql/js-sdk.html" title="AsterixDB Javascript SDK">
122 <i class="none"></i>
123 AsterixDB Javascript SDK</a>
124 </li>
125
126 <li>
127
128 <a href="../aql/datamodel.html" title="Asterix Data Model (ADM)">
129 <i class="none"></i>
130 Asterix Data Model (ADM)</a>
131 </li>
132
133 <li>
134
135 <a href="../aql/manual.html" title="Asterix Query Language (AQL)">
136 <i class="none"></i>
137 Asterix Query Language (AQL)</a>
138 </li>
139
140 <li>
141
142 <a href="../aql/functions.html" title="AQL Functions">
143 <i class="none"></i>
144 AQL Functions</a>
145 </li>
146
147 <li>
148
149 <a href="../aql/allens.html" title="AQL Allen's Relations Functions">
150 <i class="none"></i>
151 AQL Allen's Relations Functions</a>
152 </li>
153
154 <li class="active">
155
156 <a href="#"><i class="none"></i>AQL Support of Similarity Queries</a>
157 </li>
158
159 <li>
160
161 <a href="../aql/externaldata.html" title="Accessing External Data">
162 <i class="none"></i>
163 Accessing External Data</a>
164 </li>
165
166 <li>
167
168 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
169 <i class="none"></i>
170 Filter-Based LSM Index Acceleration</a>
171 </li>
172
173 <li>
174
175 <a href="../api.html" title="REST API to AsterixDB">
176 <i class="none"></i>
177 REST API to AsterixDB</a>
178 </li>
179 </ul>
180
181
182
183 <hr class="divider" />
184
185 <div id="poweredBy">
186 <div class="clear"></div>
187 <div class="clear"></div>
188 <div class="clear"></div>
189 <a href=".././" title="Hyracks" class="builtBy">
190 <img class="builtBy" alt="Hyracks" src="../images/hyrax_ts.png" />
191 </a>
192 </div>
193 </div>
194 </div>
195
196
197 <div id="bodyColumn" class="span9" >
198
199 <h1>AsterixDB Support of Similarity Queries</h1>
200<div class="section">
201<h2><a name="toc" id="toc">Table of Contents</a><a name="Table_of_Contents"></a></h2>
202
203<ul>
204
205<li><a href="#Motivation">Motivation</a></li>
206
207<li><a href="#DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a></li>
208
209<li><a href="#SimilaritySelectionQueries">Similarity Selection Queries</a></li>
210
211<li><a href="#SimilarityJoinQueries">Similarity Join Queries</a></li>
212
213<li><a href="#UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a></li>
214</ul></div>
215<div class="section">
216<h2><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font><a name="Motivation_Back_to_TOC"></a></h2>
217<p>Similarity queries are widely used in applications where users need to find records that satisfy a similarity predicate, while exact matching is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies starring Schwarzenegger, while we don&#x2019;t know the exact spelling of his last name (despite his popularity in both the movie industry and politics :-)). As another example, we want to find all the Facebook users who have similar friends. To meet this type of needs, AsterixDB supports similarity queries using efficient indexes and algorithms.</p></div>
218<div class="section">
219<h2><a name="DataTypesAndSimilarityFunctions" id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font><a name="Data_Types_and_Similarity_Functions_Back_to_TOC"></a></h2>
220<p>AsterixDB supports <a class="externalLink" href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> (on strings) and <a class="externalLink" href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard</a> (on sets). For instance, in our <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">TinySocial</a> example, the <tt>friend-ids</tt> of a Facebook user forms a set of friends, and we can define a similarity between the sets of friends of two users. We can also convert a string to a set of grams of a length &#x201c;n&#x201d; (called &#x201c;n-grams&#x201d;) and define the Jaccard similarity between the two gram sets of the two strings. Formally, the &#x201c;n-grams&#x201d; of a string are its substrings of length &#x201c;n&#x201d;. For instance, the 3-grams of the string <tt>schwarzenegger</tt> are <tt>sch</tt>, <tt>chw</tt>, <tt>hwa</tt>, &#x2026;, <tt>ger</tt>.</p>
221<p>AsterixDB provides <a href="functions.html#Tokenizing_Functions">tokenization functions</a> to convert strings to sets, and the <a href="functions.html#Similarity_Functions">similarity functions</a>.</p></div>
222<div class="section">
223<h2><a name="SimilaritySelectionQueries" id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font><a name="Similarity_Selection_Queries_Back_to_TOC"></a></h2>
224<p>The following <a href="functions.html#edit-distance">query</a> asks for all the Facebook users whose name is similar to <tt>Suzanna Tilson</tt>, i.e., their edit distance is at most 2.</p>
225
226<div class="source">
227<pre> use dataverse TinySocial;
228
229 for $user in dataset('FacebookUsers')
230 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
231 where $ed &lt;= 2
232 return $user
233</pre></div>
234<p>The following <a href="functions.html#similarity-jaccard">query</a> asks for all the Facebook users whose set of friend ids is similar to <tt>[1,5,9]</tt>, i.e., their Jaccard similarity is at least 0.6.</p>
235
236<div class="source">
237<pre> use dataverse TinySocial;
238
239 for $user in dataset('FacebookUsers')
240 let $sim := similarity-jaccard($user.friend-ids, [1,5,9])
241 where $sim &gt;= 0.6f
242 return $user
243</pre></div>
244<p>AsterixDB allows a user to use a similarity operator <tt>~=</tt> to express a condition by defining the similarity function and threshold using &#x201c;set&#x201d; statements earlier. For instance, the above query can be equivalently written as:</p>
245
246<div class="source">
247<pre> use dataverse TinySocial;
248
249 set simfunction &quot;jaccard&quot;;
250 set simthreshold &quot;0.6f&quot;;
251
252 for $user in dataset('FacebookUsers')
253 where $user.friend-ids ~= [1,5,9]
254 return $user
255</pre></div>
256<p>In this query, we first declare Jaccard as the similarity function using <tt>simfunction</tt> and then specify the threshold <tt>0.6f</tt> using <tt>simthreshold</tt>.</p></div>
257<div class="section">
258<h2><a name="SimilarityJoinQueries" id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font><a name="Similarity_Join_Queries_Back_to_TOC"></a></h2>
259<p>AsterixDB supports fuzzy joins between two sets. The following <a href="primer.html#Query_5_-_Fuzzy_Join">query</a> finds, for each Facebook user, all Twitter users with names similar to their name based on the edit distance.</p>
260
261<div class="source">
262<pre> use dataverse TinySocial;
263
264 set simfunction &quot;edit-distance&quot;;
265 set simthreshold &quot;3&quot;;
266
267 for $fbu in dataset FacebookUsers
268 return {
269 &quot;id&quot;: $fbu.id,
270 &quot;name&quot;: $fbu.name,
271 &quot;similar-users&quot;: for $t in dataset TweetMessages
272 let $tu := $t.user
273 where $tu.name ~= $fbu.name
274 return {
275 &quot;twitter-screenname&quot;: $tu.screen-name,
276 &quot;twitter-name&quot;: $tu.name
277 }
278 };
279</pre></div></div>
280<div class="section">
281<h2><a name="UsingIndexesToSupportSimilarityQueries" id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font><a name="Using_Indexes_to_Support_Similarity_Queries_Back_to_TOC"></a></h2>
282<p>AsterixDB uses two types of indexes to support similarity queries, namely &#x201c;ngram index&#x201d; and &#x201c;keyword index&#x201d;.</p>
283<div class="section">
284<h3>NGram Index<a name="NGram_Index"></a></h3>
285<p>An &#x201c;ngram index&#x201d; is constructed on a set of strings. We generate n-grams for each string, and build an inverted list for each n-gram that includes the ids of the strings with this gram. A similarity query can be answered efficiently by accessing the inverted lists of the grams in the query and counting the number of occurrences of the string ids on these inverted lists. The similar idea can be used to answer queries with Jaccard similarity. A detailed description of these techniques is available at this <a class="externalLink" href="http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf">paper</a>.</p>
286<p>For instance, the following DDL statements create an ngram index on the <tt>FacebookUsers.name</tt> attribute using an inverted index of 3-grams.</p>
287
288<div class="source">
289<pre> use dataverse TinySocial;
290
291 create index fbUserIdx on FacebookUsers(name) type ngram(3);
292</pre></div>
293<p>The number &#x201c;3&#x201d; in &#x201c;ngram(3)&#x201d; is the length &#x201c;n&#x201d; in the grams. This index can be used to optimize similarity queries on this attribute using <a href="functions.html#edit-distance">edit-distance</a>, <a href="functions.html#edit-distance-check">edit-distance-check</a>, <a href="functions.html#similarity-jaccard">similarity-jaccard</a>, or <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a> queries on this attribute where the similarity is defined on sets of 3-grams. This index can also be used to optimize queries with the &#x201c;<a href="functions.html#contains">contains()</a>&#x201d; predicate (i.e., substring matching) since it can be also be solved by counting on the inverted lists of the grams in the query string.</p>
294<div class="section">
295<h4>NGram Index usage case - <a href="functions.html#edit-distance">edit-distance</a><a name="NGram_Index_usage_case_-_edit-distance"></a></h4>
296
297<div class="source">
298<pre> use dataverse TinySocial;
299
300 for $user in dataset('FacebookUsers')
301 let $ed := edit-distance($user.name, &quot;Suzanna Tilson&quot;)
302 where $ed &lt;= 2
303 return $user
304</pre></div></div>
305<div class="section">
306<h4>NGram Index usage case - <a href="functions.html#edit-distance-check">edit-distance-check</a><a name="NGram_Index_usage_case_-_edit-distance-check"></a></h4>
307
308<div class="source">
309<pre> use dataverse TinySocial;
310
311 for $user in dataset('FacebookUsers')
312 let $ed := edit-distance-check($user.name, &quot;Suzanna Tilson&quot;, 2)
313 where $ed[0]
314 return $ed[1]
315</pre></div></div>
316<div class="section">
317<h4>NGram Index usage case - <a href="functions.html#similarity-jaccard">similarity-jaccard</a><a name="NGram_Index_usage_case_-_similarity-jaccard"></a></h4>
318
319<div class="source">
320<pre> use dataverse TinySocial;
321
322 for $user in dataset('FacebookUsers')
323 let $sim := similarity-jaccard($user.friend-ids, [1,5,9])
324 where $sim &gt;= 0.6f
325 return $user
326</pre></div></div>
327<div class="section">
328<h4>NGram Index usage case - <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a><a name="NGram_Index_usage_case_-_similarity-jaccard-check"></a></h4>
329
330<div class="source">
331<pre> use dataverse TinySocial;
332
333 for $user in dataset('FacebookUsers')
334 let $sim := similarity-jaccard($user.friend-ids, [1,5,9])
335 where $sim &gt;= 0.6f
336 return $user
337</pre></div></div>
338<div class="section">
339<h4>NGram Index usage case - <a href="functions.html#contains">contains()</a><a name="NGram_Index_usage_case_-_contains"></a></h4>
340
341<div class="source">
342<pre> use dataverse TinySocial;
343
344 for $i in dataset('FacebookMessages')
345 where contains($i.message, &quot;phone&quot;)
346 return {&quot;mid&quot;: $i.message-id, &quot;message&quot;: $i.message}
347</pre></div></div></div>
348<div class="section">
349<h3>Keyword Index<a name="Keyword_Index"></a></h3>
350<p>A &#x201c;keyword index&#x201d; is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the records with this token. The following two examples show how to create keyword index on two different types:</p>
351<div class="section">
352<h4>Keyword Index on String Type<a name="Keyword_Index_on_String_Type"></a></h4>
353
354<div class="source">
355<pre> use dataverse TinySocial;
356
357 create index fbMessageIdx on FacebookMessages(message) type keyword;
358
359 for $o in dataset('FacebookMessages')
360 let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens(&quot;love like verizon&quot;), 0.2f)
361 where $jacc[0]
362 return $o
363</pre></div></div>
364<div class="section">
365<h4>Keyword Index on UnorderedList Type<a name="Keyword_Index_on_UnorderedList_Type"></a></h4>
366
367<div class="source">
368<pre> use dataverse TinySocial;
369
370 create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
371
372 for $c in dataset('FacebookUsers')
373 let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
374 where $jacc[0]
375 return $c
376</pre></div>
377<p>As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including <a href="functions.html#similarity-jaccard">similarity-jaccard</a> and <a href="functions.html#similarity-jaccard-check">similarity-jaccard-check</a>.</p></div></div></div>
378 </div>
379 </div>
380 </div>
381
382 <hr/>
383
384 <footer>
385 <div class="container-fluid">
386 <div class="row span12">Copyright &copy; 2015.
387 All Rights Reserved.
388
389 </div>
390
391 <?xml version="1.0" encoding="UTF-8"?>
392<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
393 feather logo, and the Apache AsterixDB project logo are either
394 registered trademarks or trademarks of The Apache Software
395 Foundation in the United States and other countries.
396 All other marks mentioned may be trademarks or registered
397 trademarks of their respective owners.</div>
398
399
400 </div>
401 </footer>
402 </body>
403</html>