Ian Maxon | 100cb80 | 2017-04-24 18:48:07 -0700 | [diff] [blame^] | 1 | <!DOCTYPE html> |
| 2 | <!-- |
| 3 | | Generated by Apache Maven Doxia at 2017-04-24 |
| 4 | | Rendered using Apache Maven Fluido Skin 1.3.0 |
| 5 | --> |
| 6 | <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> |
| 7 | <head> |
| 8 | <meta charset="UTF-8" /> |
| 9 | <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
| 10 | <meta name="Date-Revision-yyyymmdd" content="20170424" /> |
| 11 | <meta http-equiv="Content-Language" content="en" /> |
| 12 | <title>AsterixDB – AsterixDB Support of Full-text search queries</title> |
| 13 | <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" /> |
| 14 | <link rel="stylesheet" href="../css/site.css" /> |
| 15 | <link rel="stylesheet" href="../css/print.css" media="print" /> |
| 16 | |
| 17 | |
| 18 | <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script> |
| 19 | |
| 20 | |
| 21 | |
| 22 | <script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| 23 | (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| 24 | m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| 25 | })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| 26 | |
| 27 | ga('create', 'UA-41536543-1', 'uci.edu'); |
| 28 | ga('send', 'pageview');</script> |
| 29 | |
| 30 | </head> |
| 31 | <body class="topBarDisabled"> |
| 32 | |
| 33 | |
| 34 | |
| 35 | |
| 36 | <div class="container-fluid"> |
| 37 | <div id="banner"> |
| 38 | <div class="pull-left"> |
| 39 | <a href=".././" id="bannerLeft"> |
| 40 | <img src="../images/asterixlogo.png" alt="AsterixDB"/> |
| 41 | </a> |
| 42 | </div> |
| 43 | <div class="pull-right"> </div> |
| 44 | <div class="clear"><hr/></div> |
| 45 | </div> |
| 46 | |
| 47 | <div id="breadcrumbs"> |
| 48 | <ul class="breadcrumb"> |
| 49 | |
| 50 | |
| 51 | <li id="publishDate">Last Published: 2017-04-24</li> |
| 52 | |
| 53 | |
| 54 | |
| 55 | <li id="projectVersion" class="pull-right">Version: 0.9.1</li> |
| 56 | |
| 57 | <li class="divider pull-right">|</li> |
| 58 | |
| 59 | <li class="pull-right"> <a href="../index.html" title="Documentation Home"> |
| 60 | Documentation Home</a> |
| 61 | </li> |
| 62 | |
| 63 | </ul> |
| 64 | </div> |
| 65 | |
| 66 | |
| 67 | <div class="row-fluid"> |
| 68 | <div id="leftColumn" class="span3"> |
| 69 | <div class="well sidebar-nav"> |
| 70 | |
| 71 | |
| 72 | <ul class="nav nav-list"> |
| 73 | <li class="nav-header">Get Started - Installation</li> |
| 74 | |
| 75 | <li> |
| 76 | |
| 77 | <a href="../ncservice.html" title="Option 1: using NCService"> |
| 78 | <i class="none"></i> |
| 79 | Option 1: using NCService</a> |
| 80 | </li> |
| 81 | |
| 82 | <li> |
| 83 | |
| 84 | <a href="../ansible.html" title="Option 2: using Ansible"> |
| 85 | <i class="none"></i> |
| 86 | Option 2: using Ansible</a> |
| 87 | </li> |
| 88 | |
| 89 | <li> |
| 90 | |
| 91 | <a href="../aws.html" title="Option 3: using Amazon Web Services"> |
| 92 | <i class="none"></i> |
| 93 | Option 3: using Amazon Web Services</a> |
| 94 | </li> |
| 95 | |
| 96 | <li> |
| 97 | |
| 98 | <a href="../yarn.html" title="Option 4: using YARN"> |
| 99 | <i class="none"></i> |
| 100 | Option 4: using YARN</a> |
| 101 | </li> |
| 102 | |
| 103 | <li> |
| 104 | |
| 105 | <a href="../install.html" title="Option 5: using Managix (deprecated)"> |
| 106 | <i class="none"></i> |
| 107 | Option 5: using Managix (deprecated)</a> |
| 108 | </li> |
| 109 | <li class="nav-header">AsterixDB Primer</li> |
| 110 | |
| 111 | <li> |
| 112 | |
| 113 | <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++"> |
| 114 | <i class="none"></i> |
| 115 | Option 1: using SQL++</a> |
| 116 | </li> |
| 117 | |
| 118 | <li> |
| 119 | |
| 120 | <a href="../aql/primer.html" title="Option 2: using AQL"> |
| 121 | <i class="none"></i> |
| 122 | Option 2: using AQL</a> |
| 123 | </li> |
| 124 | <li class="nav-header">Data Model</li> |
| 125 | |
| 126 | <li> |
| 127 | |
| 128 | <a href="../datamodel.html" title="The Asterix Data Model"> |
| 129 | <i class="none"></i> |
| 130 | The Asterix Data Model</a> |
| 131 | </li> |
| 132 | <li class="nav-header">Queries - SQL++</li> |
| 133 | |
| 134 | <li> |
| 135 | |
| 136 | <a href="../sqlpp/manual.html" title="The SQL++ Query Language"> |
| 137 | <i class="none"></i> |
| 138 | The SQL++ Query Language</a> |
| 139 | </li> |
| 140 | |
| 141 | <li> |
| 142 | |
| 143 | <a href="../sqlpp/builtins.html" title="Builtin Functions"> |
| 144 | <i class="none"></i> |
| 145 | Builtin Functions</a> |
| 146 | </li> |
| 147 | <li class="nav-header">Queries - AQL</li> |
| 148 | |
| 149 | <li> |
| 150 | |
| 151 | <a href="../aql/manual.html" title="The Asterix Query Language (AQL)"> |
| 152 | <i class="none"></i> |
| 153 | The Asterix Query Language (AQL)</a> |
| 154 | </li> |
| 155 | |
| 156 | <li> |
| 157 | |
| 158 | <a href="../aql/builtins.html" title="Builtin Functions"> |
| 159 | <i class="none"></i> |
| 160 | Builtin Functions</a> |
| 161 | </li> |
| 162 | <li class="nav-header">API/SDK</li> |
| 163 | |
| 164 | <li> |
| 165 | |
| 166 | <a href="../api.html" title="HTTP API"> |
| 167 | <i class="none"></i> |
| 168 | HTTP API</a> |
| 169 | </li> |
| 170 | |
| 171 | <li> |
| 172 | |
| 173 | <a href="../csv.html" title="CSV Output"> |
| 174 | <i class="none"></i> |
| 175 | CSV Output</a> |
| 176 | </li> |
| 177 | <li class="nav-header">Advanced Features</li> |
| 178 | |
| 179 | <li class="active"> |
| 180 | |
| 181 | <a href="#"><i class="none"></i>Support of Full-text Queries</a> |
| 182 | </li> |
| 183 | |
| 184 | <li> |
| 185 | |
| 186 | <a href="../aql/externaldata.html" title="Accessing External Data"> |
| 187 | <i class="none"></i> |
| 188 | Accessing External Data</a> |
| 189 | </li> |
| 190 | |
| 191 | <li> |
| 192 | |
| 193 | <a href="../feeds/tutorial.html" title="Support for Data Ingestion"> |
| 194 | <i class="none"></i> |
| 195 | Support for Data Ingestion</a> |
| 196 | </li> |
| 197 | |
| 198 | <li> |
| 199 | |
| 200 | <a href="../udf.html" title="User Defined Functions"> |
| 201 | <i class="none"></i> |
| 202 | User Defined Functions</a> |
| 203 | </li> |
| 204 | |
| 205 | <li> |
| 206 | |
| 207 | <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration"> |
| 208 | <i class="none"></i> |
| 209 | Filter-Based LSM Index Acceleration</a> |
| 210 | </li> |
| 211 | |
| 212 | <li> |
| 213 | |
| 214 | <a href="../aql/similarity.html" title="Support of Similarity Queries"> |
| 215 | <i class="none"></i> |
| 216 | Support of Similarity Queries</a> |
| 217 | </li> |
| 218 | </ul> |
| 219 | |
| 220 | |
| 221 | |
| 222 | <hr class="divider" /> |
| 223 | |
| 224 | <div id="poweredBy"> |
| 225 | <div class="clear"></div> |
| 226 | <div class="clear"></div> |
| 227 | <div class="clear"></div> |
| 228 | <a href=".././" title="AsterixDB" class="builtBy"> |
| 229 | <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" /> |
| 230 | </a> |
| 231 | </div> |
| 232 | </div> |
| 233 | </div> |
| 234 | |
| 235 | |
| 236 | <div id="bodyColumn" class="span9" > |
| 237 | |
| 238 | <!-- ! Licensed to the Apache Software Foundation (ASF) under one |
| 239 | ! or more contributor license agreements. See the NOTICE file |
| 240 | ! distributed with this work for additional information |
| 241 | ! regarding copyright ownership. The ASF licenses this file |
| 242 | ! to you under the Apache License, Version 2.0 (the |
| 243 | ! "License"); you may not use this file except in compliance |
| 244 | ! with the License. You may obtain a copy of the License at |
| 245 | ! |
| 246 | ! http://www.apache.org/licenses/LICENSE-2.0 |
| 247 | ! |
| 248 | ! Unless required by applicable law or agreed to in writing, |
| 249 | ! software distributed under the License is distributed on an |
| 250 | ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 251 | ! KIND, either express or implied. See the License for the |
| 252 | ! specific language governing permissions and limitations |
| 253 | ! under the License. |
| 254 | ! --><h1>AsterixDB Support of Full-text search queries</h1> |
| 255 | <div class="section"> |
| 256 | <h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2> |
| 257 | |
| 258 | <ul> |
| 259 | |
| 260 | <li><a href="#Motivation">Motivation</a></li> |
| 261 | |
| 262 | <li><a href="#Syntax">Syntax</a></li> |
| 263 | |
| 264 | <li><a href="#FulltextIndex">Creating and utilizing a Full-text index</a></li> |
| 265 | </ul></div> |
| 266 | <div class="section"> |
| 267 | <h2><a name="Motivation_Back_to_TOC"></a><a name="Motivation" id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2> |
| 268 | <p>Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy an FTS predicate, i.e., where simple string-based matching is not sufficient. They matter whenever the task is to find documents that contain a certain keyword. FTS queries differ from substring-matching queries in that they match the query predicate against whole words in the given string, rather than treating the predicate as a sequence of characters. For example, an FTS query for “rain” returns a document only when it contains “rain” as a word. A substring-matching query, by contrast, returns a document whenever it contains “rain” as a substring, so a document with “brain” or “training” would be returned as well.</p></div> |
| 269 | <div class="section"> |
| 270 | <h2><a name="Syntax_Back_to_TOC"></a><a name="Syntax" id="Syntax">Syntax</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2> |
| 271 | <p>The syntax of AsterixDB FTS follows a subset of the XQuery FullText Search syntax. The two basic forms are as follows:</p> |
| 272 | |
| 273 | <div class="source"> |
| 274 | <div class="source"> |
| 275 | <pre> ftcontains(Expression1, Expression2, {FullTextOption}) |
| 276 | ftcontains(Expression1, Expression2) |
| 277 | </pre></div></div> |
| 278 | <p>For example, we can execute the following query to find tweet messages where the <tt>message-text</tt> field includes “voice” as a word. Note that FTS is case-insensitive, so “Voice” and “voice” are treated as the same word.</p> |
| 279 | |
| 280 | <div class="source"> |
| 281 | <div class="source"> |
| 282 | <pre> use dataverse TinySocial; |
| 283 | |
| 284 | for $msg in dataset TweetMessages |
| 285 | where ftcontains($msg.message-text, "voice", {"mode":"any"}) |
| 286 | return {"id": $msg.id} |
| 287 | </pre></div></div> |
| 288 | <p>The DDL and DML of TinySocial can be found in <a href="primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB">ADM: Modeling Semistructured Data in AsterixDB</a>.</p> |
| 289 | <p>The same query can also be expressed in SQL++.</p> |
| 290 | |
| 291 | <div class="source"> |
| 292 | <div class="source"> |
| 293 | <pre> use TinySocial; |
| 294 | |
| 295 | select element {"id":msg.id} |
| 296 | from TweetMessages as msg |
| 297 | where TinySocial.ftcontains(msg.`message-text`, "voice", {"mode":"any"}) |
| 298 | </pre></div></div> |
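<p>To make the contrast with substring matching concrete, the sketch below applies the same filter twice: once with <tt>ftcontains</tt> and once with the builtin string function <tt>contains</tt>. It assumes the TinySocial dataverse from the primer is loaded. Which tweets each query returns depends on the data, so no output is shown. Going by the description in the Motivation section, the second query would be expected to also match messages in which “voice” appears only inside a longer word.</p>

<div class="source">
<div class="source">
<pre> use dataverse TinySocial;

 // Word match: "voice" must occur as a whole word.
 for $msg in dataset TweetMessages
 where ftcontains($msg.message-text, "voice")
 return {"id": $msg.id};

 // Substring match: the characters "voice" may occur anywhere in the text.
 for $msg in dataset TweetMessages
 where contains($msg.message-text, "voice")
 return {"id": $msg.id};
</pre></div></div>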
| 299 | <p><tt>Expression1</tt> is an expression that must evaluate to a string at runtime, as in the example above, where <tt>$msg.message-text</tt> is a string field. <tt>Expression2</tt> can be a string, an (un)ordered list of string value(s), or an expression. In the last case, the expression must evaluate to one of the first two types, i.e., a string value or an (un)ordered list of string value(s).</p> |
| 300 | <p>The following examples are all valid expressions.</p> |
| 301 | |
| 302 | <div class="source"> |
| 303 | <div class="source"> |
| 304 | <pre> ... where ftcontains($msg.message-text, "sound") |
| 305 | ... where ftcontains($msg.message-text, "sound", {"mode":"any"}) |
| 306 | ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"}) |
| 307 | ... where ftcontains($msg.message-text, {{"speed", "stand", "customization"}}, {"mode":"all"}) |
| 308 | ... where ftcontains($msg.message-text, let $keyword_list := ["voice", "system"] return $keyword_list, {"mode":"all"}) |
| 309 | ... where ftcontains($msg.message-text, $keyword_list, {"mode":"any"}) |
| 310 | </pre></div></div> |
| 311 | <p>In the last example above, <tt>$keyword_list</tt> should evaluate to a string or an (un)ordered list of string value(s).</p> |
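<p>As a minimal sketch of that last form, the keyword list can be bound with a <tt>let</tt> clause before it is passed to <tt>ftcontains</tt>. The keywords are illustrative only, and the query assumes the TinySocial dataverse from the primer is loaded.</p>

<div class="source">
<div class="source">
<pre> use dataverse TinySocial;

 // Bind the keywords once, then pass the variable as Expression2.
 let $keyword_list := ["voice", "system"]
 for $msg in dataset TweetMessages
 where ftcontains($msg.message-text, $keyword_list, {"mode":"any"})
 return {"id": $msg.id}
</pre></div></div>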
| 312 | <p>The optional last parameter, <tt>FullTextOption</tt>, refines the given FTS request. If you omit it, each option takes its default value. Currently the only option is <tt>mode</tt>; more options will be added as the FTS feature is extended. Note that <tt>FullTextOption</tt> is a record, so the option(s) must be enclosed in <tt>{}</tt>. The <tt>mode</tt> option indicates whether the FTS query is a conjunctive (AND) or disjunctive (OR) search request. Its value can be either <tt>"any"</tt> or <tt>"all"</tt>, and the default is <tt>"all"</tt>. Specifying <tt>"any"</tt> requests a disjunctive search. For example, the following query finds documents whose <tt>message-text</tt> field contains “sound” or “system”, so a document is returned if it contains “sound”, “system”, or both.</p> |
| 313 | |
| 314 | <div class="source"> |
| 315 | <div class="source"> |
| 316 | <pre> ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"}) |
| 317 | </pre></div></div> |
| 318 | <p>The other value, <tt>"all"</tt>, specifies a conjunctive search. Both of the following examples find documents whose <tt>message-text</tt> field contains both “sound” and “system”; the second one relies on the default mode. A document that contains only one of the two keywords is not returned.</p> |
| 319 | |
| 320 | <div class="source"> |
| 321 | <div class="source"> |
| 322 | <pre> ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"}) |
| 323 | ... where ftcontains($msg.message-text, ["sound", "system"]) |
| 324 | </pre></div></div> |
| 325 | <p>AsterixDB does not yet support phrase searches, so the following query will not work.</p> |
| 326 | |
| 327 | <div class="source"> |
| 328 | <div class="source"> |
| 329 | <pre> ... where ftcontains($msg.message-text, "sound system", {"mode":"any"}) |
| 330 | </pre></div></div> |
| 331 | <p>As a workaround, the following queries achieve a roughly similar goal. They find documents where <tt>$msg.message-text</tt> contains both “sound” and “system”, but unlike a phrase search they do not check the order or adjacency of the two words. As a result, they also return documents such as “sound system can be installed.”, “system sound is perfect.”, or “sound is not clear. You may need to install a new system.”</p> |
| 332 | |
| 333 | <div class="source"> |
| 334 | <div class="source"> |
| 335 | <pre> ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"}) |
| 336 | ... where ftcontains($msg.message-text, ["sound", "system"]) |
| 337 | </pre></div></div></div> |
| 338 | <div class="section"> |
| 339 | <h2><a name="Creating_and_utilizing_a_Full-text_index_Back_to_TOC"></a><a name="FulltextIndex" id="FulltextIndex">Creating and utilizing a Full-text index</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2> |
| 340 | <p>When there is a full-text index on the field being searched, AsterixDB can use that index to speed up an FTS query instead of scanning all records. To create a full-text index, specify the index type as <tt>fulltext</tt> in your DDL statement. For instance, the following DDL statement creates a full-text index on the TweetMessages.message-text attribute.</p> |
| 341 | |
| 342 | <div class="source"> |
| 343 | <div class="source"> |
| 344 | <pre>create index messageFTSIdx on TweetMessages(message-text) type fulltext; |
| 345 | </pre></div></div></div> |
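<p>Putting the two pieces together, a session might look like the sketch below, which assumes the TinySocial dataverse from the primer and reuses the index name from the statement above. The query does not mention the index anywhere. Whether the optimizer actually chooses it for a particular query is not specified on this page, so treat that as an assumption to check against the query plan.</p>

<div class="source">
<div class="source">
<pre> use dataverse TinySocial;

 // Build the full-text index once.
 create index messageFTSIdx on TweetMessages(message-text) type fulltext;

 // An FTS predicate on the indexed field; written exactly as it would be without the index.
 for $msg in dataset TweetMessages
 where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"})
 return {"id": $msg.id}
</pre></div></div>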
| 346 | </div> |
| 347 | </div> |
| 348 | </div> |
| 349 | |
| 350 | <hr/> |
| 351 | |
| 352 | <footer> |
| 353 | <div class="container-fluid"> |
| 354 | <div class="row span12">Copyright © 2017 |
| 355 | <a href="https://www.apache.org/">The Apache Software Foundation</a>. |
| 356 | All Rights Reserved. |
| 357 | |
| 358 | </div> |
| 359 | |
| 361 | <div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache |
| 362 | feather logo, and the Apache AsterixDB project logo are either |
| 363 | registered trademarks or trademarks of The Apache Software |
| 364 | Foundation in the United States and other countries. |
| 365 | All other marks mentioned may be trademarks or registered |
| 366 | trademarks of their respective owners.</div> |
| 367 | |
| 368 | |
| 369 | </div> |
| 370 | </footer> |
| 371 | </body> |
| 372 | </html> |