<!DOCTYPE html>
<!--
 | Generated by Apache Maven Doxia at 2017-01-25
 | Rendered using Apache Maven Fluido Skin 1.3.0
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="Date-Revision-yyyymmdd" content="20170125" />
    <meta http-equiv="Content-Language" content="en" />
    <title>AsterixDB &#x2013; Support for Data Ingestion in AsterixDB</title>
    <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
    <link rel="stylesheet" href="../css/site.css" />
    <link rel="stylesheet" href="../css/print.css" media="print" />
    <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
  </head>
  <body class="topBarDisabled">
    <div class="container-fluid">
      <div id="banner">
        <div class="pull-left">
          <a href=".././" id="bannerLeft">
            <img src="../images/asterixlogo.png" alt="AsterixDB"/>
          </a>
        </div>
        <div class="pull-right"> </div>
        <div class="clear"><hr/></div>
      </div>

      <div id="breadcrumbs">
        <ul class="breadcrumb">
          <li id="publishDate">Last Published: 2017-01-25</li>
          <li id="projectVersion" class="pull-right">Version: 0.9.0</li>
          <li class="divider pull-right">|</li>
          <li class="pull-right"> <a href="../index.html" title="Documentation Home">
            Documentation Home</a>
          </li>
        </ul>
      </div>

      <div class="row-fluid">
        <div id="leftColumn" class="span3">
          <div class="well sidebar-nav">
            <ul class="nav nav-list">
              <li class="nav-header">Get Started - Installation</li>
              <li>
                <a href="../ncservice.html" title="Option 1: using NCService">
                  <i class="none"></i>
                  Option 1: using NCService</a>
              </li>
              <li>
                <a href="../install.html" title="Option 2: using Managix">
                  <i class="none"></i>
                  Option 2: using Managix</a>
              </li>
              <li>
                <a href="../yarn.html" title="Option 3: using YARN">
                  <i class="none"></i>
                  Option 3: using YARN</a>
              </li>
              <li class="nav-header">AsterixDB Primer</li>
              <li>
                <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
                  <i class="none"></i>
                  Option 1: using SQL++</a>
              </li>
              <li>
                <a href="../aql/primer.html" title="Option 2: using AQL">
                  <i class="none"></i>
                  Option 2: using AQL</a>
              </li>
              <li class="nav-header">Data Model</li>
              <li>
                <a href="../datamodel.html" title="The Asterix Data Model">
                  <i class="none"></i>
                  The Asterix Data Model</a>
              </li>
              <li class="nav-header">Queries - SQL++</li>
              <li>
                <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
                  <i class="none"></i>
                  The SQL++ Query Language</a>
              </li>
              <li>
                <a href="../sqlpp/builtins.html" title="Builtin Functions">
                  <i class="none"></i>
                  Builtin Functions</a>
              </li>
              <li class="nav-header">Queries - AQL</li>
              <li>
                <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
                  <i class="none"></i>
                  The Asterix Query Language (AQL)</a>
              </li>
              <li>
                <a href="../aql/builtins.html" title="Builtin Functions">
                  <i class="none"></i>
                  Builtin Functions</a>
              </li>
              <li class="nav-header">Advanced Features</li>
              <li>
                <a href="../aql/similarity.html" title="Support of Similarity Queries">
                  <i class="none"></i>
                  Support of Similarity Queries</a>
              </li>
              <li>
                <a href="../aql/fulltext.html" title="Support of Full-text Queries">
                  <i class="none"></i>
                  Support of Full-text Queries</a>
              </li>
              <li>
                <a href="../aql/externaldata.html" title="Accessing External Data">
                  <i class="none"></i>
                  Accessing External Data</a>
              </li>
              <li class="active">
                <a href="#"><i class="none"></i>Support for Data Ingestion</a>
              </li>
              <li>
                <a href="../udf.html" title="User Defined Functions">
                  <i class="none"></i>
                  User Defined Functions</a>
              </li>
              <li>
                <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
                  <i class="none"></i>
                  Filter-Based LSM Index Acceleration</a>
              </li>
              <li class="nav-header">API/SDK</li>
              <li>
                <a href="../api.html" title="HTTP API">
                  <i class="none"></i>
                  HTTP API</a>
              </li>
            </ul>

            <hr class="divider" />

            <div id="poweredBy">
              <div class="clear"></div>
              <div class="clear"></div>
              <div class="clear"></div>
              <a href=".././" title="AsterixDB" class="builtBy">
                <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
              </a>
            </div>
          </div>
        </div>

        <div id="bodyColumn" class="span9" >

          <!-- ! Licensed to the Apache Software Foundation (ASF) under one
               ! or more contributor license agreements. See the NOTICE file
               ! distributed with this work for additional information
               ! regarding copyright ownership. The ASF licenses this file
               ! to you under the Apache License, Version 2.0 (the
               ! "License"); you may not use this file except in compliance
               ! with the License. You may obtain a copy of the License at
               !
               !   http://www.apache.org/licenses/LICENSE-2.0
               !
               ! Unless required by applicable law or agreed to in writing,
               ! software distributed under the License is distributed on an
               ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
               ! KIND, either express or implied. See the License for the
               ! specific language governing permissions and limitations
               ! under the License.
               ! --><h1>Support for Data Ingestion in AsterixDB</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="atoc" id="toc">Table of Contents</a></h2>

<ul>

<li><a href="#Introduction">Introduction</a></li>

<li><a href="#FeedAdaptors">Feed Adaptors</a></li>

<li><a href="#FeedPolicies">Feed Policies</a></li>
</ul></div>
<div class="section">
<h2><a name="Introduction">Introduction</a></h2>
<p>In this document, we describe the support for data ingestion in AsterixDB. Data feeds are a mechanism for having continuous data arrive into a Big Data Management System (BDMS) from external sources and incrementally populate a persisted dataset and its associated indexes. The data feed is an architectural component that makes the Big Data system itself the caretaker of ingestion functionality that traditionally lived outside it, simplifying users&#x2019; lives and improving system performance.</p></div>
<div class="section">
<h2><a name="Feed_Adaptors"></a><a name="FeedAdaptors">Feed Adaptors</a></h2>
<p>The functionality of establishing a connection with a data source and receiving, parsing, and translating its data into ADM objects (for storage inside AsterixDB) is contained in a feed adaptor. A feed adaptor is an implementation of an interface whose details are specific to a given data source. An adaptor may optionally be given parameters to configure its runtime behavior. Depending upon the data transfer protocol/APIs offered by the data source, a feed adaptor may operate in a push or a pull mode. Push mode involves just one initial request by the adaptor to the data source to set up the connection. Once the connection is authorized, the data source &#x201c;pushes&#x201d; data to the adaptor without any subsequent requests by the adaptor. In contrast, when operating in pull mode, the adaptor makes a separate request each time to receive data. AsterixDB currently provides built-in adaptors for several popular data sources such as Twitter, CNN, and RSS feeds. AsterixDB additionally provides a generic socket-based adaptor that can be used to ingest data that is directed at a prescribed socket.</p>
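<p>As a rough sketch of what a socket-based feed definition might look like, consider the statement below. The adaptor name <tt>socket_adapter</tt>, its parameter names, and the datatype <tt>MyDataType</tt> are illustrative assumptions rather than verbatim documentation; consult the feed adaptor documentation of your release for the exact names.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    /* Hypothetical sketch: the adaptor and parameter names are assumptions,
       and MyDataType stands for a previously defined datatype. */
    create feed SocketFeed using socket_adapter (
        (&quot;sockets&quot;=&quot;127.0.0.1:10001&quot;),
        (&quot;address-type&quot;=&quot;IP&quot;),
        (&quot;type-name&quot;=&quot;MyDataType&quot;),
        (&quot;format&quot;=&quot;adm&quot;)
    );
</pre></div></div>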
<p>In this tutorial, we shall describe how to build two example data ingestion pipelines that cover two popular scenarios: ingesting data from (a) Twitter and (b) an RSS feed.</p>
<div class="section">
<div class="section">
<h4><a name="Ingesting_Twitter_Stream"></a>Ingesting Twitter Stream</h4>
<p>We shall use the built-in push-based Twitter adaptor. As a prerequisite, we must define a Tweet using the AsterixDB Data Model (ADM) and the AsterixDB Query Language (AQL). Given below are the type definitions in AQL that create a Tweet datatype representative of a real tweet as obtained from Twitter.</p>

<div class="source">
<div class="source">
<pre>    create dataverse feeds;
    use dataverse feeds;

    create type TwitterUser as closed {
        screen_name: string,
        lang: string,
        friends_count: int32,
        statuses_count: int32
    };

    create type Tweet as open {
        id: int64,
        user: TwitterUser
    };

    create dataset Tweets (Tweet)
    primary key id;
</pre></div></div>
<p>The <tt>create dataset</tt> statement above creates the dataset that we shall use to persist the tweets in AsterixDB. Next, we make use of the <tt>create feed</tt> AQL statement to define our example data feed.</p>
<div class="section">
<h5><a name="Using_the_push_twitter_feed_adapter"></a>Using the &#x201c;push_twitter&#x201d; feed adapter</h5>
<p>The &#x201c;push_twitter&#x201d; adaptor requires setting up an application account with Twitter. To retrieve tweets, an application must be registered with Twitter. Registration involves providing a name and a brief description for the application. Each application has an associated OAuth authentication credential that includes OAuth keys and tokens. Accessing the Twitter API requires providing the following:</p>
<ol>
<li>Consumer Key (API Key)</li>
<li>Consumer Secret (API Secret)</li>
<li>Access Token</li>
<li>Access Token Secret</li>
</ol>
<p>The &#x201c;push_twitter&#x201d; adaptor takes the above-mentioned parameters as configuration. End users are required to obtain these authentication credentials prior to using the &#x201c;push_twitter&#x201d; adaptor. For further information on obtaining OAuth keys and tokens and registering an application with Twitter, please visit <a class="externalLink" href="http://apps.twitter.com">http://apps.twitter.com</a>.</p>
<p>Given below is an example AQL statement that creates a feed called &#x201c;TwitterFeed&#x201d; by using the &#x201c;push_twitter&#x201d; adaptor.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeed if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
     (&quot;format&quot;=&quot;twitter-status&quot;),
     (&quot;consumer.key&quot;=&quot;************&quot;),
     (&quot;consumer.secret&quot;=&quot;**************&quot;),
     (&quot;access.token&quot;=&quot;**********&quot;),
     (&quot;access.token.secret&quot;=&quot;*************&quot;));
</pre></div></div>
<p>Valid values must be provided for the above authentication parameters. Note that the <tt>create feed</tt> statement does not initiate the flow of data from Twitter into our AsterixDB instance. Instead, the <tt>create feed</tt> statement only results in registering the feed with AsterixDB. The flow of data along a feed is initiated when the feed is connected to a target dataset using the <tt>connect feed</tt> statement (which we shall revisit later).</p></div></div>
<div class="section">
<h4><a name="Lifecycle_of_a_Feed"></a>Lifecycle of a Feed</h4>
<p>A feed is a logical artifact that is brought to life (i.e., its data flow is initiated) only when it is connected to a dataset using the <tt>connect feed</tt> AQL statement. Subsequent to a <tt>connect feed</tt> statement, the feed is said to be in the connected state. Multiple feeds can simultaneously be connected to a dataset such that the contents of the dataset represent the union of the connected feeds. In a supported but unlikely scenario, one feed may also be simultaneously connected to different target datasets. Note that connecting a secondary feed does not require the parent feed (or any ancestor feed) to be in the connected state; the order in which feeds are connected to their respective datasets is not important. Furthermore, additional (secondary) feeds can be added to an existing hierarchy and connected to a dataset at any time without impeding/interrupting the flow of data along a connected ancestor feed.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets;
</pre></div></div>
<p>The <tt>connect feed</tt> statement above directs AsterixDB to persist the <tt>TwitterFeed</tt> feed in the <tt>Tweets</tt> dataset. If it is required (by the high-level application) to also retain the raw tweets obtained from Twitter, the end user may additionally choose to connect <tt>TwitterFeed</tt> to a different dataset, as sketched below.</p>
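<p>A minimal sketch of such a setup is shown below; the dataset name <tt>RawTweets</tt> is hypothetical and is introduced here only for illustration.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    /* RawTweets is a hypothetical second target dataset of the same type. */
    create dataset RawTweets (Tweet)
    primary key id;

    connect feed TwitterFeed to dataset RawTweets;
</pre></div></div>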
<p>Let the feed run for a minute, then run the following query to see the latest tweets that are stored in the dataset.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset Tweets limit 10 return $i;
</pre></div></div>
<p>The flow of data from a feed into a dataset can be terminated explicitly by use of the <tt>disconnect feed</tt> statement. Disconnecting a feed from a particular dataset does not interrupt the flow of data from the feed to any other dataset(s), nor does it impact other connected feeds in the lineage.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    disconnect feed TwitterFeed from dataset Tweets;
</pre></div></div></div>
<div class="section">
<h4><a name="Ingesting_an_RSS_Feed"></a>Ingesting an RSS Feed</h4>
<p>RSS (Rich Site Summary), originally RDF Site Summary and often called Really Simple Syndication, uses a family of standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video. An RSS document (called a &#x201c;feed&#x201d;, &#x201c;web feed&#x201d;, or &#x201c;channel&#x201d;) includes full or summarized text and metadata, such as the publishing date and the author&#x2019;s name. RSS feeds enable publishers to syndicate data automatically.</p>
<div class="section">
<h5><a name="Using_the_rss_feed_feed_adapter"></a>Using the &#x201c;rss_feed&#x201d; feed adapter</h5>
<p>AsterixDB provides a built-in feed adaptor that allows retrieving data given a collection of RSS endpoint URLs. As in the case of ingesting tweets, we must first model an RSS data item using AQL.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create type Rss if not exists as open {
        id: string,
        title: string,
        description: string,
        link: string
    };

    create dataset RssDataset (Rss)
    primary key id;
</pre></div></div>
<p>Next, we define an RSS feed using our built-in adaptor &#x201c;rss_feed&#x201d;.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed my_feed using rss_feed (
        (&quot;type-name&quot;=&quot;Rss&quot;),
        (&quot;format&quot;=&quot;rss&quot;),
        (&quot;url&quot;=&quot;http://rss.cnn.com/rss/edition.rss&quot;)
    );
</pre></div></div>
<p>In the above definition, the configuration parameter &#x201c;url&#x201d; can be a comma-separated list of RSS URLs, where each URL corresponds to an RSS endpoint or an RSS feed. The &#x201c;rss_feed&#x201d; adaptor retrieves data from each of the specified RSS URLs in parallel.</p>
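<p>For example, a feed definition with more than one endpoint might look like the sketch below; the feed name <tt>my_multi_feed</tt> and the second URL are illustrative placeholders only.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    /* Illustrative sketch: the second URL is a placeholder endpoint. */
    create feed my_multi_feed using rss_feed (
        (&quot;type-name&quot;=&quot;Rss&quot;),
        (&quot;format&quot;=&quot;rss&quot;),
        (&quot;url&quot;=&quot;http://rss.cnn.com/rss/edition.rss,http://rss.cnn.com/rss/edition_world.rss&quot;)
    );
</pre></div></div>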
<p>The following statements connect the feed to the <tt>RssDataset</tt> dataset:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed my_feed to dataset RssDataset;
</pre></div></div>
<p>The following statements show the latest data in the dataset and then disconnect the feed from the dataset.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset RssDataset limit 10 return $i;

    disconnect feed my_feed from dataset RssDataset;
</pre></div></div>
<p>AsterixDB also allows multiple feeds to be connected to form a cascade network that processes data in stages; a sketch of such a setup follows.</p></div></div></div></div>
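<p>As a rough sketch, and assuming that this release supports the secondary-feed syntax described in the feeds documentation, a feed that consumes another feed before persistence might be declared as follows. The feed name <tt>ProcessedTwitterFeed</tt> is hypothetical.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    /* Hypothetical sketch: a secondary feed that consumes TwitterFeed. */
    create secondary feed ProcessedTwitterFeed from feed TwitterFeed;

    connect feed ProcessedTwitterFeed to dataset Tweets;
</pre></div></div>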
<div class="section">
<h2><a name="Policies_for_Feed_Ingestion"></a><a name="FeedPolicies">Policies for Feed Ingestion</a></h2>
<p>Multiple feeds may be concurrently operational on an AsterixDB cluster, each competing for resources (CPU cycles, network bandwidth, disk I/O) to keep pace with its respective data source. As a data management system, AsterixDB is able to manage a set of concurrent feeds and make dynamic decisions related to the allocation of resources, resolving resource bottlenecks, and handling failures. Each feed has its own set of constraints, influenced largely by the nature of its data source and the applications that intend to consume and process the ingested data. Consider an application that intends to discover the trending topics on Twitter by analyzing tweets that are being processed. Losing a few tweets may be acceptable. In contrast, when ingesting from a data source that provides a click-stream of ad clicks, losing data would translate to a loss of revenue for an application that tracks revenue by charging advertisers per click.</p>
<p>AsterixDB allows a data feed to have an associated ingestion policy that is expressed as a collection of parameters and associated values. An ingestion policy dictates the runtime behavior of the feed in response to resource bottlenecks and failures. AsterixDB provides a list of policy parameters that help customize the system&#x2019;s runtime behavior when handling excess objects. AsterixDB provides a set of built-in policies, each constructed by setting appropriate value(s) for the policy parameter(s) from the list below.</p>
<div class="section">
<div class="section">
<h4><a name="Policy_Parameters"></a>Policy Parameters</h4>

<ul>

<li>
<p><i>excess.records.spill</i>: Set to true if objects that cannot be processed by an operator for lack of resources (referred to as excess objects hereafter) should be persisted to the local disk for deferred processing. (Default: false)</p></li>

<li>
<p><i>excess.records.discard</i>: Set to true if excess objects should be discarded. (Default: false)</p></li>

<li>
<p><i>excess.records.throttle</i>: Set to true if the rate of arrival of objects should be reduced in an adaptive manner to prevent having any excess objects. (Default: false)</p></li>

<li>
<p><i>excess.records.elastic</i>: Set to true if the system should attempt to resolve resource bottlenecks by re-structuring and/or rescheduling the feed ingestion pipeline. (Default: false)</p></li>

<li>
<p><i>recover.soft.failure</i>: Set to true if the feed must attempt to survive any runtime exception. A false value permits an early termination of a feed in such an event. (Default: true)</p></li>

<li>
<p><i>recover.hard.failure</i>: Set to true if the feed must attempt to survive a hardware failure (loss of one or more AsterixDB nodes). A false value permits the early termination of a feed in the event of a hardware failure. (Default: false)</p></li>
</ul>
<p>Note that the end user may choose to form a custom policy. For example, it is possible in AsterixDB to create a custom policy that spills excess objects to disk and subsequently resorts to throttling if the spillage crosses a configured threshold. In all cases, the desired ingestion policy is specified as part of the <tt>connect feed</tt> statement; otherwise the &#x201c;Basic&#x201d; policy is chosen as the default. It is worth noting that a feed can be connected to a dataset at any time, independent of other related feeds in the hierarchy.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets
    using policy Basic;
</pre></div></div></div></div></div>
        </div>
      </div>
    </div>

    <hr/>

    <footer>
      <div class="container-fluid">
        <div class="row span12">Copyright &copy; 2017
          <a href="https://www.apache.org/">The Apache Software Foundation</a>.
          All Rights Reserved.
        </div>

        <div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
          feather logo, and the Apache AsterixDB project logo are either
          registered trademarks or trademarks of The Apache Software
          Foundation in the United States and other countries.
          All other marks mentioned may be trademarks or registered
          trademarks of their respective owners.</div>

      </div>
    </footer>
  </body>
</html>