<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia Site Renderer 1.8.1 from target/generated-site/markdown/feeds.md at 2020-08-07
4 | Rendered using Apache Maven Fluido Skin 1.7
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20200807" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; Data Ingestion with Feeds</title>
13 <link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
14 <link rel="stylesheet" href="./css/site.css" />
15 <link rel="stylesheet" href="./css/print.css" media="print" />
16 <script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>
17
18 </head>
19 <body class="topBarDisabled">
20 <div class="container-fluid">
21 <div id="banner">
22 <div class="pull-left"><a href="./" id="bannerLeft"><img src="images/asterixlogo.png" alt="AsterixDB"/></a></div>
23 <div class="pull-right"></div>
24 <div class="clear"><hr/></div>
25 </div>
26
27 <div id="breadcrumbs">
28 <ul class="breadcrumb">
29 <li id="publishDate">Last Published: 2020-08-07</li>
30 <li id="projectVersion" class="pull-right">Version: 0.9.5</li>
31 <li class="pull-right"><a href="index.html" title="Documentation Home">Documentation Home</a></li>
32 </ul>
33 </div>
34 <div class="row-fluid">
35 <div id="leftColumn" class="span2">
36 <div class="well sidebar-nav">
37 <ul class="nav nav-list">
38 <li class="nav-header">Get Started - Installation</li>
39 <li><a href="ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
40 <li><a href="ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
41 <li><a href="aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
42 <li class="nav-header">AsterixDB Primer</li>
43 <li><a href="sqlpp/primer-sqlpp.html" title="Using SQL++"><span class="none"></span>Using SQL++</a></li>
44 <li class="nav-header">Data Model</li>
45 <li><a href="datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
46 <li class="nav-header">Queries</li>
47 <li><a href="sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
48 <li><a href="sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
49 <li class="nav-header">API/SDK</li>
50 <li><a href="api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
51 <li><a href="csv.html" title="CSV Output"><span class="none"></span>CSV Output</a></li>
52 <li class="nav-header">Advanced Features</li>
53 <li><a href="aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
54 <li class="active"><a href="#"><span class="none"></span>Data Ingestion with Feeds</a></li>
55 <li><a href="udf.html" title="User Defined Functions"><span class="none"></span>User Defined Functions</a></li>
56 <li><a href="sqlpp/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
57 <li><a href="sqlpp/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
58 <li><a href="sqlpp/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
59 <li class="nav-header">Deprecated</li>
60 <li><a href="aql/primer.html" title="AsterixDB Primer: Using AQL"><span class="none"></span>AsterixDB Primer: Using AQL</a></li>
61 <li><a href="aql/manual.html" title="Queries: The Asterix Query Language (AQL)"><span class="none"></span>Queries: The Asterix Query Language (AQL)</a></li>
62 <li><a href="aql/builtins.html" title="Queries: Builtin Functions (AQL)"><span class="none"></span>Queries: Builtin Functions (AQL)</a></li>
63</ul>
64 <hr />
65 <div id="poweredBy">
66 <div class="clear"></div>
67 <div class="clear"></div>
68 <div class="clear"></div>
69 <div class="clear"></div>
70<a href="./" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" /></a>
71 </div>
72 </div>
73 </div>
74 <div id="bodyColumn" class="span10" >
75<!--
76 ! Licensed to the Apache Software Foundation (ASF) under one
77 ! or more contributor license agreements. See the NOTICE file
78 ! distributed with this work for additional information
79 ! regarding copyright ownership. The ASF licenses this file
80 ! to you under the Apache License, Version 2.0 (the
81 ! "License"); you may not use this file except in compliance
82 ! with the License. You may obtain a copy of the License at
83 !
84 ! http://www.apache.org/licenses/LICENSE-2.0
85 !
86 ! Unless required by applicable law or agreed to in writing,
87 ! software distributed under the License is distributed on an
88 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
89 ! KIND, either express or implied. See the License for the
90 ! specific language governing permissions and limitations
91 ! under the License.
92 !-->
<h1>Data Ingestion with Feeds</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>
<ul>

<li><a href="#Introduction">Introduction</a></li>
<li><a href="#FeedAdapters">Feed Adapters</a></li>
<li><a href="#FeedPolicies">Feed Policies</a></li>
</ul></div>
<div class="section">
<h2><a name="Introduction">Introduction</a></h2>
<p>This document describes the support for data ingestion in AsterixDB. A data feed is a mechanism for having data arrive continuously into a Big Data Management System (BDMS) from external sources and incrementally populate a persisted dataset and its associated indexes. The data feed is a built-in BDMS architectural component: it makes the Big Data system the caretaker for ingestion functionality that used to live outside it, with the aim of improving both users&#x2019; lives and system performance.</p></div>
<div class="section">
<h2><a name="Feed_Adapters"></a><a name="FeedAdapters">Feed Adapters</a></h2>
<p>The functionality of establishing a connection with a data source and receiving, parsing, and translating its data into ADM objects (for storage inside AsterixDB) is contained in a feed adapter. A feed adapter is an implementation of an interface, and its details are specific to a given data source. An adapter may optionally be given parameters to configure its runtime behavior. Depending upon the data transfer protocol/APIs offered by the data source, a feed adapter may operate in a push or a pull mode. Push mode involves just one initial request by the adapter to the data source for setting up the connection. Once a connection is authorized, the data source &#x201c;pushes&#x201d; data to the adapter without any subsequent requests by the adapter. In contrast, when operating in pull mode, the adapter makes a separate request each time it wants to receive data. AsterixDB currently provides built-in adapters for several popular data sources such as Twitter and RSS feeds. AsterixDB additionally provides a generic socket-based adapter that can be used to ingest data that is directed at a prescribed socket.</p>
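<p>The difference between the two modes can be sketched in a few lines of Python. This is an illustration only and not AsterixDB&#x2019;s adapter interface: <tt>FakeSource</tt>, <tt>pull_adapter</tt>, and <tt>push_adapter</tt> are hypothetical names, and an in-memory list stands in for the external data source.</p>

```python
from typing import Callable, List, Optional


class FakeSource:
    """Hypothetical in-memory stand-in for an external data source."""

    def __init__(self, records: List[str]) -> None:
        self._records = list(records)
        self.requests = 0  # number of requests adapters have made so far

    def fetch_one(self) -> Optional[str]:
        """Pull-style API: every call counts as a separate request."""
        self.requests += 1
        return self._records.pop(0) if self._records else None

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """Push-style API: one request, then the source drives delivery."""
        self.requests += 1
        while self._records:
            callback(self._records.pop(0))


def pull_adapter(source: FakeSource) -> List[str]:
    """Ask for data repeatedly until the source has nothing left."""
    received: List[str] = []
    while True:
        record = source.fetch_one()
        if record is None:
            break
        received.append(record)
    return received


def push_adapter(source: FakeSource) -> List[str]:
    """Register once; the source then pushes every record to the callback."""
    received: List[str] = []
    source.subscribe(received.append)
    return received
```

<p>With three records in the source, the pull sketch issues four requests (three that return data and one that finds the source empty), while the push sketch issues a single request.</p>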
<p>In this tutorial, we describe building example data ingestion pipelines that cover the popular scenarios of ingesting data from (a) Twitter, (b) a socket, and (c) the local file system.</p>
<div class="section">
<div class="section">
<h4><a name="Ingesting_Twitter_Stream"></a>Ingesting Twitter Stream</h4>
<p>We shall use the built-in push-based Twitter adapter. As a prerequisite, we must define a Tweet using the AsterixDB Data Model (ADM) and the query language SQL++. Given below are the type definitions in SQL++ that create a Tweet datatype that is representative of a real tweet as obtained from Twitter.</p>

<div>
<div>
<pre class="source">drop dataverse feeds if exists;

create dataverse feeds;
use feeds;

create type TwitterUser as closed {
    screen_name: string,
    lang: string,
    friends_count: int32,
    statuses_count: int32
};

create type Tweet as open {
    id: int64,
    user: TwitterUser
};

create dataset Tweets (Tweet) primary key id;
</pre></div></div>

<p>The last statement above creates the dataset that we shall use to persist the tweets in AsterixDB. Next, we use the <tt>create feed</tt> SQL++ statement to define our example data feed.</p>
<div class="section">
<h5><a name="Using_the_.E2.80.9Cpush_twitter.E2.80.9D_feed_adapter"></a>Using the &#x201c;push_twitter&#x201d; feed adapter</h5>
<p>The &#x201c;push_twitter&#x201d; adapter requires an application that is registered with Twitter. Registration involves providing a name and a brief description for the application. Each application has associated OAuth authentication credentials that include OAuth keys and tokens. Accessing the Twitter API requires providing the following:</p>
<ol style="list-style-type: decimal">

<li>Consumer Key (API Key)</li>
<li>Consumer Secret (API Secret)</li>
<li>Access Token</li>
<li>Access Token Secret</li>
</ol>
<p>The &#x201c;push_twitter&#x201d; adapter takes the above-mentioned parameters as its configuration. End users are required to obtain these authentication credentials prior to using the &#x201c;push_twitter&#x201d; adapter. For further information on obtaining OAuth keys and tokens and registering an application with Twitter, please visit <a class="externalLink" href="http://apps.twitter.com">http://apps.twitter.com</a>.</p>
<p>Note that AsterixDB uses the Twitter4J API for getting data from Twitter. Due to a license conflict, Apache AsterixDB cannot ship the Twitter4J library. To use the Twitter adapter in AsterixDB, please download the necessary dependencies (<tt>twitter4j-core-4.0.x.jar</tt> and <tt>twitter4j-stream-4.0.x.jar</tt>) and drop them into the <tt>repo/</tt> directory before AsterixDB starts.</p>
<p>Given below is an example SQL++ statement that creates a feed called &#x201c;TwitterFeed&#x201d; by using the &#x201c;push_twitter&#x201d; adapter.</p>

<div>
<div>
<pre class="source">use feeds;

create feed TwitterFeed with {
    &quot;adapter-name&quot;: &quot;push_twitter&quot;,
    &quot;type-name&quot;: &quot;Tweet&quot;,
    &quot;format&quot;: &quot;twitter-status&quot;,
    &quot;consumer.key&quot;: &quot;************&quot;,
    &quot;consumer.secret&quot;: &quot;************&quot;,
    &quot;access.token&quot;: &quot;**********&quot;,
    &quot;access.token.secret&quot;: &quot;*************&quot;
};
</pre></div></div>

<p>The authentication parameters above must be valid. Note that the <tt>create feed</tt> statement does not initiate the flow of data from Twitter into the AsterixDB instance. Instead, it only registers the feed with the instance. The flow of data along a feed is initiated when the feed is connected to a target dataset using the <tt>connect feed</tt> statement and activated using the <tt>start feed</tt> statement.</p>
<p>The Twitter adapter also supports several Twitter streaming API request parameters, as follows:</p>
<ol style="list-style-type: decimal">

<li>Track filter <tt>&quot;keywords&quot;: &quot;AsterixDB, Apache&quot;</tt></li>
<li>Locations filter <tt>&quot;locations&quot;: &quot;-29.7, 79.2, 36.7, 72.0; -124.848974,-66.885444, 24.396308, 49.384358&quot;</tt></li>
<li>Language filter <tt>&quot;language&quot;: &quot;en&quot;</tt></li>
<li>Filter level <tt>&quot;filter-level&quot;: &quot;low&quot;</tt></li>
</ol>
<p>For example, a Twitter feed that tracks tweets containing the keyword &#x201c;news&#x201d; can be defined using the following DDL:</p>

<div>
<div>
<pre class="source">use feeds;

create feed TwitterFeed with {
    &quot;adapter-name&quot;: &quot;push_twitter&quot;,
    &quot;type-name&quot;: &quot;Tweet&quot;,
    &quot;format&quot;: &quot;twitter-status&quot;,
    &quot;consumer.key&quot;: &quot;************&quot;,
    &quot;consumer.secret&quot;: &quot;************&quot;,
    &quot;access.token&quot;: &quot;**********&quot;,
    &quot;access.token.secret&quot;: &quot;*************&quot;,
    &quot;keywords&quot;: &quot;news&quot;
};
</pre></div></div>
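<p>When the same feed definition is needed with different filter values, the statement text can be generated rather than written by hand. The following is a minimal sketch under stated assumptions: <tt>create_feed_statement</tt> is a hypothetical helper and not part of AsterixDB, it assumes that every parameter value is a plain string for which JSON quoting is adequate, and it does not validate the feed name. The credential values are placeholders.</p>

```python
import json
from typing import Dict


def create_feed_statement(feed_name: str, params: Dict[str, str]) -> str:
    """Render a `create feed ... with {...};` statement from a parameter dict."""
    # json.dumps produces the double-quoted "key": "value" pairs used in the
    # examples above; this is assumed to be sufficient for simple string values.
    body = json.dumps(params, indent=4)
    return f"create feed {feed_name} with {body};"


# Placeholder credentials; real values come from the Twitter application.
params = {
    "adapter-name": "push_twitter",
    "type-name": "Tweet",
    "format": "twitter-status",
    "consumer.key": "************",
    "consumer.secret": "************",
    "access.token": "**********",
    "access.token.secret": "*************",
    "keywords": "news",  # track filter from the list above
}

statement = create_feed_statement("TwitterFeed", params)
print(statement)
```

<p>The generated text has the same shape as the DDL above and would be submitted after a <tt>use feeds;</tt> statement.</p>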

<p>For more details about these APIs, please visit <a class="externalLink" href="https://dev.twitter.com/streaming/overview/request-parameters">https://dev.twitter.com/streaming/overview/request-parameters</a>.</p></div></div>
<div class="section">
<h4><a name="Lifecycle_of_a_Feed"></a>Lifecycle of a Feed</h4>
<p>A feed is a logical artifact that is brought to life (i.e., its data flow is initiated) only when it is activated using the <tt>start feed</tt> statement. Before we activate a feed, we need to designate the dataset where the data is to be persisted, using the <tt>connect feed</tt> statement. Subsequent to a <tt>connect feed</tt> statement, the feed is said to be in the connected state. After that, the <tt>start feed</tt> statement activates the feed and starts the dataflow from the feed to its connected dataset. Multiple feeds can simultaneously be connected to a dataset, such that the contents of the dataset represent the union of the connected feeds. Also, one feed can be simultaneously connected to multiple target datasets.</p>

<div>
<div>
<pre class="source">use feeds;

connect feed TwitterFeed to dataset Tweets;

start feed TwitterFeed;
</pre></div></div>

<p>The <tt>connect feed</tt> statement above directs AsterixDB to persist the data from the <tt>TwitterFeed</tt> feed into the <tt>Tweets</tt> dataset. The <tt>start feed</tt> statement activates the feed and starts the dataflow. If it is required (by the high-level application) to also retain the raw tweets obtained from Twitter, the end user may additionally choose to connect <tt>TwitterFeed</tt> to a different dataset.</p>
<p>Let the feed run for a minute, then run the following query to see some of the tweets that have been stored in the dataset.</p>

<div>
<div>
<pre class="source">use feeds;

select * from Tweets limit 10;
</pre></div></div>

<p>The dataflow from a feed can be terminated explicitly with the <tt>stop feed</tt> statement.</p>

<div>
<div>
<pre class="source">use feeds;

stop feed TwitterFeed;
</pre></div></div>

<p>The <tt>disconnect feed</tt> statement can be used to disconnect the feed from a given dataset.</p>

<div>
<div>
<pre class="source">use feeds;

disconnect feed TwitterFeed from dataset Tweets;
</pre></div></div>
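<p>The ordering of the four statements can be summarized with a toy model. This sketch only mirrors the description in this section and does not reflect the checks AsterixDB actually performs: <tt>FeedLifecycle</tt> is a hypothetical class, and the rule that <tt>start</tt> fails without a connected dataset is an assumption drawn from the text above.</p>

```python
from typing import Set


class FeedLifecycle:
    """Toy model of the statement order: connect, start, stop, disconnect."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.datasets: Set[str] = set()  # targets named in `connect feed`
        self.active = False  # True between `start feed` and `stop feed`

    def connect(self, dataset: str) -> None:
        # One feed may be connected to several target datasets.
        self.datasets.add(dataset)

    def start(self) -> None:
        # Assumption: a feed needs at least one connected dataset to start.
        if not self.datasets:
            raise RuntimeError(f"feed {self.name} has no connected dataset")
        self.active = True

    def stop(self) -> None:
        self.active = False

    def disconnect(self, dataset: str) -> None:
        self.datasets.discard(dataset)


feed = FeedLifecycle("TwitterFeed")
feed.connect("Tweets")      # connect feed TwitterFeed to dataset Tweets;
feed.start()                # start feed TwitterFeed;
feed.stop()                 # stop feed TwitterFeed;
feed.disconnect("Tweets")   # disconnect feed TwitterFeed from dataset Tweets;
```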
</div></div>
<div class="section">
<h3><a name="Ingesting_with_Other_Adapters"></a>Ingesting with Other Adapters</h3>
<p>AsterixDB has several built-in feed adapters for data ingestion. Users can also implement their own adapters and plug them into AsterixDB. Here we introduce the <tt>socket_adapter</tt> and <tt>localfs</tt> feed adapters, which cover most of the common application scenarios.</p>
<div class="section">
<div class="section">
<h5><a name="Using_the_.E2.80.9Csocket_adapter.E2.80.9D_feed_adapter"></a>Using the &#x201c;socket_adapter&#x201d; feed adapter</h5>
<p>The <tt>socket_adapter</tt> feed opens a socket on the given node, which allows users to push data into AsterixDB directly. Here is an example:</p>

<div>
<div>
<pre class="source">drop dataverse feeds if exists;
create dataverse feeds;
use feeds;

create type TestDataType as open {
    screenName: string
};

create dataset TestDataset(TestDataType) primary key screenName;

create feed TestSocketFeed with {
    &quot;adapter-name&quot;: &quot;socket_adapter&quot;,
    &quot;sockets&quot;: &quot;127.0.0.1:10001&quot;,
    &quot;address-type&quot;: &quot;IP&quot;,
    &quot;type-name&quot;: &quot;TestDataType&quot;,
    &quot;format&quot;: &quot;adm&quot;
};

connect feed TestSocketFeed to dataset TestDataset;

use feeds;
start feed TestSocketFeed;
</pre></div></div>

<p>The above statements create a socket feed that listens on port 10001 of the host machine. This feed accepts data records in &#x201c;adm&#x201d; format. As an example, you can download the sample dataset <a href="../data/chu.adm">Chirp Users</a> and push its records line by line into the socket feed using any socket client you like. The following is a socket client example in Python:</p>

<div>
<div>
<pre class="source">from socket import socket

ip = '127.0.0.1'
port1 = 10001
filePath = 'chu.adm'

# Connect to the socket opened by the feed.
sock1 = socket()
sock1.connect((ip, port1))

# Read in binary mode so that each line is bytes, as sendall() requires in Python 3.
with open(filePath, 'rb') as inputData:
    for line in inputData:
        sock1.sendall(line)
sock1.close()
</pre></div></div>
</div></div>
<div class="section">
<h4><a name="Using_the_.E2.80.9Clocalfs.E2.80.9D_feed_adapter"></a>Using the &#x201c;localfs&#x201d; feed adapter</h4>
<p>The <tt>localfs</tt> adapter enables data ingestion from the local file system. It allows users to feed data records on local disk into a dataset. A DDL example for creating a <tt>localfs</tt> feed is given as follows:</p>

<div>
<div>
<pre class="source">use feeds;

create type TestDataType as open {
    screenName: string
};

create dataset TestDataset(TestDataType) primary key screenName;

create feed TestFileFeed with {
    &quot;adapter-name&quot;: &quot;localfs&quot;,
    &quot;type-name&quot;: &quot;TestDataType&quot;,
    &quot;path&quot;: &quot;HOSTNAME://LOCAL_FILE_PATH&quot;,
    &quot;format&quot;: &quot;adm&quot;
};

connect feed TestFileFeed to dataset TestDataset;

start feed TestFileFeed;
</pre></div></div>

<p>As in the previous examples, we need to define the datatype and dataset this feed uses. The &#x201c;path&#x201d; parameter refers to the local data file that we want to ingest data from. <tt>HOSTNAME</tt> can be either the IP address or the node name of the machine that holds the file. <tt>LOCAL_FILE_PATH</tt> is the absolute path to the file on that machine. Like <tt>socket_adapter</tt>, this feed takes <tt>adm</tt>-formatted data records.</p></div></div>
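<p>Because <tt>LOCAL_FILE_PATH</tt> is absolute, the assembled &#x201c;path&#x201d; value contains three consecutive slashes, which is easy to get wrong by hand. The sketch below builds the value; <tt>localfs_path</tt> is a hypothetical helper, the host and file names are made up for illustration, and POSIX-style paths are assumed.</p>

```python
from pathlib import PurePosixPath


def localfs_path(hostname: str, local_file_path: str) -> str:
    """Build the "path" value in the HOSTNAME://LOCAL_FILE_PATH form."""
    # The text above requires an absolute path on the machine holding the file.
    if not PurePosixPath(local_file_path).is_absolute():
        raise ValueError("LOCAL_FILE_PATH must be an absolute path")
    return f"{hostname}://{local_file_path}"


# Hypothetical host and file location.
print(localfs_path("127.0.0.1", "/data/chu.adm"))  # 127.0.0.1:///data/chu.adm
```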
<div class="section">
<h3><a name="Datatype_for_feed_and_target_dataset"></a>Datatype for feed and target dataset</h3>
<p>The &#x201c;type-name&#x201d; parameter in the <tt>create feed</tt> statement defines the <tt>datatype</tt> of the data source. In most use cases, a feed will have the same <tt>datatype</tt> as the target dataset. However, if we want to preprocess the data records before they get into the target dataset (append an autogenerated key, apply user-defined functions, etc.), we need to define the datatypes for the feed and the dataset separately.</p>
<div class="section">
<h4><a name="Ingestion_with_autogenerated_key"></a>Ingestion with autogenerated key</h4>
<p>AsterixDB supports using an autogenerated uuid as the primary key for a dataset. When we use this feature, we need to define a datatype with the primary key field and specify that field to be autogenerated when creating the dataset. Using that same datatype in the feed definition would cause a type discrepancy, since there is no such field in the data source. Thus, we need to define two separate datatypes for the feed and the dataset:</p>

<div>
<div>
<pre class="source">use feeds;

create type DBLPFeedType as closed {
    dblpid: string,
    title: string,
    authors: string,
    misc: string
};

create type DBLPDataSetType as open {
    id: uuid,
    dblpid: string,
    title: string,
    authors: string,
    misc: string
};

create dataset DBLPDataset(DBLPDataSetType) primary key id autogenerated;

create feed DBLPFeed with {
    &quot;adapter-name&quot;: &quot;socket_adapter&quot;,
    &quot;sockets&quot;: &quot;127.0.0.1:10001&quot;,
    &quot;address-type&quot;: &quot;IP&quot;,
    &quot;type-name&quot;: &quot;DBLPFeedType&quot;,
    &quot;format&quot;: &quot;adm&quot;
};

connect feed DBLPFeed to dataset DBLPDataset;

start feed DBLPFeed;
</pre></div></div>
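<p>The relationship between the two datatypes can be illustrated outside the database. In this sketch, a record shaped like <tt>DBLPFeedType</tt> gains an <tt>id</tt> field and thereby takes the shape of <tt>DBLPDataSetType</tt>. The function <tt>to_dataset_record</tt> is hypothetical, and Python&#x2019;s <tt>uuid4()</tt> merely stands in for AsterixDB&#x2019;s own key generation, which may work differently.</p>

```python
import uuid
from typing import Dict

# Field names of DBLPFeedType, as defined in the DDL above.
FEED_FIELDS = ("dblpid", "title", "authors", "misc")


def to_dataset_record(feed_record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of a feed record with a generated `id` field added."""
    missing = [name for name in FEED_FIELDS if name not in feed_record]
    if missing:
        raise ValueError(f"feed record lacks fields: {missing}")
    # The source data has no `id`; it exists only on the dataset side.
    return {"id": str(uuid.uuid4()), **feed_record}


# Made-up sample record for illustration.
incoming = {"dblpid": "x/1", "title": "A title", "authors": "A. Author", "misc": ""}
stored = to_dataset_record(incoming)
print(sorted(stored))  # ['authors', 'dblpid', 'id', 'misc', 'title']
```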
</div></div></div>
<div class="section">
<h2><a name="Policies_for_Feed_Ingestion"></a><a name="FeedPolicies">Policies for Feed Ingestion</a></h2>
<p>Multiple feeds may be concurrently operational on an AsterixDB cluster, each competing for resources (CPU cycles, network bandwidth, disk IO) to keep pace with its data source. As a data management system, AsterixDB is able to manage a set of concurrent feeds and make dynamic decisions related to the allocation of resources, the resolution of resource bottlenecks, and the handling of failures. Each feed has its own set of constraints, influenced largely by the nature of its data source and the applications that intend to consume and process the ingested data. Consider an application that intends to discover the trending topics on Twitter by analyzing tweets that are being processed. Losing a few tweets may be acceptable. In contrast, when ingesting from a data source that provides a click-stream of ad clicks, losing data would translate to a loss of revenue for an application that tracks revenue by charging advertisers per click.</p>
<p>AsterixDB allows a data feed to have an associated ingestion policy that is expressed as a collection of parameters and associated values. An ingestion policy dictates the runtime behavior of the feed in response to resource bottlenecks and failures. AsterixDB provides a set of policies that help customize the system&#x2019;s runtime behavior when handling excess objects.</p>
<div class="section">
<div class="section">
<h4><a name="Policies"></a>Policies</h4>
<ul>

<li>

<p><i>Spill</i>: Objects that cannot be processed by an operator for lack of resources (referred to as excess objects hereafter) should be persisted to the local disk for deferred processing.</p>
</li>
<li>

<p><i>Discard</i>: Excess objects should be discarded.</p>
</li>
</ul>
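<p>The effect of the two policies on a burst of records can be sketched as follows. This is a simplified model and not AsterixDB&#x2019;s implementation: <tt>ingest</tt> and its <tt>capacity</tt> parameter are hypothetical, and an in-memory list stands in for the local disk used by <i>Spill</i>.</p>

```python
from typing import List, Tuple


def ingest(records: List[str], capacity: int,
           policy: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (processed, spilled, discarded) for one burst of records.

    `capacity` is how many records the operator can handle right away;
    everything beyond it is treated as an excess object.
    """
    processed = records[:capacity]
    excess = records[capacity:]
    if policy == "Spill":
        # Kept for deferred processing (on local disk in the real system).
        return processed, excess, []
    if policy == "Discard":
        # Dropped for good; acceptable only when some loss is tolerable.
        return processed, [], excess
    raise ValueError(f"unknown policy: {policy}")


burst = ["t1", "t2", "t3"]
print(ingest(burst, 2, "Spill"))    # (['t1', 't2'], ['t3'], [])
print(ingest(burst, 2, "Discard"))  # (['t1', 't2'], [], ['t3'])
```

<p>In terms of the earlier examples, a trending-topics application could tolerate <i>Discard</i>, whereas a per-click billing application would need the excess records to survive, as they do under <i>Spill</i>.</p>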
<p>Note that the end user may choose to form a custom policy. For example, it is possible in AsterixDB to create a custom policy that spills excess objects to disk and subsequently resorts to throttling if the spillage crosses a configured threshold. In all cases, the desired ingestion policy is specified as part of the <tt>connect feed</tt> statement; if none is specified, the &#x201c;Basic&#x201d; policy is used as the default.</p>

<div>
<div>
<pre class="source">use feeds;

connect feed TwitterFeed to dataset Tweets using policy Basic;
</pre></div></div></div></div></div>
398 </div>
399 </div>
400 </div>
401 <hr/>
402 <footer>
403 <div class="container-fluid">
404 <div class="row-fluid">
405<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
406 feather logo, and the Apache AsterixDB project logo are either
407 registered trademarks or trademarks of The Apache Software
408 Foundation in the United States and other countries.
409 All other marks mentioned may be trademarks or registered
410 trademarks of their respective owners.
411 </div>
412 </div>
413 </div>
414 </footer>
415 </body>
416</html>