The Metamarkets Realtime Data Ingestion (RDI) platform receives and processes data in real time. In turn, the data you upload can be viewed in the Metamarkets dashboard in seconds. This document explains real-time ingestion in more detail, and shows you how to upload your data to RDI.
In the context of RDI, ingestion refers to the intake of data for the purpose of processing that data and surfacing it as information in the Metamarkets dashboard. Unlike systems limited to batch processing only, which ingest one portion of the data at a time, real-time ingestion can accept a stream of data. The stream is initiated and executed programmatically, using an HTTPS connection pool. This type of connection allows for concurrent and continuous uploads of data files while minimizing latency.
Before uploading data to RDI, be sure to understand what formats and data structures are supported. Your data should follow the standards and schema prescribed in the appropriate Metamarkets data formatting guide.
With RDI, you continually deliver data to an HTTPS endpoint (using POST operations), where it is immediately prepared for display and querying on your Metamarkets dashboard. RDI accepts data on the basis of event timestamp; properly formatted data that does not meet on-time delivery requirements is processed at a later time (see the troubleshooting section for more information).
Use the following as a checklist before getting started with uploading data:
Data files should be in plain-text JSON format with one record per line, each record separated from the next by a newline character. Records should follow the schema prescribed in the appropriate Metamarkets data formatting guide. Although you can batch records into files of up to 1 MB in size (compressed), batch sizes of approximately 0.5 MB (measured before compression) are recommended for best performance.
Be sure to use UTF-8 encoding and set Content-type: application/json in the HTTP header.
Files should be compressed to maximize throughput. Compressed formats allow for faster ingestion of your log files by RDI, in addition to reducing your storage and delivery costs. RDI accepts files compressed with the gzip format.
If using compression (recommended), add an extension to the file name indicating the type of compression (e.g., <name>.gz).
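The checklist above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name, the 512 KiB target, and the record fields used in the usage note are assumptions; the actual schema comes from the Metamarkets data formatting guide.

```python
import gzip
import json

# Assumed target: roughly 0.5 MB of uncompressed JSON per batch.
TARGET_BATCH_BYTES = 512 * 1024


def build_batches(records, target_bytes=TARGET_BATCH_BYTES):
    """Yield gzip-compressed batches of newline-delimited, UTF-8 JSON."""
    lines, size = [], 0
    for record in records:
        # One JSON record per line, terminated by a newline character.
        line = json.dumps(record, ensure_ascii=False).encode("utf-8") + b"\n"
        # Flush before the target is exceeded, but never emit an empty batch.
        if lines and size + len(line) > target_bytes:
            yield gzip.compress(b"".join(lines))
            lines, size = [], 0
        lines.append(line)
        size += len(line)
    if lines:
        yield gzip.compress(b"".join(lines))
```

For example, `build_batches([{"timestamp": "2015-01-01T00:00:00Z", "clicks": 1}])` yields a single gzip payload holding one line of JSON (the field names here are hypothetical).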
To upload your data, initiate the connection to the RDI endpoints using your chosen HTTP-client library. RDI will respond with an HTTP 2xx code within 50–100 ms, depending on the volume of data being uploaded.
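As a rough sketch of a single upload using only the Python standard library: the endpoint URL below is a placeholder, the Content-Encoding header for gzip bodies is an assumption rather than a documented RDI requirement, and authentication is omitted because the credential mechanism is not described here.

```python
import urllib.error
import urllib.request

# Hypothetical endpoint; substitute the URL issued with your account.
RDI_URL = "https://rdi.example.com/events"


def build_request(body, url=RDI_URL, compressed=True):
    """Build a POST request carrying one batch of newline-delimited JSON."""
    headers = {"Content-Type": "application/json"}
    if compressed:
        # Assumption: gzip-compressed bodies are flagged with Content-Encoding.
        headers["Content-Encoding"] = "gzip"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def post_batch(body, timeout=15.0):
    """Send one batch and return the HTTP status code of the response."""
    try:
        with urllib.request.urlopen(build_request(body), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return err.code
```

Note that `urlopen` opens a new connection per call, so this form suits low volumes or testing; a pooled, persistent client is preferable at scale.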
Be sure to deliver data in a smooth and continuous pattern. Attempting to deliver data in large batches, for example once per hour, may cause the volume rate to exceed your quota and result in a failed upload (HTTP 420 error code).
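One simple way to reason about smooth delivery is to derive the pause between posts from your steady event rate. The helper below is an illustrative calculation only; its name and parameters are assumptions.

```python
def post_interval_seconds(events_per_batch, target_events_per_second):
    """Seconds to wait between posts so events arrive at a steady rate."""
    if events_per_batch <= 0 or target_events_per_second <= 0:
        raise ValueError("both arguments must be positive")
    return events_per_batch / target_events_per_second
```

For instance, a source producing 3,600,000 events per hour averages 1,000 events per second; posting batches of 500 events every `post_interval_seconds(500, 1000)` seconds (0.5 s) keeps delivery even, whereas one hourly post would deliver the full 3,600,000 events at once.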
If you experience an issue with the upload, see the advice in the Best Practices and the FAQ.
These recommendations are intended to provide you with a trouble-free process when uploading your data to RDI. Some recommendations are not applicable for low data volumes, but will make scaling volumes an easier and smoother process.
Use a connection pool in your HTTP client to allow for concurrent connections when data is posted. Set the pool to have a maximum of four connections. If throughput suffers or if you notice that all four connections are often being used simultaneously, increase this number. Set a timeout of 15 seconds; connections should never reach this timeout unless your network is down or extremely noisy.
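The pool settings above might look like this with the Python standard library. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and many HTTP-client libraries ship their own pooling, which you would normally configure instead.

```python
import http.client
import queue
from contextlib import contextmanager


class ConnectionPool:
    """A small fixed-size pool of persistent HTTPS connections."""

    def __init__(self, host, size=4, timeout=15.0):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # HTTPSConnection connects lazily, on the first request.
            self._idle.put(http.client.HTTPSConnection(host, timeout=timeout))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while all connections are in use
        try:
            yield conn
        except (http.client.HTTPException, OSError):
            conn.close()  # the connection reopens automatically on next use
            raise
        finally:
            self._idle.put(conn)

    def idle_count(self):
        """Number of connections not currently checked out."""
        return self._idle.qsize()
```

Inside `with pool.connection() as conn:` you would call `conn.request("POST", path, body, headers)` and then read the response fully with `conn.getresponse().read()`, since `http.client` needs the previous response consumed before the connection can be reused. If `idle_count()` is frequently zero, that is the signal described above to raise the pool size.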
To make efficient use of HTTPS connections, ensure that they are persistent (keep-alive is enabled). HTTP/1.1 uses persistent connections by default. See the documentation for your HTTP client library for configuring persistent connections.
If you wish to upload a substantial amount of "catch up" or historical data, contact your Metamarkets account manager to discuss setting up a backfill operation. A data set that covers several hours or more may be too large to upload via the method described in the troubleshooting section.
Check the HTTP status of each response for a success (2xx) or failure (e.g., 4xx) code. See the troubleshooting section below on how to handle errors.
To prevent data loss, implement a procedure for handling general delivery failures, such as when your volume substantially exceeds quotas.
For retries, use an exponential back-off algorithm.
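A retry loop with exponential back-off could be sketched as follows. The `send` callable, the attempt limit, and the delay values are assumptions chosen for illustration; only the treatment of HTTP 420 and 5xx as retryable follows the troubleshooting guidance in this document.

```python
import time


def post_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that posts one batch and returns the
    HTTP status code. The last status seen is returned.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        # Quota errors (420) and server errors (5xx) are worth retrying.
        retryable = status == 420 or 500 <= status <= 599
        if not retryable:
            return status
        if attempt < max_attempts - 1:
            # The delay doubles each attempt: 1 s, 2 s, 4 s, ... up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return status
```

If the function returns a retryable status after the final attempt, the caller decides whether to drop the batch or store it for later delivery. Adding random jitter to each delay is a common refinement when many senders retry at once.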
If you are uploading lookup tables to an AWS S3 bucket, use the us-east region to maximize proximity to the RDI servers.
If you experience any problems after you begin uploading data to RDI, refer to these solutions to common issues. If you still cannot resolve your issue, contact your Metamarkets representative or send an email to support@metamarkets.com.
You may notice that data that has been successfully posted to RDI has not been immediately surfaced in the dashboard. RDI is intended for displaying current data. Current data is defined as data with events that have a timestamp no more than 10 minutes behind the time the data is posted to RDI.
Data with a timestamp more than 10 minutes older than the current time is saved and usually surfaced in your dashboard within 48 hours. Data older than 2 days is stored but not automatically surfaced. In this case, contact your Metamarkets account manager to arrange for the data to be backfilled.
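The timing rules above can be expressed as a small check that a sender might run before posting. The function name and the three labels are hypothetical; the thresholds are the 10-minute and 2-day limits stated in the text.

```python
from datetime import datetime, timedelta, timezone

ON_TIME_WINDOW = timedelta(minutes=10)  # surfaced right away
LATE_WINDOW = timedelta(days=2)         # saved, usually surfaced within 48 hours


def delivery_class(event_time, now=None):
    """Classify an event by the age of its timestamp at posting time."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    if age <= ON_TIME_WINDOW:
        return "current"
    if age <= LATE_WINDOW:
        return "late"
    # Older than 2 days: stored, but surfaced only through a backfill.
    return "backfill"
```

Both timestamps must be timezone-aware (or both naive) for the subtraction to work.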
If your data uploads fall behind, do not attempt to backfill all of the data by uploading it at once. That will likely exceed your volume quota and result in a failed upload (HTTP 420 error code). Instead, increase the overall throughput rate at which data is posted to a level between the normal rate and the quota (but no more than twice the normal rate), until data timestamps are near current time. If an HTTP 420 error code is returned, reduce the rate.
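The catch-up rule can be written as a short calculation. This is a sketch under assumptions: the function name and the `fraction` parameter are illustrative, and how far to step down after an HTTP 420 is left to the caller because the text does not specify it.

```python
def catch_up_rate(normal_rate, quota_rate, fraction=0.5):
    """Pick a posting rate between normal and min(quota, 2 * normal).

    fraction=0.0 keeps the normal rate; fraction=1.0 uses the full ceiling.
    Rates are in events per unit of time, in whatever unit your quota uses.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ceiling = min(quota_rate, 2 * normal_rate)
    if ceiling <= normal_rate:
        # No headroom under the quota: stay at the normal rate.
        return normal_rate
    return normal_rate + fraction * (ceiling - normal_rate)
```

With a normal rate of 1,000 events per second and a quota of 3,000, the ceiling is 2,000 (twice normal), so the default returns 1,500. If a 420 comes back, calling the function again with a smaller `fraction` is one way to reduce the rate.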
In order to support your actual data requirements, your agreement with Metamarkets includes a quota for a maximum number of events per stated time period. If that quota is violated, an HTTP 420 code is returned, indicating that the undelivered data should be dropped or re-sent using an exponential backoff pattern. Contact your Metamarkets account manager if you need more information about your quota or require a change.
Other 4xx codes indicate a problem with the received request. For example, 401 indicates an access problem related to permission to perform a POST, while 404 indicates that the requested resource does not exist. For these problems, confirm that your HTTP-client code has the correct HTTPS endpoint URL and credentials to successfully connect.
If RDI servers are temporarily unavailable or overloaded, an HTTP 5xx error code is returned. Either drop the undelivered data or retry sending it using an exponential backoff pattern.
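The status codes discussed in this section can be summarized in one dispatch function. The function name and the returned labels are hypothetical; the mapping itself follows the descriptions above.

```python
def action_for_status(status):
    """Map an HTTP status code from RDI to a suggested client action."""
    if 200 <= status <= 299:
        return "ok"
    if status == 420:
        # Quota exceeded: drop the data or re-send with exponential backoff.
        return "retry_with_backoff"
    if 400 <= status <= 499:
        # For example 401 or 404: check the endpoint URL and credentials.
        return "fix_request"
    if 500 <= status <= 599:
        # Servers unavailable or overloaded: drop or retry with backoff.
        return "retry_with_backoff"
    return "unexpected"
```

Retrying a `fix_request` result without changing the request is unlikely to help, which is why it is kept separate from the backoff cases.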