Technical Deep Dive¶

Intro¶

The bitdrift Capture SDK makes it easy, quick and cheap to gather insights from applications.

Easy, because the SDK provides ergonomic APIs that let developers emit as much telemetry as they want without worrying about a negative impact on user experience. In addition, the SDK's powerful and configurable out-of-the-box telemetry simplifies integration and provides detailed insights into what’s happening in applications with a few lines of code.
Quick, because the SDK’s persistent connection to the bitdrift control plane and its rich set of real-time configuration settings allow developers to change the behavior of connected clients without any code changes, shortening development cycles from weeks to minutes.
Cheap in terms of dollars spent, because the SDK's extensive, server-controlled configuration lets users choose which emitted telemetry is actually sent to remote services. The bitdrift control plane moves the processing and analysis of data out of the warehouse and puts the processing smarts next to where the data is generated.

Overview¶

The SDK’s core is implemented in Rust and shared by the bitdrift family of products on both server and mobile. The common core helps to ensure consistent behavior and a consistent feature set on each of the supported platforms. It also helps to keep the Capture SDK fast and memory safe, two properties the Rust language is known for.

The Rust core provides the heart of the Capture SDK: the ring buffer. The ring buffer is the component that makes the illusion of Capture SDK telemetry being “free”¹ possible. It uses multi-tiered storage: for the best possible performance, telemetry is first written to (bounded) RAM and later flushed to (bounded) disk. Disk operations are further optimized through memory-mapped files managed by the OS. All disk operations are performed on background threads that are created and managed by the Rust core.
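As a rough illustration of why memory-mapped files help here, the following Python sketch pre-sizes a file, maps it into memory, and writes to it with an ordinary slice assignment. It is not the SDK's Rust implementation; the file name, the 4 KiB size, and the payload are made up for the example.

```python
import mmap
import os
import tempfile

SIZE = 4096  # illustrative fixed upper bound on disk usage

path = os.path.join(tempfile.mkdtemp(), "buffer.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.truncate(SIZE)  # pre-size the file so its disk usage stays bounded

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as mapped:
        payload = b"log: app launched"
        mapped[0:len(payload)] = payload  # a plain memory write, no write() syscall
        mapped.flush()                    # ask the OS to persist the dirty pages

with open(path, "rb") as f:
    on_disk = f.read(len(payload))
```

After the write, the OS decides when the dirty pages reach the disk, which keeps the writing thread off the slow path.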

The Rust core is also responsible for maintaining a bidirectional stateful connection between the SDK and the bitdrift control plane. The client uses the connection to retrieve the configuration from the control plane and to upload requested telemetry. The SDK caches fetched configuration on disk and continues to operate normally even if the control plane is temporarily unavailable. The persistent connection makes the control plane aware of all connected devices and enables it to send different configurations to different SDK instances² in a performant way.

The bitdrift control plane configures fundamental aspects of how the SDK collects and sends telemetry data. It controls the number and size of buffers used by the SDK, instructs the SDK as to which telemetry should go to which of the configured buffer(s) (if any), and allows for a dynamic definition of workflows. Workflows define multi-step flows that, when observed, trigger a particular SDK action such as a flush of collected telemetry data or an emission of new telemetry data³. Workflows are a key element of the SDK: they move the processing of the data close to where it’s emitted, enabling Capture SDK customers to do things that were previously infeasible due to high cost and resource usage.

Mobile¶

The bitdrift mobile SDK is the Rust core with thin Swift and Kotlin wrappers for iOS and Android platforms, respectively. The shared Rust core helps to ensure consistency in the SDK’s behavior on each platform.

Networking¶

While the persistent connection between the SDK and the bitdrift control plane is managed by the Rust core, the actual network requests are performed by the platform networking libraries (URLSession on iOS and OkHttp on Android) so that the SDK follows networking best practices on each OS.

Resource Utilization¶

One of the main design goals for the SDK is to provide the illusion that telemetry is free by using as few resources as possible. This section explains how the SDK accomplishes that goal while giving customers settings that can be used to adjust the behavior to their needs.

Memory & Disk¶

Takeaways

The SDK has configurable settings for the maximum amount of RAM and disk space it can use for various parts of the system.
Resource consumption changes can be applied in real time to all connected clients, and can differ between clients. For example, clients with more limited resources can be configured to consume fewer resources.
For best performance, the SDK writes telemetry to RAM first and flushes it to disk later.
For best performance, disk operations are batched and performed using low-level APIs such as memory-mapped files.

Multiple components of the Capture SDK support configurable settings that allow SDK customers to adjust the amount of resources that the SDK uses.

In particular, the ring buffer that the SDK uses to store telemetry offers settings for controlling how much RAM and disk space the buffer uses. The SDK supports the creation of multiple ring buffers, each one of them with different limits set up.

Example portion of bitdrift control plane configuration sent to the Capture SDK:

YAML

  - name: "Verbose Buffer"
    id: "default_buffer_id"
    buffer_sizes:
      volatile_buffer_size_bytes: 2097152 # 2 MiB
      non_volatile_buffer_size_bytes: 5242880 # 5 MiB

  - name: "Continuous buffer"
    id: "continuous_buffer"
    buffer_sizes:
      volatile_buffer_size_bytes: 2097152 # 2 MiB
      non_volatile_buffer_size_bytes: 5242880 # 5 MiB

Lower limits help to keep the SDK footprint small, but reduce the amount of telemetry that the SDK can store for future uploads.
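To make the trade-off concrete, the worst-case footprint of the example configuration can be worked out with simple arithmetic. The Python sketch below is illustrative only: the dictionary layout is not an SDK data structure, and it assumes that the per-buffer limits simply add up.

```python
MIB = 1024 * 1024

# Mirrors the two buffers from the example configuration above.
# The dict layout is illustrative, not an SDK data structure.
buffers = [
    {"id": "default_buffer_id", "volatile": 2097152, "non_volatile": 5242880},
    {"id": "continuous_buffer", "volatile": 2097152, "non_volatile": 5242880},
]

def worst_case_footprint(buffers):
    """Return (ram_bytes, disk_bytes), assuming per-buffer limits add up."""
    ram = sum(b["volatile"] for b in buffers)
    disk = sum(b["non_volatile"] for b in buffers)
    return ram, disk

ram, disk = worst_case_footprint(buffers)
print(f"RAM: {ram / MIB:.0f} MiB, disk: {disk / MIB:.0f} MiB")
# prints "RAM: 4 MiB, disk: 10 MiB"
```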

Network¶

Takeaways

The SDK does not upload any telemetry by default.
The SDK can be configured to upload any or all emitted telemetry.
Configuration changes can be applied in real-time to all connected clients.

By default, the SDK does not upload any emitted telemetry, keeping its network bandwidth usage minimal. That behavior is controlled by configuration coming from the bitdrift control plane, which can be updated and broadcast to connected clients at any time. The SDK can be shipped first and its configuration updated later. For example, the part of the configuration that controls which telemetry is uploaded can be updated on the fly.
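The "nothing is uploaded until the configuration says so" behavior can be sketched as a rule set that starts empty and is swapped at runtime. This is a hypothetical Python model, not the SDK's API: `UploadPolicy`, `apply_config`, and the rule format are names invented for the example.

```python
import threading

class UploadPolicy:
    """Decides which telemetry is uploaded; with no rules, nothing is."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rules = []  # default: empty, so nothing is uploaded

    def apply_config(self, rules):
        """Swap in rules pushed by the (hypothetical) control plane."""
        with self._lock:
            self._rules = list(rules)

    def should_upload(self, record):
        with self._lock:
            rules = self._rules
        return any(rule(record) for rule in rules)

policy = UploadPolicy()
record = {"level": "error", "message": "payment failed"}

before = policy.should_upload(record)  # no configuration yet
policy.apply_config([lambda r: r["level"] == "error"])
after = policy.should_upload(record)   # the same record now matches a rule
```

The application code that emits `record` is identical before and after the update; only the configuration changes.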

The live-configuration feature of the SDK is possible thanks to a bidirectional streaming connection that each Capture SDK client establishes when it’s initialized. Throughout the lifetime of the SDK, the persistent connection between the client and the server is used:

By the server to learn about all connected clients
By the server to push configuration updates to connected clients
By the SDK to upload requested telemetry
By the SDK to report errors and stats

CPU¶

Takeaways

The Capture SDK can be configured and used from any thread and/or queue.
The SDK performs most of its work on its own background threads.
The SDK configuration and telemetry emissions are fast operations.

Once configured, the SDK creates background threads that it uses to perform the majority of its tasks. All heavy lifting is moved to background threads to minimize the impact the SDK has on the rest of the application.

Both the configuration and the emission of telemetry are fast operations from the perspective of their callers⁴. They do not block on any disk operations or perform any other “heavy” work.

With respect to the emission of logs in particular, developers are encouraged to emit them from any thread, as the great majority of processing is moved off the caller’s thread and onto the aforementioned background threads created by the SDK.

Platform Specific Details¶

All platform-specific operations and APIs are accessed on background threads/queues where appropriate. The exception is some of the out-of-the-box telemetry that depends on mobile platform APIs that require the use of the main thread⁵. The configuration of out-of-the-box telemetry provides multiple fine-grained settings that can be used to disable/enable features as desired.

On iOS, APIs that accept queues, such as notification center subscriptions, use serial queues targeting one of the global GCD (Grand Central Dispatch) queues. The queue hierarchy used by the SDK resembles a tree with a few root queues that are targeted by multiple queues used by various SDK subsystems.

Mobile Binary Size¶

Takeaways

As of this writing, the SDK increases app download size by:

< 1 MB on iOS (App Store)
< 1 MB on Android (Google Play Store)

The SDK is currently distributed as a static library on iOS and a dynamic library on Android.

Static linking on iOS allows the toolchain to perform extra optimizations at link time. Android applications do not support linking native static libraries such as the Capture SDK, so the SDK is distributed as a dynamic library on Android.

App Store / Google Play Store Submissions¶

Takeaways

The SDK does not request permissions to any resources that require showing a system dialog, e.g., user location, microphone or camera.
The SDK does not show any kind of UI.

SDK Flows¶

Initialization¶

Takeaways

The SDK does not perform any work until it is configured.
Configuration of the SDK does not perform any heavy blocking work.

The configuration of the SDK is performed with a start(...) method call. The SDK doesn’t do any work until it’s configured, allowing developers to hide the usage of the SDK behind a feature flag or a runtime variable.

Upon configuration of the SDK, a bidirectional stream is established with the bitdrift control plane to inform the SaaS of the client's existence and to retrieve the most up-to-date configuration for it⁶. Once retrieved from the server, the configuration is cached on disk and loaded by the SDK on subsequent startups to help with cases of poor network connectivity. Every time a client receives an updated configuration from the server, it overwrites the existing cached configuration on disk (if any).
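The cache-then-overwrite behavior can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `ConfigCache` class, the JSON encoding, and the file name are all hypothetical, and the write-to-temp-then-rename step is a common way to avoid half-written files rather than a description of what the SDK does internally.

```python
import json
import os
import tempfile

class ConfigCache:
    """Hypothetical on-disk cache for the last configuration received."""

    def __init__(self, path):
        self._path = path

    def load(self):
        """Return the cached configuration, or None if none is usable."""
        try:
            with open(self._path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def store(self, config):
        """Overwrite the cached configuration via a temp file and rename."""
        directory = os.path.dirname(self._path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(config, f)
        os.replace(tmp_path, self._path)  # atomic replacement of the old file

cache = ConfigCache(os.path.join(tempfile.mkdtemp(), "config.json"))
first_start = cache.load()      # nothing cached on the very first start
cache.store({"version": 1})     # configuration received from the server
cache.store({"version": 2})     # a later update overwrites the cached copy
```

On the next start, `cache.load()` returns the most recently stored configuration without any network round trip.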

Telemetry Processing Pipeline¶

The telemetry processing pipeline consists of multiple steps, each of which plays a role in keeping the performance impact of the Capture SDK low.

All of the buffers used by the SDK enforce a configurable upper bound on their memory use.

Async Telemetry Buffer¶

The async telemetry buffer acts as a gateway for telemetry to move from the caller thread to the Rust core background thread. Only after a given piece of data arrives on the Rust core background thread does the SDK call into registered grouping, date and field providers to retrieve extra information to attach to the outgoing telemetry.
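The handoff described above can be sketched with a bounded queue and a worker thread. This Python model is illustrative only: the class and parameter names are invented, and dropping a record when the queue is full is an assumption made for the example, not documented SDK behavior. What the sketch does show is that the provider runs on the worker thread, never on the caller's thread.

```python
import queue
import threading

class AsyncTelemetryBuffer:
    """Illustrative caller-thread to background-thread handoff."""

    def __init__(self, capacity, field_provider, sink):
        self._queue = queue.Queue(maxsize=capacity)  # bounded memory use
        self._field_provider = field_provider
        self._sink = sink
        self._worker = threading.Thread(
            target=self._run, name="telemetry-worker", daemon=True
        )
        self._worker.start()

    def emit(self, message):
        """Callable from any thread; never blocks. Returns False if dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False  # assumption: drop when the bounded queue is full

    def close(self):
        self._queue.put(None)  # sentinel that tells the worker to stop
        self._worker.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                return
            # Providers are consulted here, on the background thread.
            record = {"message": message, **self._field_provider()}
            self._sink(record)

records = []
buf = AsyncTelemetryBuffer(
    capacity=64,
    field_provider=lambda: {"thread": threading.current_thread().name},
    sink=records.append,
)
buf.emit("app launched")
buf.close()
```

The recorded `thread` field is the worker's name, which confirms that the extra fields were gathered off the emitting thread.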

Pre-config Telemetry Buffer¶

The pre-config telemetry buffer holds telemetry retrieved from the async telemetry buffer while no ring buffer is ready to accept it for processing. This happens for a brief period⁷ of time after the Capture SDK is configured and the async process of creating the ring buffer is started.
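A holding buffer of this kind can be sketched in a few lines. The Python model below is hypothetical: `PreConfigBuffer` and `attach` are invented names, a plain list stands in for the ring buffer, and discarding the oldest record when the bound is reached is an assumption chosen for the example.

```python
from collections import deque

class PreConfigBuffer:
    """Holds records until a ring buffer is ready, then forwards them."""

    def __init__(self, max_records):
        # Assumption: when the bound is hit, the oldest record is discarded.
        self._pending = deque(maxlen=max_records)
        self._ring_buffer = None

    def write(self, record):
        if self._ring_buffer is None:
            self._pending.append(record)
        else:
            self._ring_buffer.append(record)

    def attach(self, ring_buffer):
        """Called once the ring buffer exists; replays held records in order."""
        self._ring_buffer = ring_buffer
        while self._pending:
            ring_buffer.append(self._pending.popleft())

pre = PreConfigBuffer(max_records=3)
for message in ["a", "b", "c", "d"]:
    pre.write(message)   # "a" is discarded once the bound of 3 is exceeded

ring = []                # stand-in for the real ring buffer
pre.attach(ring)
pre.write("e")           # written straight through after attachment
```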

Ring Buffer¶

The ring buffer is the final destination for telemetry processed within the SDK. From here, telemetry may or may not be selected for upload. The buffer stores incoming telemetry in its RAM storage and periodically flushes it to the underlying persistent disk storage.

The bitdrift control plane allows configuring multiple ring buffers for every Capture SDK instance. Each one of the configured ring buffers may have separate:

Rules for what telemetry should go into a buffer.
Rules for what telemetry should be uploaded.
Limits for the maximum amount of resources a buffer may use.

The size of the buffer - both its RAM and persistent disk portions - is configurable via the control plane. The buffer supports real-time hot swapping of its configuration which enables SDK users to update ring buffer configuration in their applications at any time. As the buffer gets full it starts replacing the oldest stored telemetry with incoming data.
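The overwrite-oldest behavior and the runtime resize can be illustrated with a small in-memory model. This is a Python sketch under simplifying assumptions: it tracks only payload bytes, keeps everything in RAM, and uses invented names (`ByteBoundedRingBuffer`, `resize`); the SDK's actual ring buffer is implemented in Rust and spans RAM and disk.

```python
from collections import deque

class ByteBoundedRingBuffer:
    """In-memory model of a size-bounded buffer that evicts oldest records."""

    def __init__(self, max_bytes):
        self._max_bytes = max_bytes
        self._records = deque()
        self._used = 0

    def write(self, record: bytes) -> bool:
        if len(record) > self._max_bytes:
            return False  # a record larger than the buffer can never fit
        while self._used + len(record) > self._max_bytes:
            self._used -= len(self._records.popleft())  # evict the oldest
        self._records.append(record)
        self._used += len(record)
        return True

    def resize(self, max_bytes):
        """Apply a new limit at runtime, evicting oldest records if needed."""
        self._max_bytes = max_bytes
        while self._used > self._max_bytes:
            self._used -= len(self._records.popleft())

    def records(self):
        return list(self._records)

rb = ByteBoundedRingBuffer(max_bytes=10)
for payload in [b"aaaa", b"bbbb", b"cccc"]:
    rb.write(payload)
after_writes = rb.records()  # b"aaaa" was evicted to make room for b"cccc"

rb.resize(4)                 # a smaller limit arrives from configuration
after_resize = rb.records()
```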

Telemetry stored in the ring buffer is resilient to app crashes and unexpected terminations of the app in general. Partially written records do not corrupt the entire storage.
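One common way to get this property is to frame each record with its length and a checksum, so that a reader can detect a record that was cut short by a crash and stop there. The Python sketch below shows that generic technique; it is not a description of bitdrift's on-disk format, and the header layout is an assumption made for the example.

```python
import struct
import zlib

# Assumed header layout: payload length and CRC32 of the payload,
# both as little-endian unsigned 32-bit integers.
HEADER = struct.Struct("<II")

def encode_record(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def read_records(data: bytes):
    """Return all complete records; stop at the first torn or corrupt one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, checksum = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break  # partially written record: ignore it and everything after
        records.append(payload)
        offset = start + length
    return records

log = encode_record(b"first") + encode_record(b"second")
torn = log + encode_record(b"third")[:-2]  # simulate a crash mid-write
recovered = read_records(torn)
print(recovered)  # prints [b'first', b'second']
```

The two complete records survive; only the record that was being written at the time of the simulated crash is lost.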

Workflows¶

Workflows move the processing of telemetry data from remote warehouses to where it’s emitted. Because the processing happens on the device, the data does not need to be sent to bitdrift to be processed, which limits the SDK’s use of resources such as network bandwidth and the dollar amount spent on storing collected telemetry data.

Controlled by bitdrift’s control plane and distributed to all connected clients, workflows tell the SDK how to process emitted telemetry. A state machine allows the SDK to track not only discrete events, such as the emission of a given piece of telemetry, but whole flows where a sequence of telemetry is recorded (e.g., the application was launched and a user saw a crash after going through a user registration flow).

Workflow diagram definition example

A workflow consists of step(s) and exit node(s). Steps are defined by matching operations on emitted telemetry data, e.g., the log message needs to be equal to “foo”. For each step, multiple matching rules can be combined using logical operators, e.g., the log message is equal to “foo” and the log level is equal to “error”. Exit nodes define the actions to perform once the workflow has gone through all of the preceding steps.

The workflows implementation ensures low and bounded usage of resources:

Steps support regex matching in linear time, and the regexes are precompiled.
Workflows are finite and bounded by time. The number of iterations and instances of a workflow is bounded.
Workflows are represented by finite automata that are built by the bitdrift control plane and sent to the client ready to be run.
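The step-and-exit structure described above can be sketched as a tiny state machine. This Python model is a simplification with invented names (`Workflow`, `message_is`, `all_of`) and made-up log messages: it handles a single linear sequence of steps, uses plain equality instead of regexes, and leaves out the time bounds and instance limits listed above.

```python
class Workflow:
    """Linear workflow: ordered steps followed by one exit action."""

    def __init__(self, steps, action):
        self._steps = steps    # predicates over an emitted record
        self._action = action  # run once every step has matched in order
        self._position = 0

    def process(self, record):
        if self._steps[self._position](record):
            self._position += 1
            if self._position == len(self._steps):
                self._position = 0  # reset so the workflow can run again
                self._action()

def message_is(text):
    return lambda r: r["message"] == text

def all_of(*rules):
    """Combine matching rules with a logical AND."""
    return lambda r: all(rule(r) for rule in rules)

actions = []
workflow = Workflow(
    steps=[
        message_is("app_launch"),
        message_is("registration_started"),
        all_of(message_is("foo"), lambda r: r["level"] == "error"),
    ],
    action=lambda: actions.append("flush_buffer"),
)

for record in [
    {"message": "app_launch", "level": "info"},
    {"message": "foo", "level": "error"},  # ignored: step 2 not reached yet
    {"message": "registration_started", "level": "info"},
    {"message": "foo", "level": "info"},   # ignored: level does not match
    {"message": "foo", "level": "error"},  # completes the workflow
]:
    workflow.process(record)
```

Each record costs one predicate evaluation and the only state kept is an integer position, which is what keeps resource usage low and bounded in this model.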

Production Readiness¶

The SDK is production ready. In fact, it’s currently used by millions of users on a daily basis. Among others, it’s currently active in the Lyft mobile app on both iOS and Android platforms.

Notes¶

1. “Free” as in so cheap to emit, in terms of device resources, that developers can emit as much telemetry as they want without having to worry about resource usage.
2. E.g., “send configuration X to iOS users running version 7.15 of the app”.
3. I.e., emission of a new stat or log.
4. The configuration of the SDK on the Android platform performs a System.loadLibrary call that, depending on the resource contention in the application, may not be a lightweight operation. That call is performed by all Android native libraries, e.g., Google Maps.
5. An example is the “Session Replay” feature, which traverses the view hierarchy on iOS and Android platforms. The SDK can be configured to have the feature enabled or disabled, and the developer can control the frequency at which Session Replay traverses the view hierarchy.
6. The Capture SDK supports sending different configurations to different Capture SDK instances. For example, the SDK on iOS may be configured to store more telemetry than the SDK on Android.
7. The only time when this process may take more than a fraction of a second is after the first configuration of the SDK on a given device. In that case, the SDK needs to retrieve the configuration from the server before it creates the ring buffer. When connectivity is poor, this process may take some time. Nevertheless, even without a cached configuration or the ability to fetch one from the server, the SDK continues to collect emitted telemetry using a default configuration.