Spark JMS Connector

The Spark JMS connector provides functionality for reading from and writing to JMS queues using Spark Structured Streaming. To ensure at-least-once delivery for streaming sources, the connector uses a write-ahead log to retain messages that have not yet been successfully written to the destination. The project includes two versions of the connector: one built on Spark DataSource V1 and another on DataSource V2. Both support streaming sources and sinks, offer similar configuration options and features, and use a provider-agnostic approach to connect to different JMS messaging systems. Both implementations are fully functional, but they were created for educational purposes and have some limitations. The main purpose of this project is to provide a working example of how to implement a JMS connector for Apache Spark.
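As a rough sketch of how such a connector is typically wired into a Structured Streaming job: the short format name ("jms") and every option key below are illustrative assumptions, not this connector's documented API — check the actual source for the real names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object JmsStreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jms-connector-example")
      .getOrCreate()

    // Hypothetical source options -- the real keys depend on the connector.
    val input = spark.readStream
      .format("jms") // assumed short name for the data source
      .option("connectionFactoryProvider", "com.example.ActiveMqProvider")
      .option("queue", "input.queue")
      .load()

    // Write each micro-batch back out to another JMS queue.
    val query = input.writeStream
      .format("jms")
      .option("connectionFactoryProvider", "com.example.ActiveMqProvider")
      .option("queue", "output.queue")
      // The checkpoint location is where Spark keeps offsets; the connector's
      // write-ahead log makes at-least-once delivery possible on the source side.
      .option("checkpointLocation", "/tmp/jms-checkpoint")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

Running this requires a Spark runtime and a reachable JMS broker; it is a shape-of-the-API sketch, not a copy-paste recipe.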

Please read the disclaimer below before using this connector.

Features

  • V1 and V2 APIs Support: Includes both Spark DataSource V1 and V2 implementations.
  • Streaming Source Support: Reads from JMS queues as Spark streaming sources.
  • Streaming Sink Support: Writes Spark streaming data to JMS destinations.
  • Configurable Message Format: Supports text (jakarta.jms.TextMessage) and binary (jakarta.jms.BytesMessage) formats.
  • At-Least-Once Delivery for Source: Messages are acknowledged after being successfully stored using a write-ahead log.
  • Provider-Agnostic: Works with any JMS-compliant messaging system (ActiveMQ, IBM MQ, etc.) but requires a corresponding implementation of ConnectionFactoryProvider.
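The feature list above mentions that each messaging system needs a corresponding implementation of ConnectionFactoryProvider. The trait's actual signature is not shown in this README, so the following is an illustration only — the method name, its parameter, and the constructor arguments are assumptions; an implementation for ActiveMQ Artemis might look roughly like this:

```scala
import jakarta.jms.ConnectionFactory
import org.apache.activemq.artemis.jms.client.ActiveMQJMSConnectionFactory

// Hypothetical shape of the provider contract: the real trait may expose a
// different method name or receive configuration in another form.
class ActiveMqProvider extends ConnectionFactoryProvider {
  override def createConnectionFactory(options: Map[String, String]): ConnectionFactory =
    // "brokerUrl" is an assumed option key; 61616 is Artemis's default port.
    new ActiveMQJMSConnectionFactory(options.getOrElse("brokerUrl", "tcp://localhost:61616"))
}
```

The provider class would then be referenced by name in the connector's options so that executors and the driver can instantiate it independently, which is the usual reason connectors take a class name rather than a live ConnectionFactory instance.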

Limitations and Considerations

The connector is still in development and may contain bugs. Since it is primarily a proof of concept, it is not recommended for production use. Its limitations include the following:

  • JMS connections are not pooled.
  • No full support for the JMS 2.0 specification.
  • Sending messages is done on executors, so every executor task creates its own connection.
  • The number of connections used by the sink component depends on the number of partitions.
  • Connections in the sink component are created for each batch and not reused.
  • The connector uses a fail-fast strategy: no retry logic for failed writes is implemented, and the sink component relies on Spark's built-in task retry mechanism.
  • Messages are written to the write-ahead log using Java serialization, which may be inefficient, especially for large messages. Changing the JVM or Spark version can render the existing write-ahead log unreadable.
  • Receiving messages from a queue is done on the driver in a single-threaded manner, which may limit performance in distributed environments.

Requirements

  • Java 17
  • Scala 2.12.x / 2.13.x (Spark 3.5.x) and 2.13.x (Spark 4.0.x / 4.1.x)
  • Apache Spark 3.5.x / 4.0.x / 4.1.x
  • Jakarta JMS API 3.1.0

The code has been tested on Java 17, but it may also work with other Java versions depending on the Apache Spark version you use. For Spark 3.5.x, the library can be built against Scala 2.12 or 2.13; for Spark 4.0.x and 4.1.x, it can only be built against Scala 2.13.
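An illustrative build.sbt fragment matching one of the supported combinations above (Spark 3.5.x with Scala 2.13); the exact patch versions are placeholders, and the connector's own published coordinates, if any, are not listed in this README:

```scala
// build.sbt sketch -- versions chosen to match the Requirements section
scalaVersion := "2.13.14"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster at runtime
  "org.apache.spark" %% "spark-sql"       % "3.5.1" % Provided,
  "jakarta.jms"      %  "jakarta.jms-api" % "3.1.0"
)
```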

License

This project is released under the MIT License, which allows you to freely use, modify, and distribute the code with minimal restrictions. By using this project, you agree to the terms outlined in the license.

Disclaimer

THIS SOFTWARE IS PROVIDED "AS IS" FOR EDUCATIONAL AND EXPERIMENTAL PURPOSES.

This connector should be used with caution in production environments. The author(s) and contributors make no warranties, express or implied, about the completeness, reliability, or suitability of this software for any particular purpose.

The author(s) accept NO RESPONSIBILITY for any consequences, damages, or losses arising from the use of this software, including but not limited to:

  • Data loss or corruption
  • System failures or downtime
  • Performance issues
  • Security vulnerabilities
  • Financial losses

Before deploying to production pipelines:

  • Thoroughly test it in a development/staging environment
  • Implement proper error handling and monitoring
  • Ensure you have adequate backup and recovery procedures
  • Review and understand the code thoroughly
  • Consider your specific use case and requirements

Use at your own risk.