Spark mode - SDMX source

The vtl-sdmx module exposes the following utilities.

buildStructureFromSDMX3 utility

TrevasSDMXUtils.buildStructureFromSDMX3 allows you to obtain a Trevas DataStructure from an SDMX message.

By providing the corresponding data, you can then build a Trevas Dataset.

Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "STRUCT_ID");

SparkDataset ds = new SparkDataset(
    spark.read()
        .option("header", "true")
        .option("delimiter", ";")
        .option("quote", "\"")
        .csv("path"),
    structure
);

SDMXVTLWorkflow object

The SDMXVTLWorkflow constructor takes 3 arguments:

  • a ScriptEngine (Trevas or any other)
  • a ReadableDataLocation to handle an SDMX message
  • a map of dataset names to Datasets

SparkSession spark = SparkSession.builder()
    .appName("test")
    .master("local")
    .getOrCreate();

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());

This object then allows you to call the following 3 functions.

SDMXVTLWorkflow run function - Preview mode

The run function can easily be called in preview mode, without attached data.

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());

// Instead of using TrevasSDMXUtils.buildStructureFromSDMX3 and data sources
// to build Trevas Datasets, sdmxVtlWorkflow.getEmptyDatasets() reads the
// structures of the SDMX message and produces Trevas Datasets with the
// metadata defined in this message and empty data
Map<String, Dataset> emptyDatasets = sdmxVtlWorkflow.getEmptyDatasets();
engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(emptyDatasets);

Map<String, PersistentDataset> result = sdmxVtlWorkflow.run();

Preview mode allows you to check the conformity of the SDMX file and the metadata of the output datasets.
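
As a quick illustration, here is a minimal sketch of such a metadata check, assuming PersistentDataset exposes the standard Trevas Dataset metadata accessors:

// Sketch: print the structure of each output dataset computed in preview mode
result.forEach((name, dataset) -> {
    System.out.println("Dataset: " + name);
    dataset.getDataStructure().values().forEach(component ->
        System.out.println("  " + component.getName() + " (" + component.getRole() + ")"));
});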

SDMXVTLWorkflow run function

Once an SDMXVTLWorkflow is built, it is easy to run the VTL validations and transformations defined in the SDMX file.

Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "ds1");

SparkDataset ds1 = new SparkDataset(
    spark.read()
        .option("header", "true")
        .option("delimiter", ";")
        .option("quote", "\"")
        .csv("path/data.csv"),
    structure
);

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

Map<String, Dataset> inputs = Map.of("ds1", ds1);

ReadableDataLocation rdl = new ReadableDataLocationTmp("path/sdmx_file.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, inputs);

Map<String, PersistentDataset> bindings = sdmxVtlWorkflow.run();

As a result, you will receive all the datasets defined as persistent in the TransformationSchemes definition.
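
For instance, a minimal sketch of consuming this result (where "ds2" is a hypothetical name that would have to be declared persistent in your SDMX file):

// Sketch: retrieve one of the persistent results by name
// ("ds2" is a hypothetical identifier, not taken from the example above)
PersistentDataset ds2 = bindings.get("ds2");
System.out.println(ds2.getDataStructure().keySet());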

SDMXVTLWorkflow getTransformationsVTL function

Gets the VTL code corresponding to the SDMX TransformationSchemes definition.

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String vtl = sdmxVtlWorkflow.getTransformationsVTL();
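
Since the result is plain VTL, it can be inspected, stored, or, as a minimal sketch, evaluated again with the same engine (provided the input datasets are bound in the engine scope):

// Sketch: re-evaluate the generated VTL script with the script engine
engine.eval(vtl);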

SDMXVTLWorkflow getRulesetsVTL function

Gets the VTL code corresponding to the SDMX RulesetSchemes definition.

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String dprs = sdmxVtlWorkflow.getRulesetsVTL();

Troubleshooting

Hadoop client

The integration of vtl-modules with hadoop-client can cause dependency issues.

It has been noted that com.fasterxml.woodstox:woodstox-core is imported by hadoop-client with a version that is incompatible with a vtl-sdmx sub-dependency.

A way to fix this is to exclude the com.fasterxml.woodstox:woodstox-core dependency from hadoop-client and import a newer version in your pom.xml:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.4</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.woodstox</groupId>
            <artifactId>woodstox-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.fasterxml.woodstox</groupId>
    <artifactId>woodstox-core</artifactId>
    <version>6.5.1</version>
</dependency>