
Spark mode - SDMX source

The vtl-sdmx module exposes the following utilities.

buildStructureFromSDMX3 utility

TrevasSDMXUtils.buildStructureFromSDMX3 allows you to obtain a Trevas DataStructure from an SDMX file.

By providing the corresponding data, you can then build a Trevas Dataset.

Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "STRUCT_ID");

SparkDataset ds = new SparkDataset(
        spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .option("quote", "\"")
                .csv("path"),
        structure
);
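
Note that the structure obtained from the SDMX message, rather than the CSV header, is what defines the types and roles (identifiers, measures, attributes) of the columns in the resulting Trevas Dataset.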

SDMXVTLWorkflow object

The SDMXVTLWorkflow constructor takes 3 arguments:

  • a ScriptEngine (Trevas or any other)
  • a ReadableDataLocation to handle an SDMX message
  • a map of dataset names to Datasets

SparkSession spark = SparkSession.builder()
        .appName("test")
        .master("local")
        .getOrCreate();

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());

This object then gives you access to the following three functions.

SDMXVTLWorkflow run function - Preview mode

The run function can be called in preview mode, without any data attached.

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());

// Instead of using TrevasSDMXUtils.buildStructureFromSDMX3 and data sources
// to build Trevas Datasets, sdmxVtlWorkflow.getEmptyDatasets()
// reads the structures of the SDMX message and produces Trevas Datasets
// with the metadata defined in this message and empty data
Map<String, Dataset> emptyDatasets = sdmxVtlWorkflow.getEmptyDatasets();
engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(emptyDatasets);

Map<String, PersistentDataset> result = sdmxVtlWorkflow.run();

The preview mode allows you to check the conformity of the SDMX file and the metadata of the output datasets.
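
For instance, one can print the components of each output dataset. A minimal sketch, assuming the map-like Structured.DataStructure API exposed by Trevas:

// Print the components of each output dataset (name, role, type)
result.forEach((name, dataset) ->
        dataset.getDataStructure().values().forEach(component ->
                System.out.println(name + " -> " + component.getName()
                        + " (" + component.getRole() + ", " + component.getType() + ")")));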

SDMXVTLWorkflow run function

Once an SDMXVTLWorkflow is built, it is easy to run the VTL validations and transformations defined in the SDMX file.

Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "ds1");

SparkDataset ds1 = new SparkDataset(
        spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .option("quote", "\"")
                .csv("path/data.csv"),
        structure
);

ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");

Map<String, Dataset> inputs = Map.of("ds1", ds1);

ReadableDataLocation rdl = new ReadableDataLocationTmp("path/sdmx_file.xml");

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, inputs);

Map<String, PersistentDataset> bindings = sdmxVtlWorkflow.run();

As a result, one receives all the datasets defined as persistent in the TransformationSchemes definition.
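
Each PersistentDataset wraps a computed result. As a minimal sketch (assuming a dataset named ds_out is defined as persistent in the SDMX file, and relying on the getDelegate and getSparkDataset accessors):

// "ds_out" is a hypothetical name, defined as persistent in the SDMX file
fr.insee.vtl.model.Dataset delegate = bindings.get("ds_out").getDelegate();
if (delegate instanceof SparkDataset) {
    // Write the underlying Spark dataset, for instance as Parquet
    ((SparkDataset) delegate).getSparkDataset()
            .write()
            .mode("overwrite")
            .parquet("path/output");
}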

SDMXVTLWorkflow getTransformationsVTL function

Gets the VTL code corresponding to the SDMX TransformationSchemes definition.

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String vtl = sdmxVtlWorkflow.getTransformationsVTL();

SDMXVTLWorkflow getRulesetsVTL function

Gets the VTL code corresponding to the ruleset definitions contained in the SDMX message.

SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String dprs = sdmxVtlWorkflow.getRulesetsVTL();
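
Note that rulesets must be defined before the transformations that reference them. If you evaluate the generated code manually instead of calling run, a minimal sketch is:

// Define the rulesets first, then evaluate the transformations that use them
engine.eval(sdmxVtlWorkflow.getRulesetsVTL());
engine.eval(sdmxVtlWorkflow.getTransformationsVTL());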

Troubleshooting

Hadoop client

The integration of vtl-modules with hadoop-client can cause dependency issues.

It was noted that hadoop-client imports com.fasterxml.woodstox:woodstox-core in a version that is incompatible with a vtl-sdmx sub-dependency.

A way to fix this is to exclude the woodstox-core dependency from hadoop-client and import a newer version in your pom.xml:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.4</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.woodstox</groupId>
            <artifactId>woodstox-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.fasterxml.woodstox</groupId>
    <artifactId>woodstox-core</artifactId>
    <version>6.5.1</version>
</dependency>
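
To check which woodstox-core version is actually resolved, you can inspect the dependency tree, for instance with mvn dependency:tree -Dincludes=com.fasterxml.woodstox.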