Spark mode - SDMX source
The vtl-sdmx module exposes the following utilities.
buildStructureFromSDMX3 utility
TrevasSDMXUtils.buildStructureFromSDMX3 builds a Trevas DataStructure from an SDMX file.
Combined with the corresponding data, this structure lets you build a Trevas Dataset.
Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "STRUCT_ID");
SparkDataset ds = new SparkDataset(
        spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .option("quote", "\"")
                .csv("path"),
        structure
);
SDMXVTLWorkflow object
The SDMXVTLWorkflow constructor takes 3 arguments:
- a ScriptEngine (Trevas or another)
- a ReadableDataLocation to handle an SDMX message
- a map of names / Datasets
SparkSession spark = SparkSession.builder()
        .appName("test")
        .master("local")
        .getOrCreate();
ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");
ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
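The engine lookup used above goes through the standard javax.script API, where getEngineByExtension returns null when no engine is registered for the requested extension (for example when the engine jar is missing from the classpath). A minimal sketch of a guard around that lookup follows; the EngineLookup class and requireEngine helper are hypothetical names introduced for illustration and are not part of Trevas.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class EngineLookup {

    // Hypothetical helper: returns the engine registered for the given
    // file extension, or fails with an explicit message instead of
    // handing back null.
    static ScriptEngine requireEngine(String extension) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension(extension);
        if (engine == null) {
            throw new IllegalStateException(
                    "No ScriptEngine registered for extension '" + extension
                            + "'; check that the engine is on the classpath");
        }
        return engine;
    }

    public static void main(String[] args) {
        try {
            // "vtl" is the extension used in the examples of this page.
            ScriptEngine engine = requireEngine("vtl");
            System.out.println("Found engine: " + engine.getFactory().getEngineName());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing early here gives a clearer error than the NullPointerException that would otherwise surface on the first engine.put call.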
An SDMXVTLWorkflow object then gives access to the following three functions.
SDMXVTLWorkflow run function - Preview mode
The run function can easily be called in preview mode, without attached data.
ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");
ReadableDataLocation rdl = new ReadableDataLocationTmp("src/test/resources/DSD_BPE_CENSUS.xml");
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
// instead of using TrevasSDMXUtils.buildStructureFromSDMX3 and data sources
// to build Trevas Datasets, sdmxVtlWorkflow.getEmptyDatasets()
// will handle SDMX message structures to produce Trevas Datasets
// with metadata defined in this message, and adding empty data
Map<String, Dataset> emptyDatasets = sdmxVtlWorkflow.getEmptyDatasets();
engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(emptyDatasets);
Map<String, PersistentDataset> result = sdmxVtlWorkflow.run();
Preview mode lets you check the conformity of the SDMX file and the metadata of the output datasets.
SDMXVTLWorkflow run function
Once an SDMXVTLWorkflow is built, it is easy to run the VTL validations and transformations defined in the SDMX file.
Structured.DataStructure structure = TrevasSDMXUtils.buildStructureFromSDMX3("path/sdmx_file.xml", "ds1");
SparkDataset ds1 = new SparkDataset(
        spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .option("quote", "\"")
                .csv("path/data.csv"),
        structure
);
ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine engine = mgr.getEngineByExtension("vtl");
engine.put(VtlScriptEngine.PROCESSING_ENGINE_NAMES, "spark");
Map<String, Dataset> inputs = Map.of("ds1", ds1);
ReadableDataLocation rdl = new ReadableDataLocationTmp("path/sdmx_file.xml");
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, inputs);
Map<String, PersistentDataset> bindings = sdmxVtlWorkflow.run();
The returned map contains all the datasets defined as persistent in the TransformationSchemes definition.
SDMXVTLWorkflow getTransformationsVTL function
Gets the VTL code corresponding to the SDMX TransformationSchemes definition.
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String vtl = sdmxVtlWorkflow.getTransformationsVTL();
SDMXVTLWorkflow getRulesetsVTL function
Gets the VTL code corresponding to the rulesets defined in the SDMX file.
SDMXVTLWorkflow sdmxVtlWorkflow = new SDMXVTLWorkflow(engine, rdl, Map.of());
String dprs = sdmxVtlWorkflow.getRulesetsVTL();
Troubleshooting
Hadoop client
Integrating vtl-modules with hadoop-client can cause dependency issues.
It was noted that hadoop-client imports com.fasterxml.woodstox.woodstox-core in a version that is incompatible with a vtl-sdmx sub-dependency.
One way to fix this is to exclude the com.fasterxml.woodstox.woodstox-core dependency from hadoop-client and import a newer version in your pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.4</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.woodstox</groupId>
            <artifactId>woodstox-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.fasterxml.woodstox</groupId>
    <artifactId>woodstox-core</artifactId>
    <version>6.5.1</version>
</dependency>
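To confirm which woodstox-core jar actually ends up on the runtime classpath after the exclusion, one option is to ask the JVM where a Woodstox class was loaded from. The sketch below is an illustration only: WoodstoxLocator and locate are hypothetical names, and it assumes that com.ctc.wstx.stax.WstxInputFactory (the StAX input factory shipped in woodstox-core) is a suitable class to probe.

```java
import java.security.CodeSource;
import java.util.Optional;

public class WoodstoxLocator {

    // Hypothetical helper: returns the location (usually a jar path) a class
    // was loaded from, or empty if the class is absent or has no code source
    // (as is the case for core JDK classes).
    static Optional<String> locate(String className) {
        try {
            // Load without initializing, so no static initializer runs.
            Class<?> type = Class.forName(
                    className, false, WoodstoxLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return Optional.empty();
            }
            return Optional.of(source.getLocation().toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(locate("com.ctc.wstx.stax.WstxInputFactory")
                .orElse("woodstox-core not found on the classpath"));
    }
}
```

With a Maven-built classpath the printed location is typically the jar in the local repository, whose file name carries the resolved version.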