
Trevas - VTL 2.1

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas 1.7.0 upgrades to version 2.1 of VTL.

This version introduces two new operators:

  • random
  • case

random produces a decimal number between 0 and 1.

case allows for clearer multi-conditional branching, for example:

ds2 := ds1[calc c := case when r < 0.2 then "Low" when r > 0.8 then "High" else "Medium"];

Both operators are already available in Trevas!
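
As a minimal sketch, here is how such a script can be run through the Trevas JSR-223 engine from Java. Assumptions: the engine is looked up by its vtl extension, ds1 is bound beforehand, and random takes (seed, index) arguments as described in the VTL 2.1 reference manual.

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class Vtl21Demo {
    public static void main(String[] args) throws ScriptException {
        // Assumption: Trevas registers its JSR-223 engine for the "vtl" extension
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("vtl");

        // Assumption: a fr.insee.vtl.model.Dataset is bound beforehand, e.g.:
        // engine.put("ds1", ds1);

        // random(seed, index) produces a decimal in [0, 1[; case branches on it
        engine.eval(
            "ds2 := ds1[calc r := random(12, 1)]"
          + "[calc c := case when r < 0.2 then \"Low\" "
          + "when r > 0.8 then \"High\" else \"Medium\"];");

        // The resulting dataset is available in the engine bindings
        Object ds2 = engine.getContext().getAttribute("ds2");
    }
}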

The new grammar also provides time operators and includes corrections, without any breaking changes compared to the 2.0 version.

See the coverage section for more details.

Trevas - Provenance

· 4 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.6.0 introduces the VTL Prov module.

This module enables Trevas to produce lineage metadata, based on two RDF ontologies: PROV-O and SDTH.

SDTH model overview

Adopted model

The vtl-prov module, version 1.6.0, uses the following partial model:

Improvements will come in the coming weeks.

Tools available

Provenance Trevas tools are documented here.

Example

Business use case

Two source datasets are transformed to produce transient datasets and a final permanent one.

Inputs

ds1 & ds2 metadata:

| id         | var1    | var2    |
|------------|---------|---------|
| STRING     | INTEGER | NUMBER  |
| IDENTIFIER | MEASURE | MEASURE |

VTL script

ds_sum := ds1 + ds2;
ds_mul := ds_sum * 3;
ds_res <- ds_mul[filter mod(var1, 2) = 0][calc var_sum := var1 + var2];

RDF model target

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX sdth: <http://rdf-vocabulary.ddialliance.org/sdth#>

# --- Program and steps
<http://example.com/program1> a sdth:Program ;
    a prov:Agent ; # Agent? Or an activity?
    rdfs:label "My program 1"@en, "Mon programme 1"@fr ;
    sdth:hasProgramStep <http://example.com/program1/program-step1>,
        <http://example.com/program1/program-step2>,
        <http://example.com/program1/program-step3> .

<http://example.com/program1/program-step1> a sdth:ProgramStep ;
    rdfs:label "Program step 1"@en, "Étape 1"@fr ;
    sdth:hasSourceCode "ds_sum := ds1 + ds2;" ;
    sdth:consumesDataframe <http://example.com/dataset/ds1>,
        <http://example.com/dataset/ds2> ;
    sdth:producesDataframe <http://example.com/dataset/ds_sum> .

<http://example.com/program1/program-step2> a sdth:ProgramStep ;
    rdfs:label "Program step 2"@en, "Étape 2"@fr ;
    sdth:hasSourceCode "ds_mul := ds_sum * 3;" ;
    sdth:consumesDataframe <http://example.com/dataset/ds_sum> ;
    sdth:producesDataframe <http://example.com/dataset/ds_mul> .

<http://example.com/program1/program-step3> a sdth:ProgramStep ;
    rdfs:label "Program step 3"@en, "Étape 3"@fr ;
    sdth:hasSourceCode "ds_res <- ds_mul[filter mod(var1, 2) = 0][calc var_sum := var1 + var2];" ;
    sdth:consumesDataframe <http://example.com/dataset/ds_mul> ;
    sdth:producesDataframe <http://example.com/dataset/ds_res> ;
    sdth:usesVariable <http://example.com/variable/var1>,
        <http://example.com/variable/var2> ;
    sdth:assignsVariable <http://example.com/variable/var_sum> .

# --- Variables
# I think here it's not instances but names we refer to...
<http://example.com/variable/id1> a sdth:VariableInstance ;
    rdfs:label "id1" .
<http://example.com/variable/var1> a sdth:VariableInstance ;
    rdfs:label "var1" .
<http://example.com/variable/var2> a sdth:VariableInstance ;
    rdfs:label "var2" .
<http://example.com/variable/var_sum> a sdth:VariableInstance ;
    rdfs:label "var_sum" .

# --- Data frames
<http://example.com/dataset/ds1> a sdth:DataframeInstance ;
    rdfs:label "ds1" ;
    sdth:hasName "ds1" ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds2> a sdth:DataframeInstance ;
    rdfs:label "ds2" ;
    sdth:hasName "ds2" ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_sum> a sdth:DataframeInstance ;
    rdfs:label "ds_sum" ;
    sdth:hasName "ds_sum" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds1>,
        <http://example.com/dataset/ds2> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_mul> a sdth:DataframeInstance ;
    rdfs:label "ds_mul" ;
    sdth:hasName "ds_mul" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds_sum> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_res> a sdth:DataframeInstance ;
    rdfs:label "ds_res" ;
    sdth:hasName "ds_res" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds_mul> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2>,
        <http://example.com/variable/var_sum> .
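
To sanity-check such a graph, one can load the Turtle and walk the sdth:wasDerivedFrom links upstream. The sketch below uses Apache Jena for this (an implementation choice of this example, not something the vtl-prov module requires; the provenance.ttl file name is hypothetical):

import java.io.FileInputStream;
import org.apache.jena.rdf.model.*;

public class LineageWalk {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        // Hypothetical file holding the RDF model target above
        model.read(new FileInputStream("provenance.ttl"), null, "TTL");

        Property wasDerivedFrom = model.createProperty(
                "http://rdf-vocabulary.ddialliance.org/sdth#wasDerivedFrom");

        // Walk upstream from ds_res: ds_res -> ds_mul -> ds_sum -> ds1, ds2
        walk(model.createResource("http://example.com/dataset/ds_res"),
                wasDerivedFrom, 0);
    }

    static void walk(Resource node, Property p, int depth) {
        System.out.println("  ".repeat(depth) + node.getURI());
        StmtIterator it = node.listProperties(p);
        while (it.hasNext()) {
            walk(it.next().getObject().asResource(), p, depth + 1);
        }
    }
}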

Trevas - SDMX

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.4.1 introduces the VTL SDMX module.

This module enables consuming SDMX metadata sources to instantiate Trevas DataStructures and Datasets.

It also allows executing VTL TransformationSchemes to obtain the resulting persistent datasets.

Overview

VTL SDMX Diagram

Trevas supports the above SDMX message elements. Only the VtlMappingSchemes element is optional.

The elements in box 1 are used to produce Trevas DataStructures, filling the VTL component attributes name, role, type, nullable and valuedomain.

The elements in box 2 are used to generate the VTL code (rulesets & transformations).

Tools available

SDMX Trevas tools are documented here.

Troubleshooting

Have a look at this section.

Trevas - Temporal operators

· 3 min read
Hadrien Kohl
Hadrien Kohl Consulting - Developer

Temporal operators in Trevas

Version 1.4.1 of Trevas introduces preliminary support for date and time types and operators.

The specification describes temporal types such as date, time_period, time, and duration. However, the Trevas authors find these descriptions unsatisfactory. This blog post outlines our implementation choices and how they differ from the spec.

In the specification, time_period (and the date type) is described as a compound type with a start and an end (or a start and a duration). This complicates the implementation and brings little value to the language, as one can simply operate on a combination of dates, or on a date and a duration, directly. For this reason, we defined an algebra between the temporal types and did not yet implement time_period.

| result (operators) | date        | duration        | number       |
|--------------------|-------------|-----------------|--------------|
| date               | n/a         | date (+, -)     | n/a          |
| duration           | date (+, -) | duration (+, -) | duration (*) |
| number             | n/a         | duration (*)    | n/a         |

The period_indicator function relies on period-awareness, and the types involved are not yet defined precisely enough for it to be implemented.

Java mapping

The VTL type date is represented internally as the types java.time.Instant, java.time.ZonedDateTime and java.time.OffsetDateTime.

Instant represents a specific moment in time. Note that this type does not include time zone information and is therefore not usable with all the operators. One can use the types ZonedDateTime and OffsetDateTime when time zone or daylight saving information is required.

The VTL type duration is represented internally as the type org.threeten.extra.PeriodDuration from the ThreeTen-Extra package. It represents a duration using both calendar units (years, months, days) and a temporal amount (hours, minutes, seconds and nanoseconds).
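
As a small illustration of this mapping (the values are chosen arbitrarily), the algebra above translates directly to these Java types:

import java.time.Duration;
import java.time.Period;
import java.time.ZonedDateTime;
import org.threeten.extra.PeriodDuration;

public class TemporalAlgebraDemo {
    public static void main(String[] args) {
        // A VTL duration: 1 month and 12 hours, mixing calendar and clock units
        PeriodDuration dur = PeriodDuration.of(Period.ofMonths(1), Duration.ofHours(12));

        // date + duration -> date (see the algebra table above)
        ZonedDateTime date = ZonedDateTime.parse("2024-01-31T00:00:00+01:00[Europe/Paris]");
        ZonedDateTime shifted = date.plus(dur); // 2024-02-29T12:00+01:00, calendar-aware

        // duration * number -> duration
        PeriodDuration tripled = dur.multipliedBy(3); // 3 months and 36 hours
    }
}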

Function flow_to_stock

The flow_to_stock function converts a data set with flow interpretation into a stock interpretation. This transformation is useful when you want to aggregate flow data (e.g., sales or production rates) into cumulative stock data (e.g., total inventory).

Syntax:

result := flow_to_stock(op)

Parameters:

  • op - The input data set with flow interpretation. The data set must have an identifier of type time, additional identifiers, and at least one measure of type number.

Result:

The function returns a data set with the same structure as the input, but with the values converted to stock interpretation.
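
For instance, assuming a single monthly series with illustrative values, flow_to_stock turns each value into a running total:

| time    | var1 (flow, input) | var1 (stock, output) |
|---------|--------------------|----------------------|
| 2024-01 | 2                  | 2                    |
| 2024-02 | 3                  | 5                    |
| 2024-03 | 5                  | 10                   |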

Function stock_to_flow

The stock_to_flow function converts a data set with stock interpretation into a flow interpretation. This transformation is useful when you want to derive flow data from cumulative stock data.

Syntax:

result := stock_to_flow(op)

Parameters:

  • op - The input data set with stock interpretation. The data set must have an identifier of type time, additional identifiers, and at least one measure of type number.

Result:

The function returns a data set with the same structure as the input, but with the values converted to flow interpretation.
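
Conversely, with the same illustrative series, stock_to_flow restores the flows by differencing consecutive periods: stocks 2, 5, 10 become flows 2, 3, 5 (the first period is returned unchanged).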

Function timeshift

The timeshift function shifts the time identifiers of the time series in the data set by a given number of periods. This is useful for analyzing data at different time offsets, such as comparing current values to past values.

Syntax:

result := timeshift(op, shiftNumber)

Parameters:

  • op - The operand data set containing time series.
  • shiftNumber - An integer representing the number of periods to shift. Positive values shift forward in time, while negative values shift backward.

Result:

The function returns a data set with the time identifiers shifted by the specified number of periods.
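
For instance, assuming a monthly series holding 2 in 2024-01 and 3 in 2024-02, timeshift(ds, 1) yields 2 in 2024-02 and 3 in 2024-03, while timeshift(ds, -1) shifts both values one period backward.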

Trevas - Java 17

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.2.0 enables Java 17 support.

Java modules handling

Spark does not support Java modules.

Java 17 client apps embedding Trevas in Spark mode have to configure UNNAMED modules for Spark.

Maven

Add to your pom.xml file, in the build > plugins section:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.11.0</version>
    <configuration>
        <compilerArgs>
            <arg>--add-exports</arg>
            <arg>java.base/sun.nio.ch=ALL-UNNAMED</arg>
        </compilerArgs>
    </configuration>
</plugin>

Docker

ENTRYPOINT ["java", "--add-exports", "java.base/sun.nio.ch=ALL-UNNAMED", "mainClass"]

Trevas - Persistent assignments

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.2.0 includes support for persistent assignments: ds1 <- ds;.

In Trevas, persistent datasets are represented as PersistentDataset.

Handle PersistentDataset

Trevas datasets are represented as Dataset.

After running the Trevas engine, you can use persistent datasets with something like:

Bindings engineBindings = engine.getContext().getBindings(ScriptContext.ENGINE_SCOPE);
engineBindings.forEach((k, v) -> {
    if (v instanceof PersistentDataset) {
        fr.insee.vtl.model.Dataset ds = ((PersistentDataset) v).getDelegate();
        if (ds instanceof SparkDataset) {
            Dataset<Row> sparkDs = ((SparkDataset) ds).getSparkDataset();
            // Do what you want with sparkDs
        }
    }
});

Trevas - check_hierarchy

· 2 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.1.0 includes hierarchical validation via the define hierarchical ruleset and check_hierarchy operators.

Example

Input

ds1:

| id  | Me  |
|-----|-----|
| ABC | 12  |
| A   | 1   |
| B   | 10  |
| C   | 1   |
| DEF | 100 |
| E   | 99  |
| F   | 1   |
| HIJ | 100 |
| H   | 99  |
| I   | 0   |

VTL script

// Ensure ds1 metadata definition is good
ds1 := ds1[calc identifier id := id, Me := cast(Me, integer)];

// Define hierarchical ruleset
define hierarchical ruleset hr (variable rule Me) is
    My_Rule : ABC = A + B + C errorcode "ABC is not sum of A,B,C" errorlevel 1;
    DEF = D + E + F errorcode "DEF is not sum of D,E,F";
    HIJ : HIJ = H + I - J errorcode "HIJ is not H + I - J" errorlevel 10
end hierarchical ruleset;

// Check hierarchy
ds_all := check_hierarchy(ds1, hr rule id all);
ds_all_measures := check_hierarchy(ds1, hr rule id always_null all_measures);
ds_invalid := check_hierarchy(ds1, hr rule id always_zero invalid);
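
In these calls, always_null and always_zero are input modes governing how missing terms (such as D or J) are treated: as null or as zero. The final keyword selects the output: invalid keeps only the rows that violate a rule (without bool_var), all returns one row per evaluated rule with its boolean result, and all_measures additionally keeps the input measures, as the tables below show.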

Outputs

  • ds_all

| id  | ruleid  | bool_var | errorcode | errorlevel | imbalance |
|-----|---------|----------|-----------|------------|-----------|
| ABC | My_Rule | true     | null      | null       | 0         |

  • ds_all_measures

| id  | Me  | ruleid  | bool_var | errorcode | errorlevel | imbalance |
|-----|-----|---------|----------|-----------|------------|-----------|
| ABC | 12  | My_Rule | true     | null      | null       | 0         |
| DEF | 100 | hr_2    | null     | null      | null       | null      |
| HIJ | 100 | HIJ     | null     | null      | null       | null      |

  • ds_invalid

| id  | Me  | ruleid | errorcode            | errorlevel | imbalance |
|-----|-----|--------|----------------------|------------|-----------|
| HIJ | 100 | HIJ    | HIJ is not H + I - J | 10         | 1         |

Trevas Batch 0.1.1

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas Batch 0.1.1 uses version 1.0.2 of Trevas.

This Java batch provides Trevas execution metrics in Spark mode.

The configuration file to fill in is described in the README of the project. Launching the batch will produce a Markdown file as output.

Launch

Local

java -jar trevas-batch-0.1.1.jar -Dconfig.path="..." -Dreport.path="..."

The Java execution will run in a local Spark context.

Kubernetes

Default Kubernetes objects are defined in the .kubernetes folder.

Fill in the config-map.yml file, then launch the job in your cluster.

Trevas Jupyter 0.3.2

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas Jupyter 0.3.2 uses version 1.0.2 of Trevas.

News

In addition to the VTL coverage, which has greatly increased since the publication of Trevas 1.x.x, Trevas Jupyter offers one new connector:

  • SAS files (via the loadSas method)

Launch

Manually adding the Trevas Kernel to an existing Jupyter instance

  • Compile the Trevas Jupyter project
  • Copy the kernel.json file and the bin and repo folders to a new kernel folder.
  • Edit the kernel.json file
  • Launch Jupyter

Docker

docker pull inseefrlab/trevas-jupyter:0.3.2
docker run -p 8888:8888 inseefrlab/trevas-jupyter:0.3.2

Helm

The Trevas Jupyter Docker image can be instantiated via the jupyter-pyspark Helm contract from InseeFrLab.

Trevas Lab 0.3.3

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas Lab 0.3.3 uses version 1.0.2 of Trevas.

News

In addition to the VTL coverage, which has greatly increased since the publication of Trevas 1.x.x, Trevas Lab offers two new connectors:

  • SAS files
  • JDBC MariaDB

Launch

Kubernetes

Sample Kubernetes objects are available in the .kubernetes folders of Trevas Lab and Trevas Lab UI.