Inital README.md

diff-output-compression
haavee 9 months ago
parent
commit
50783498c0
1 changed files with 440 additions and 0 deletions

# JIVE-toolchain-verify

At JIVE several data conversions take place. The most-used ones are from correlator output to CASA
MeasurementSet v2[^1] (`j2ms2`), and from MeasurementSet v2 to FITS-IDI, the FITS Interferometry Data
Interchange format[^2] (`tConvert`).

No tools existed yet to verify whether those data format transformation
utilities (`j2ms2`, `tConvert`) actually did their work correctly, i.e. without losing, adding, or corrupting (meta-)data.

This repository contains a number of tools that can be used to compare
different data formats and test important quantities for equality, in order to assess
loss, addition or changes in those.

## Getting Started

After
```bash
> cd /path/to/somewhere
> git clone <this repo>
```

several Python2/3 compatible scripts become available:
```
check-multipart-fits.py
compare-ms-idi.py
compare-ms-idi-meta.py
fix-uvf.py
```

whose purpose will be explained below.

A script can be run with a specific (e.g. locally installed) Python
interpreter like so:
```bash
> /path/to/python-X.Y.Z /path/to/compare-ms-idi.py [options]
```

# check-multipart-fits.py

Checks if there is data loss between the boundaries of subsequent IDI files
produced from a single MeasurementSet. Reports how much observing time (in
seconds) was lost across all boundaries.

## Usage

The script does not have a `-h`/`--help` command line option. In fact it has
no options at all.

Run as follows:
```bash
> check-multipart-fits.py '/path/to/exp/*.IDI*' '/path/to/other/*.IDI*'
```
and wait. The single quotes are necessary so that the script, rather than
the shell, expands the wildcards.

Note that hierarchical wildcards are supported too to dig through
directories of directories (etc) of FITS-IDI files:
```bash
> check-multipart-fits.py '/path/to/*/*/*.IDI*'
```
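Expanding quoted patterns inside a script can be sketched with Python's standard `glob` module. This is only an illustration of the idea, assuming the script does something along these lines; `expand_patterns` is a hypothetical helper name and is not taken from `check-multipart-fits.py`.

```python
import glob


def expand_patterns(patterns):
    """Expand shell-style patterns and return a sorted list of unique paths.

    Hypothetical helper, for illustration only.
    """
    paths = set()
    for pattern in patterns:
        # glob.glob accepts '*' in every path component, so a pattern such
        # as '/path/to/*/*/*.IDI*' descends two directory levels.
        paths.update(glob.glob(pattern))
    return sorted(paths)
```

Because the shell never sees the unquoted wildcard, the number of matching files is not bounded by the shell's command line length.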

### Background
JIVE limits FITS-IDI files to <= 2 GB per file in order not to upset
32-bit software. If a MeasurementSet is larger than this, `tConvert` will
break it up into chunks of at most 2 GB each, naming the files as follows:

```bash
> tConvert input.ms OUTPUT.IDI
...
> ls OUTPUT*
OUTPUT.IDI1
OUTPUT.IDI2
...
OUTPUT.IDIn
```
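Any tool that walks through these chunks in sequence needs them in numeric order, because a plain alphabetical sort places `OUTPUT.IDI10` before `OUTPUT.IDI2`. A minimal sketch of such an ordering; the helper names are hypothetical and not taken from the scripts in this repository.

```python
import re


def chunk_number(path):
    """Return the integer n of a file name ending in '.IDIn'."""
    match = re.search(r"\.IDI(\d+)$", path)
    if match is None:
        raise ValueError("not a numbered FITS-IDI chunk: %s" % path)
    return int(match.group(1))


def sort_chunks(paths):
    """Order chunk files by chunk number rather than alphabetically."""
    return sorted(paths, key=chunk_number)
```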

Due to a bug/erroneous assumption made inside tConvert, an unfixed version
could trigger data loss between `OUTPUT.IDIm` and `OUTPUT.IDI(m+1)`.

This tool compares the first time stamp in chunk `m+1` with the last time stamp
in chunk `m` and sums the differences across all `n-1` chunk boundaries.

There is a caveat: if a scan boundary/gap happens to be such that _all_ data from
the last scan before the gap is in chunks `<=m` and the data for the first scan
after the gap is in chunk `m+1`, that would count as a loss of `gap
duration` seconds.

However, the chances of that are minimal and such occurrences have not yet been
observed while checking the whole EVN archive (approx. 111 TB of FITS files at the
time of writing this).
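The boundary check described above can be sketched in plain Python. This is not the script's actual code: it assumes the first and last time stamp of every chunk have already been extracted (in seconds, in chunk order), and the `integration_time` parameter is an assumption of this sketch that allows for the normal spacing between two consecutive time stamps.

```python
def lost_time(chunks, integration_time=0.0):
    """Sum the time gaps across consecutive chunk boundaries.

    chunks: list of (first_timestamp, last_timestamp) tuples in seconds,
            one per FITS-IDI chunk, already in chunk order.
    integration_time: expected spacing between consecutive time stamps
            (an assumption of this sketch); gaps up to this size are
            not counted as loss.
    """
    total = 0.0
    # Pair every chunk with its successor: n chunks give n-1 boundaries.
    for (_, last_prev), (first_next, _) in zip(chunks, chunks[1:]):
        gap = first_next - last_prev - integration_time
        if gap > 0:
            total += gap
    return total
```

For example, `lost_time([(0.0, 100.0), (102.0, 200.0), (210.0, 300.0)], integration_time=2.0)` counts nothing at the first boundary and 8 seconds at the second. A genuine scan gap that coincides with a chunk boundary would be counted in the same way, which is exactly the caveat mentioned above.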


# compare-ms-idi.py

Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files, or `j2ms2`'s operation by
comparing different MeasurementSets. The script collects
the integrated weights for all (baseline, source) combinations ("key"
hereafter), and counts how many integrations are found for each key.

If more than one data set is specified it will report:
- extra keys that are not common to all data sets, and in which data set
these occur
- for each key common to all given data sets, the integrated weights are
compared and differences are reported. If those numbers are not equal, some data
is missing from some of the data set(s); it is reported which value was
found in which data set.
- duplicated time stamps for a key, which mean that the same data was
present in the data set more than once; these occurrences are reported
too
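The three kinds of report listed above can be sketched as follows. This is a simplified illustration, not the script's implementation: it assumes every data set has already been reduced to `(baseline, source, timestamp, weight)` rows, one per integration, and the function names are hypothetical.

```python
from collections import defaultdict


def summarize(rows):
    """Accumulate per-key totals from (baseline, source, timestamp, weight) rows.

    Returns {key: (integrated_weight, n_integrations, n_duplicate_timestamps)}.
    """
    weights = defaultdict(float)
    counts = defaultdict(int)
    seen = defaultdict(set)
    duplicates = defaultdict(int)
    for baseline, source, timestamp, weight in rows:
        key = (baseline, source)
        weights[key] += weight
        counts[key] += 1
        if timestamp in seen[key]:
            # The same time stamp showing up again means duplicated data.
            duplicates[key] += 1
        seen[key].add(timestamp)
    return {k: (weights[k], counts[k], duplicates[k]) for k in counts}


def multiway_diff(datasets):
    """Compare {name: summary} and return (extra_keys, differing_keys)."""
    key_sets = [set(summary) for summary in datasets.values()]
    common = set.intersection(*key_sets) if key_sets else set()
    # Keys that are not shared by every data set, listed per data set.
    extra = {name: sorted(set(summary) - common)
             for name, summary in datasets.items()
             if set(summary) - common}
    differing = {}
    for key in sorted(common):
        # Compare integrated weight and integration count only.
        values = {name: summary[key][:2] for name, summary in datasets.items()}
        if len(set(values.values())) > 1:
            differing[key] = values
    return extra, differing
```

Weights are compared for exact equality here for brevity; a real comparison of floating point sums may need a tolerance.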

Usage:
```bash
> compare-ms-idi.py --ms /path/to/input.ms --idi /path/to/output.idi*
```

Note: the wildcards should not be escaped in this case.

Multiple `--idi` and/or `--ms` options are supported. The script effectively performs a multiway diff across _all_ datasets given on the command line. It is also possible to compare only MeasurementSets, or only FITS-IDI files - the script doesn't care, even when they're (partially) disjoint.

If only one data set is passed on the command line no diff will be computed
but the found keys are displayed - effectively summarizing the data set for
exposure time per baseline per source.

The script has `-h`/`--help`.


### Example output

Comparing two MeasurementSets that apparently compare equal despite being created
by two different versions of `j2ms2` - the changes in `j2ms2` have not
affected the data in the produced MeasurementSet:

```bash
> compare-ms-idi.py --ms rsm02-prod-j2ms2.ms --ms rsm02-antdiam-j2ms2.ms
Successful readonly open of default-locked table rsm02-prod-j2ms2.ms: 22 columns, 94752 rows
Successful readonly open of default-locked table rsm02-antdiam-j2ms2.ms: 22 columns, 94752 rows
Checked 2 data sets, 90 common keys
>
```

This is a more interesting case - an instrumented one.
For one experiment a subset of the data was converted to a second (much)
smaller MeasurementSet, which was subsequently converted to FITS-IDI as
well.

The comparison covers the full MeasurementSet, the partial MeasurementSet
and the FITS-IDI file produced from the partial MeasurementSet.

```bash
> compare-ms-idi.py --ms rsm02-dev.ms \
    --ms rsm02-prod-j2ms2.ms \
    --idi RSM02-PRODJ2MS2.IDI
```

It produces a lot of output, which can be summarized in three sections:

### The extra keys

The full data set had some stations coming in later and/or sources that were
only observed later than those present in the partial MeasurementSet. As
such there are keys that are only present in the full MeasurementSet.

These types of keys - those that cannot be compared - are listed first:
```
==== Problem report ====
MS: rsm02-dev.ms
Extra keys:
('T6T6', 'J1310+3220') found 89 times
('JbTr', 'J1310+3220') found 89 times
('McYs', 'J1310+3220') found 89 times
...
```

### The common keys that have different values

For keys whose integrated weight differs between any of the data sets, the
following is displayed:

```
('EfTr', 'J1427+2632') :
6656.00s wgt= 53246.27 3328 times in MS: rsm02-dev.ms
416.00s wgt= 3328.00 208 times in MS: rsm02-prod-j2ms2.ms
416.00s wgt= 3328.00 208 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1419+2706') :
2550.00s wgt= 40775.69 1275 times in MS: rsm02-dev.ms
152.00s wgt= 2432.00 76 times in MS: rsm02-prod-j2ms2.ms
152.00s wgt= 2432.00 76 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1427+2632') :
6448.00s wgt=103158.98 3224 times in MS: rsm02-dev.ms
208.00s wgt= 3328.00 104 times in MS: rsm02-prod-j2ms2.ms
208.00s wgt= 3328.00 104 times in IDI: RSM02-PRODJ2MS2.IDI*
```
This output shows that the data from the partial MeasurementSet and the
corresponding FITS-IDI file is consistent, but that a significant amount of
data is missing compared to the full MeasurementSet.

### The summary line

The tool always displays a one-line summary:
```
Checked 3 data sets, 90 common keys with 90 problems identified and 108
non-common keys in 1 formats
```


# compare-ms-idi-meta.py

This script is similar to `compare-ms-idi.py` in operation, only it
compares the meta data in all given data sets:
- antenna properties (position,
offset, mount, diameter)
- source properties (position)
- spectral windows (lowest, highest frequency, number of spectral channels)

Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files, and/or to check that different
`j2ms2` versions produce the same meta data in the MeasurementSets (or to explicitly verify expected differences!).

The keys used to compare the meta data are antenna name, source name and lowest frequency of
spectral window.

If only one data set is given, or there are no keys with more than one
value, all keys and their associated properties are just displayed. This
mode can be used for quick meta data inspection.

If more than one data set is specified the tool will report:
- extra keys that are not common to all data sets, in which data set they
occur and how many times
- for each key common to all given data sets, a comparison of the properties
found in each data set. Note that since each FITS-IDI chunk (see above)
has its own antenna, source and frequency table, the tool treats each
individual `*.IDIn` chunk as an individual data set in order to verify that
_all_ FITS-IDI chunks contain the same meta data.
The report compresses the multi-way diff by taking one reference value
(indicating which data set it came from), computing the diff with respect to all
other values and aggregating data sets by equivalent diff.

Only the properties that are different are displayed, together with the data
set(s) in which these specific values were found.
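
The compression of the multi-way diff can be sketched as grouping data sets by their diff against a reference. This is an illustration under stated assumptions, not the tool's code: properties are assumed to be plain dictionaries with hashable values, the first data set supplies the reference, and the function and data set names are hypothetical.

```python
def diff_properties(reference, other):
    """Return a hashable description of the properties where `other` differs.

    Each entry is (property_name, value_in_other, value_in_reference).
    """
    return tuple(sorted(
        (name, other.get(name), reference[name])
        for name in reference
        if other.get(name) != reference[name]
    ))


def group_by_diff(values):
    """Aggregate data sets by their diff against the first (reference) data set.

    values: list of (dataset_name, properties_dict) pairs; the first entry
            supplies the reference value.
    Returns (reference_name, {diff: [names of data sets sharing that diff]}).
    """
    reference_name, reference = values[0]
    groups = {}
    for name, properties in values[1:]:
        diff = diff_properties(reference, properties)
        groups.setdefault(diff, []).append(name)
    return reference_name, groups
```

Data sets that differ from the reference in exactly the same way end up under one diff, so each unique diff is printed only once, however many data sets share it.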


The usage is exactly the same as `compare-ms-idi.py` so only the different
output is shown.

### An extreme example

A use case for this appeared recently: a request to propagate the antenna
diameter from the VEX file into the MeasurementSet and consequently into
the FITS-IDI file(s).

A VEX file was hand-edited to supply an antenna diameter of 100.0 m for the
Ef antenna.

In total six files were created:
- a MeasurementSet by the current production `j2ms2`, which does not handle antenna diameter at all
- the same by a fixed `j2ms2` that should propagate it from VEX to MeasurementSet
- each MeasurementSet was converted to FITS-IDI by two versions of
`tConvert`:
  - the current production `tConvert`, which does not handle antenna
  diameter at all
  - the new one that should propagate the antenna diameter from
  MeasurementSet to FITS-IDI


Running the six-way meta data diff with, for good measure, a (partially)
disjoint different data set thrown in to highlight the tool's operation, results in a seven-way meta data diff in one go:

```bash
> compare-ms-idi-meta.py --ms rsm02-prod-j2ms2.ms \
    --ms rsm02-antdiam-j2ms2.ms \
    --idi RSM02-PRODJ2MS2-PRODTCONVERT.IDI \
    --idi RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI \
    --idi RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI \
    --idi RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI \
    --idi ../../eg063d/eg063-prod-dev.IDI*
```

This yields a fair bit of output.

### The extra keys

As with `compare-ms-idi.py`, any keys not common across all data sets are listed first.
Their values are not reported.
```
==== Problem report ====
IDI: ../../eg063d: eg063-prod-dev.IDI*
Extra keys:
('antenna', 'Ar') found 3 times
... snip ...
('frequency', '4926990000.000') found 3 times
... snip ...
('source', '0133+476') found 3 times
... snip ...
```

### The diff report

The most interesting output is for the Ef station - for that is the one
station where the antenna diameter was added in the VEX file. It is expected
that propagation of this value into the actual FITS-IDI file(s) depends on
specifically which combination of `j2ms2` and `tConvert` versions is used.

This is what the tool reports for the key `(antenna, Ef)`:

```
('antenna', 'Ef') :
ANT Ef: xyz=[4033947.2616 486990.7866 4900430.9915] d=0.0 (0)
offset=[0.0145 0. 0. ] mount=alt-az found in:
MS: rsm02-prod-j2ms2.ms
DIFF: diameter: 100.0 vs 0.0 found in: (1)
MS: rsm02-antdiam-j2ms2.ms
DIFF: mount: UnknownMNTSTA#0 vs alt-az found in: (2)
IDI: RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: diameter: 100.0 vs 0.0, mount: UnknownMNTSTA#0 vs alt-az found in: (3)
IDI: RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: offset: [0.013 0. 0. ] vs [0.0145 0. 0. ], mount: (4)
UnknownMNTSTA#0 vs alt-az found in:
IDI: ../../eg063d/eg063-prod-dev.IDI[1-3]
```

Comparing the reference value (0) - the first value found, in
`rsm02-prod-j2ms2.ms` - to the values found in the other data sets, four (4)
unique diffs are found.

- (0) The reference value. It contains the default antenna diameter of 0.0,
which is used if the diameter is not found in the data set or if the default
of 0.0 was written by code that does not propagate the actual value.

- (1) The antenna diameter aware `j2ms2` has propagated the 100.0 m diameter
to the MeasurementSet. That is apparently the only diff between the two
MeasurementSets: completely as expected.

- (2) The two versions of `tConvert` can produce FITS-IDI files with the same
differences. The diameter unaware `tConvert` does not write the diameter to the
FITS-IDI file, and the antenna diameter aware `tConvert`, when converting a
MeasurementSet that was produced by the diameter unaware `j2ms2`, also
writes the default value of 0.0. Again this is expected.

- (3) This is the fully updated antenna diameter toolchain: antenna diameter aware
`j2ms2` and `tConvert`. The difference with the reference data set
includes the 100.0 m diameter being found in this FITS-IDI file.

- (4) The rather disjoint FITS-IDI files from a different experiment thrown
in show that the station offset used in experiment EG063 was different
from the one used in the reference data set.

Finally, it shows that the station mount from the MeasurementSet ("alt-az")
does not get translated to a known FITS-IDI enumerated value (see the FITS-IDI standard[^2]).
This may need looking into.

# Prerequisites

- Python casacore module:
`pyrap` (casacore < 2.0), or
`python-casacore` (casacore >= 2.0)

The `python-casacore` package can easily be installed or built from
source, see [the project's github page](https://github.com/casacore/python-casacore)


- `astropy`, for `astropy.io.fits` only
See [the astropy
documentation](https://docs.astropy.org/en/stable/install.html).
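
Supporting both casacore bindings usually comes down to trying one import and falling back to the other. A minimal sketch, assuming the table interface is what is needed (`casacore.tables` in `python-casacore`, `pyrap.tables` in the older `pyrap`); the helper name `first_available` is hypothetical and the scripts in this repository may handle this differently.

```python
import importlib


def first_available(module_names):
    """Return the first importable module from `module_names`, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer python-casacore (casacore >= 2.0); fall back to the older pyrap.
tables = first_available(["casacore.tables", "pyrap.tables"])
if tables is None:
    print("No casacore Python bindings found; install python-casacore")
```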

Locally, we have good experiences installing `anaconda` ([use the free
option](https://www.anaconda.com/pricing)) and adding the
packages to that, which is a very quick way of setting up your system with
a batteries-included scientific software suite.



[^1]: https://casa.nrao.edu/Memos/229.html
[^2]: https://fits.gsfc.nasa.gov/registry/fitsidi.html
