
Documentation

README.md listing all tools in this repo and a short paragraph about what
they do.

Tool-specific documentation lives in a per-tool .md file (e.g. compare-ms-idi.md), linked from README.md, containing usage notes, a brief explanation of what the tool checks, and annotated example output.
diff-output-compression
haavee 9 months ago
parent
commit
f28a091526
4 changed files with 344 additions and 355 deletions
  1. +27
    -355
      README.md
  2. +54
    -0
      check-multipart-fits.md
  3. +150
    -0
      compare-ms-idi-meta.md
  4. +113
    -0
      compare-ms-idi.md

+ 27
- 355
README.md

@@ -1,8 +1,8 @@
# JIVE-toolchain-verify

At JIVE several data conversions take place. The most-used ones are from correlator output to CASA
MeasurementSet v2[^1] (`j2ms2`), and from MeasurementSet v2 to FITS-IDI (`tConvert`), the FITS Interferometry Data
Interchange, format[^2].
MeasurementSet v2<sup>[1](#msv2)</sup> (`j2ms2`), and from MeasurementSet v2 to FITS-IDI (`tConvert`), the FITS Interferometry Data
Interchange format<sup>[2](#fitsidi)</sup>.

No tools existed yet to verify whether those data format transformation
utilities (`j2ms2`, `tConvert`) actually did their work correctly - i.e. without losing, adding, or corrupting (meta-)data.
@@ -35,62 +35,30 @@ so
> /path/to/python-X.Y.Z /path/to/compare-ms-idi.py [options]
```

# check-multipart-fits.py
# `fix-uvf.py`

Checks if there is data loss between the boundaries of subsequent IDI files
produced from a single MeasurementSet. Reports how much observing time (in
seconds) was lost across all boundaries.

## Usage

The script does not have a `-h`/`--help` command line option. In fact it has
no options at all.

Run as follows:
```bash
> check-multipart-fits.py '/path/to/exp/*.IDI*' '/path/to/other/*.IDI*'
```
and wait. The `'`'s are necessary in order to let the script expand the
wildcards in stead of the shell.
Not a checking tool, actually. It is used to fix UV-FITS files.

Note that hierarchical wildcards are supported too to dig through
directories of directories (etc) of FITS-IDI files:
```bash
> check-multipart-fits.py '/path/to/*/*/*.IDI*'
> fix-uvf.py /path/to/file.UVF
```

### Background
JIVE limits the IDI fits files to <= 2 GB per IDI file in order to not upset
32-bit software. If a MeasurementSet is larger than this, `tConvert` will
break it up into chunks of at must 2 GB / piece, naming the files

```bash
> tConvert input.ms OUTPUT.IDI
...
> ls OUTPUT*
OUTPUT.IDI1
OUTPUT.IDI2
...
OUTPUT.IDIn
```
The script makes the following edits in place in `file.UVF`:
- Renames the `FREQSEL` column to `FRQSEL`, since that is what AIPS looks for
- Adds the `SORTORDR = 'TB'` header keyword. JIVE has always written data in time-baseline order but failed to indicate that in the file header, causing an unnecessary re-sort after delivery to the PI.

Due to a bug/erroneous assumption made inside tConvert, an unfixed version
could trigger data loss between `OUTPUT.IDIm` and `OUTPUT.IDI(m+1)`.
UV-FITS is a deprecated flavour of FITS for radio-interferometric data transport. The format has no proper documentation and the data cannot be broken into smaller chunks: all data is in one (1) file.

This tool checks the time stamps at start of `m+1` with the last time stamp
in `m` and sums up all the differences across all `m-1` boundaries.

There is a caveat that if a scan boundary/gap would be such that _all_ data from
the last scan before the gap is in chunks `<=m` and data for the first scan
after the gap is in chunk `m+1`, that would count as a loss of `gap
duration` seconds.
# `check-multipart-fits.py`

However the chances of that are minimal and such occurrences have not yet been
observed checking the whole EVN archive (approx. 111 TB of FITS files at the
time of writing this).
Checks if there is data loss between the boundaries of subsequent IDI files
produced from a single MeasurementSet. Reports how much observing time (in
seconds) was lost across all boundaries.

Full documentation and explanation is in [check-multipart-fits.md](check-multipart-fits.md)

# compare-ms-idi.py
# `compare-ms-idi.py`

Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files or `j2ms2` operation by
@@ -98,118 +66,13 @@ comparing different MeasurementSets. This script collects
the integrated weights for all (baseline, source) combinations ("key"
hereafter), and counts how many times each integration is present.

If more than one data set is specified it will report:
- extra keys that are not common to all data sets, and in which data set
these occur
- for each key common across all given data sets compares the integrated
weights and report differences. If those numbers are not equal some data
is missing in some of the data set(s) - it is reported which value was
found in which dataset.
- if a key has duplicated time stamps this means that the same data was
present in the data set more than once and these occurrences are reported
too

Usage:
```bash
> compare-ms-idi.py --ms /path/to/input.ms --idi /path/to/output.idi*
```

Note: the wildcards should not be escaped in this case.

Multiple `--idi` and/or `--ms` options are supported. The script effectively performs a multiway diff across _all_ datasets given on the command line. It is also possible to compare only MeasurementSets, or only FITS-IDI files - the script doesn't care, even when they're (partially) disjoint.

If only one data set is passed on the command line no diff will be computed
but the found keys are displayed - effectively summarizing the data set for
exposure time per baseline per source.

The script has `-h`/`--help`.


### example output

Comparing two MeasurementSets that apparently compare equal despite being created
by two different versions of `j2ms2` - the changes in `j2ms2` have not
affected the data in the produced MeasurementSet:

```bash
> compare-ms-idi.py --ms rsm02-prod-j2ms2.ms --ms rsm02-antdiam-j2ms2.ms
Successful readonly open of default-locked table rsm02-prod-j2ms2.ms: 22 columns, 94752 rows
Successful readonly open of default-locked table rsm02-antdiam-j2ms2.ms: 22 columns, 94752 rows
Checked 2 data sets, 90 common keys
>
```

This is a more interesting case - an instrumented one.
For one experiment a subset of the data was converted to a second (much)
smaller MeasurementSet, which was subsequently converted to FITS-IDI as
well.
Full documentation and explanation is in [compare-ms-idi.md](compare-ms-idi.md)

The comparison compared the full MeasurementSet, the partial MeasurementSet
and the FITS-IDI file produced from the partial MeasurementSet.

```bash
> compare-ms-idi.py --ms rsm02-dev.ms
--ms rsm02-prod-j2ms2.ms
--idi RSM02-PRODJ2MS2.IDI
```

It produces a lot of output, which can be summarized in three sections:

### The extra keys

The full data set had some stations coming in later and/or sources that were
only observed later than those present in the partial MeasurementSet. As
such there are keys that are only present in the full MeasurementSet.

These types of keys - those that cannot be compared - are listed first:
```
==== Problem report ====
MS: rsm02-dev.ms
Extra keys:
('T6T6', 'J1310+3220') found 89 times
('JbTr', 'J1310+3220') found 89 times
('McYs', 'J1310+3220') found 89 times
...
```

### The common keys that have different values

For keys whose integrated weigth differs between any of the data sets, the
following is displayed:

```
('EfTr', 'J1427+2632') :
6656.00s wgt= 53246.27 3328 times in MS: rsm02-dev.ms
416.00s wgt= 3328.00 208 times in MS: rsm02-prod-j2ms2.ms
416.00s wgt= 3328.00 208 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1419+2706') :
2550.00s wgt= 40775.69 1275 times in MS: rsm02-dev.ms
152.00s wgt= 2432.00 76 times in MS: rsm02-prod-j2ms2.ms
152.00s wgt= 2432.00 76 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1427+2632') :
6448.00s wgt=103158.98 3224 times in MS: rsm02-dev.ms
208.00s wgt= 3328.00 104 times in MS: rsm02-prod-j2ms2.ms
208.00s wgt= 3328.00 104 times in IDI: RSM02-PRODJ2MS2.IDI*
```
It also shows that the data from the partial MeasurementSet and the
corresponding FITS-IDI file is consistent but that a significant amount of
data is missing compared to the full MeasurementSet.

### The summary line

The tool always displays a one-line summary:
```
Checked 3 data sets, 90 common keys with 90 problems identified and 108
non-common keys in 1 formats
```


# compare-ms-idi-meta.py
# `compare-ms-idi-meta.py`

This script is similar to `compare-ms-idi.py` in operation, only it
compares the meta data in all given data sets:
- antenna properties (position,
offset, mount, diameter)
- antenna properties (position, offset, mount, diameter)
- source properties (position)
- spectral windows (lowest, highest frequency, number of spectral channels)

@@ -217,146 +80,17 @@ Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files and/or check different
`j2ms2` versions producing the same (or explicitly verify expected different!) meta data in the MeasurementSets.

The keys used to compare the meta data are antenna name, source name and lowest frequency of
spectral window.

If only one data set is given, or there are no keys with more than one
value, all keys and their associated properties are just displayed. This
mode can be used for quick meta data inspection.

If more than one data set is specified the tool will report:
- extra keys that are not common to all data sets, and in which data set they
occur and how many times
- for each keys common across all given data sets compares the properties
found in each data set. Note that since eacht FITS-IDI chunk (see above)
has its own antenna, source and frequency table; the tool treats each
individual `*.IDIn` chunk as individual data set in order to verify that
_all_ FITS-IDI chunks contain the same meta data.
The report compresses the multi-way diff by taking one reference value
(indicating which data set it came from), computes the diff wrt to all
other values and aggregates data sets by equivalent diff.

Only the properties that are different are displayed and in which data
set(s) these specific values were found


The usage is exactly the same as `compare-ms-idi.py` so only the different
output is shown.

### An extreme example
Full documentation and explanation is in [compare-ms-idi-meta.md](compare-ms-idi-meta.md)

A use case for this appeared recently. A request to propagate the antenna
diameter from the VEX file, into the MeasurementSet and consequently into
the FITS-IDI file(s) was requested.

A VEX file was hand-edited to supply an antenna diameter of 100.0 m for the
Ef antenna.

In total six files were created:
- MeasurementSet by current production `j2ms2` which does not handle antenna diameter at all
- id. by a fixed `j2ms2` that should propagate from VEX to MeasurementSet
- Each MeasurementSet was converted to FITS-IDI by two versions of
`tConvert`:
- the current production `tConvert`, which does not handle antenna
diameter at all
- the new one that should propagate the antenna diameter from
MeasurementSet to FITS-IDI


Running the six-way meta-data diff, with, for good measures, a (partially)
disjoint different dataset thrown in to highlight the tool's operation, a seven-way meta data diff in one go:

```bash
> compare-ms-idi-meta.py --ms rsm02-prod-j2ms2.ms
--ms rsm02-antdiam-j2ms2.ms
--idi RSM02-PRODJ2MS2-PRODTCONVERT.IDI
--idi RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI
--idi RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI
--idi RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI
--idi ../../eg063d/eg063-prod-dev.IDI*
```

Yields a fair bit of output.

### The extra keys

As with `compare-ms-idi.py`, any keys not common across all data sets are listed first:
Their values are not reported.
```
==== Problem report ====
IDI: ../../eg063d: eg063-prod-dev.IDI*
Extra keys:
('antenna', 'Ar') found 3 times
... snip ...
('frequency', '4926990000.000') found 3 times
... snip ...
('source', '0133+476') found 3 times
... snip ...
```

### The diff report

The most interesting output is for the Ef station - for that is the one
station where the antenna diameter was added in the VEX file. It is expected
that propagation of this value into the actual FITS-IDI file(s) depends on
specifically which combination of `j2ms2` and `tConvert` versions is used.

This is what the tool reports for the key `(antenna, Ef)`:

```
('antenna', 'Ef') :
ANT Ef: xyz=[4033947.2616 486990.7866 4900430.9915] d=0.0 (0)
offset=[0.0145 0. 0. ] mount=alt-az found in:
MS: rsm02-prod-j2ms2.ms
DIFF: diameter: 100.0 vs 0.0 found in: (1)
MS: rsm02-antdiam-j2ms2.ms
DIFF: mount: UnknownMNTSTA#0 vs alt-az found in: (2)
IDI: RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: diameter: 100.0 vs 0.0, mount: UnknownMNTSTA#0 vs alt-az found in: (3)
IDI: RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: offset: [0.013 0. 0. ] vs [0.0145 0. 0. ], mount: (4)
UnknownMNTSTA#0 vs alt-az found in:
IDI: ../../eg063d/eg063-prod-dev.IDI[1-3]
```

Comparing the reference value (0) - the first value found, in
`rsm03-prod-j2ms2.ms`- to values found in the other data sets, four (4)
unique diffs are found.

- (0) The reference value. It contains default antenna diameter of 0.0,
which is used if the diameter is not found in the data set or the default
of 0.0 was written by code that does not propagate the actual value

- (1) The antenna diameter aware `j2ms2` has propagated the 100.0 m diameter
to the MeasurementSet - that is the only diff between the two
MeasurementSets apparently: completely to expectation

- (2) The two versions of `tConvert` can produce the same FITS-IDI file
differences. The diameter unaware `tConvert` does not write it in the
FITS-IDI file and the antenna diameter aware `tConvert` that converts a
MeasurementSet that was produced by the diameter unaware `j2ms2` also
writes the default value of 0.0. Again this is expected.

- (3) This is the fully updated antenna diameter toolchain: antenna aware
`j2ms2` and `tConvert`. The difference with the reference data set
includes the 100.0 m diameter being found in this FITS-IDI file

- (4) The rather disjoint FITS-IDI files from a different experiment thrown
in show that the station offset used in experiment EG063 was different
than the one used in the reference data set.

Finally it shows that the station mount from the MeasurementSet ("alt-az")
does not get translated to a known FITS-IDI enumerated value (see the FITS-IDI standard[^1]).
This may need looking into.

# Prerequisites

All tools require extra Python modules to access data in CASA
MeasurementSet and FITS format:

- Python casacore module:
`pyrap` (casacore < 2.0), or
`python-casacore` (casacore >= 2.0)
- `pyrap` (casacore < 2.0), or
- `python-casacore` (casacore >= 2.0)

The `python-casacore` package can easily be installed or built from
source, see [the project's github page](https://github.com/casacore/python-casacore)
@@ -372,69 +106,7 @@ packages to that - which is a very quick way of setting up your system with
a batteries-included scientific software suite.


## Running the tests

Explain how to run the automated tests for this system

### Break down into end to end tests

Explain what these tests test and why

```
Give an example
```

### And coding style tests

Explain what these tests test and why

```
Give an example
```

## Deployment

Add additional notes about how to deploy this on a live system

## Built With

* [Dropwizard](http://www.dropwizard.io/1.0.2/docs/) - The web framework
used
* [Maven](https://maven.apache.org/) - Dependency Management
* [ROME](https://rometools.github.io/rome/) - Used to generate RSS Feeds

## Contributing

Please read
[CONTRIBUTING.md](https://gist.github.com/PurpleBooth/b24679402957c63ec426)
for details on our code of conduct, and the process for submitting pull
requests to us.

## Versioning

We use [SemVer](http://semver.org/) for versioning. For the versions
available, see the [tags on this
repository](https://github.com/your/project/tags).

## Authors

* **Billie Thompson** - *Initial work* -
[PurpleBooth](https://github.com/PurpleBooth)

See also the list of
[contributors](https://github.com/your/project/contributors) who
participated in this project.

## License

This project is licensed under the MIT License - see the
[LICENSE.md](LICENSE.md) file for details

## Acknowledgments

* Hat tip to anyone whose code was used
* Inspiration
* etc
# References

[^1] https://casa.nrao.edu/Memos/229.html
[^2] https://fits.gsfc.nasa.gov/registry/fitsidi.html
<a name="msv2">1</a>: https://casa.nrao.edu/Memos/229.html<br>
<a name="fitsidi">2</a>: https://fits.gsfc.nasa.gov/registry/fitsidi.html

+ 54
- 0
check-multipart-fits.md

@@ -0,0 +1,54 @@
# `check-multipart-fits.py`

Checks if there is data loss between the boundaries of subsequent IDI files
produced from a single MeasurementSet. Reports how much observing time (in
seconds) was lost across all boundaries.

## Usage

The script does not have a `-h`/`--help` command line option. In fact it has
no options at all.

Run as follows:
```bash
> check-multipart-fits.py '/path/to/exp/*.IDI*' '/path/to/other/*.IDI*'
```
and wait. The single quotes are necessary so that the script, rather than
the shell, expands the wildcards.

Note that hierarchical wildcards are supported too to dig through
directories of directories (etc) of FITS-IDI files:
```bash
> check-multipart-fits.py '/path/to/*/*/*.IDI*'
```
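
The wildcard handling described above can be sketched as follows. This is a minimal illustration that assumes the script relies on Python's standard `glob` module; the function name `expand_patterns` is hypothetical and not taken from the tool itself.

```python
import glob


def expand_patterns(patterns):
    """Expand each (quoted) wildcard pattern into a sorted list of paths.

    Hypothetical helper: the real script may expand patterns differently.
    """
    paths = []
    for pattern in patterns:
        # glob.glob handles '*' in directory components as well,
        # which is what makes '/path/to/*/*/*.IDI*' work.
        paths.extend(sorted(glob.glob(pattern)))
    return paths
```

Note that a plain lexical sort places `IDI10` before `IDI2`, so a real implementation would need to order chunks numerically before comparing neighbours.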

### Background
JIVE limits FITS-IDI files to <= 2 GB per file in order not to upset
32-bit software. If a MeasurementSet is larger than this, `tConvert` will
break it up into chunks of at most 2 GB each, naming the files as follows:

```bash
> tConvert input.ms OUTPUT.IDI
...
> ls OUTPUT*
OUTPUT.IDI1
OUTPUT.IDI2
...
OUTPUT.IDIn
```

Due to a bug/erroneous assumption made inside `tConvert`, an unfixed version
could trigger data loss between `OUTPUT.IDIm` and `OUTPUT.IDI(m+1)`.

This tool compares the first time stamp in chunk `m+1` with the last time stamp
in chunk `m` and sums the differences across all `n-1` chunk boundaries.

There is a caveat: if a scan boundary/gap falls such that _all_ data from
the last scan before the gap is in chunks `<=m` and the data for the first scan
after the gap is in chunk `m+1`, that gap counts as a loss of `gap duration` seconds.

However, the chances of that are minimal, and no such occurrence has been
observed while checking the whole EVN archive (approx. 111 TB of FITS files at the
time of writing).
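
The boundary check described above can be sketched in a few lines. This is an illustration only, not the script's actual code: the input layout (one `(first_timestamp, last_timestamp)` pair per chunk, in chunk order) and the `expected_step` parameter are assumptions made for the example.

```python
from typing import List, Tuple


def lost_time(chunks: List[Tuple[float, float]], expected_step: float = 0.0) -> float:
    """Sum the time gaps (in seconds) across all chunk boundaries.

    chunks        -- (first_timestamp, last_timestamp) per chunk, in chunk order
    expected_step -- hypothetical allowance for the normal spacing between
                     two consecutive integrations (0.0 = count every gap)
    """
    total = 0.0
    # Pair chunk m with chunk m+1: n chunks give n-1 boundaries.
    for (_, last_prev), (first_next, _) in zip(chunks, chunks[1:]):
        gap = first_next - last_prev - expected_step
        if gap > 0:
            total += gap
    return total
```

With three chunks covering `(0, 10)`, `(12, 20)` and `(21, 30)` seconds and an `expected_step` of 1.0, only the first boundary contributes to the total.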


+ 150
- 0
compare-ms-idi-meta.md

@@ -0,0 +1,150 @@
# `compare-ms-idi-meta.py`

This script is similar to `compare-ms-idi.py` in operation, only it
compares the meta data in all given data sets:
- antenna properties (position, offset, mount, diameter)
- source properties (position)
- spectral windows (lowest, highest frequency, number of spectral channels)

Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files and/or check different
`j2ms2` versions producing the same (or explicitly verify expected different!) meta data in the MeasurementSets.

## The keys on which data is compared

The keys used to compare the meta data are antenna name, source name and lowest frequency of the spectral window. They are printed as `('antenna', 'station name')`, `('source', 'source name')` and `('frequency', 'frequency-in-Hz')`.

## Operation with one or more input files

If only one data set is given, or there are no keys with more than one
value, all keys and their associated properties are just displayed. This
mode can be used for quick meta data inspection.

If more than one data set is specified the tool will report:
- extra keys that are not common to all data sets, and in which data set they occur and how many times
- for each key common across all given data sets, the properties found in each
  data set are compared.
  Note that since each FITS-IDI chunk (see [check-multipart-fits.md](check-multipart-fits.md))
  has its own antenna, source and frequency table, the tool treats each
  individual `*.IDIn` chunk as a separate data set in order to verify that
  _all_ FITS-IDI chunks contain the same meta data.
  The report compresses the multi-way diff by taking one reference value
  (indicating which data set it came from), computing the diff with respect to all
  other values and aggregating data sets by equivalent diff.

  Only the properties that differ are displayed, together with the data
  set(s) in which these specific values were found.
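
The diff compression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the function name `compress_diff` and the input layout (a list of `(data set name, properties dict)` pairs, with the first entry acting as the reference) are hypothetical.

```python
from collections import defaultdict


def compress_diff(datasets):
    """Group data sets by their diff against the first (reference) data set.

    datasets -- list of (name, properties) pairs; hypothetical layout in
                which properties maps a property name to a hashable value.
    Returns (names equal to the reference, {diff: [names sharing that diff]}).
    """
    ref_name, ref_props = datasets[0]
    same_as_ref = [ref_name]
    groups = defaultdict(list)
    for name, props in datasets[1:]:
        # A diff is the sorted tuple of (property, value, reference value)
        # for every property that deviates from the reference.
        diff = tuple(sorted(
            (prop, value, ref_props.get(prop))
            for prop, value in props.items()
            if value != ref_props.get(prop)
        ))
        if diff:
            groups[diff].append(name)
        else:
            same_as_ref.append(name)
    return same_as_ref, dict(groups)
```

Using the diff itself as the dictionary key is what collapses any number of data sets with identical deviations into a single report entry.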


The usage is exactly the same as for `compare-ms-idi.py`, so only the output
that differs is shown.

## An extreme example

A use case for this appeared recently. It was requested that the antenna
diameter be propagated from the VEX file into the MeasurementSet and
subsequently into the FITS-IDI file(s).

A VEX file was hand-edited to supply an antenna diameter of 100.0 m for the
Ef antenna.

In total six files were created:
- a MeasurementSet by the current production `j2ms2`, which does not handle antenna diameter at all
- a MeasurementSet by a fixed `j2ms2` that should propagate the diameter from VEX to MeasurementSet
- each MeasurementSet was converted to FITS-IDI by two versions of
  `tConvert`:
    - the current production `tConvert`, which does not handle antenna
      diameter at all
    - the new one that should propagate the antenna diameter from
      MeasurementSet to FITS-IDI


Running the six-way meta-data diff with, for good measure, a (partially)
disjoint data set thrown in to highlight the tool's operation, i.e. a seven-way meta data diff in one go:

```bash
> compare-ms-idi-meta.py --ms rsm02-prod-j2ms2.ms \
    --ms rsm02-antdiam-j2ms2.ms \
    --idi RSM02-PRODJ2MS2-PRODTCONVERT.IDI \
    --idi RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI \
    --idi RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI \
    --idi RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI \
    --idi ../../eg063d/eg063-prod-dev.IDI*
```

yields a fair bit of output.

### The extra keys

As with `compare-ms-idi.py`, any keys not common across all data sets are listed first.
Their values are not reported:
```
==== Problem report ====
IDI: ../../eg063d: eg063-prod-dev.IDI*
Extra keys:
('antenna', 'Ar') found 3 times
... snip ...
('frequency', '4926990000.000') found 3 times
... snip ...
('source', '0133+476') found 3 times
... snip ...
```

### The diff report

The most interesting output is for the Ef station - for that is the one
station where the antenna diameter was added in the VEX file. It is expected
that propagation of this value into the actual FITS-IDI file(s) depends on
specifically which combination of `j2ms2` and `tConvert` versions is used.

This is what the tool reports for the key `('antenna', 'Ef')`:

```
('antenna', 'Ef') :
ANT Ef: xyz=[4033947.2616 486990.7866 4900430.9915] d=0.0 (0)
offset=[0.0145 0. 0. ] mount=alt-az found in:
MS: rsm02-prod-j2ms2.ms
DIFF: diameter: 100.0 vs 0.0 found in: (1)
MS: rsm02-antdiam-j2ms2.ms
DIFF: mount: UnknownMNTSTA#0 vs alt-az found in: (2)
IDI: RSM02-ANTDIAMJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-PRODTCONVERT.IDI
IDI: RSM02-PRODJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: diameter: 100.0 vs 0.0, (3)
mount: UnknownMNTSTA#0 vs alt-az found in:
IDI: RSM02-ANTDIAMJ2MS2-ANTDIAMTCONVERT.IDI
DIFF: offset: [0.013 0. 0. ] vs [0.0145 0. 0. ],(4)
mount: UnknownMNTSTA#0 vs alt-az found in:
IDI: ../../eg063d/eg063-prod-dev.IDI[1-3]
```

Comparing the reference value **(0)** (the first value found, in
`rsm02-prod-j2ms2.ms`) to the values found in the other data sets, four (4)
unique diffs are found.

- **(0)** The reference value. It contains default antenna diameter of 0.0,
which is used if the diameter is not found in the data set or the default
of 0.0 was written by code that does not propagate the actual value

- **(1)** The diameter-aware `j2ms2` has propagated the 100.0 m diameter
to the MeasurementSet. That is apparently the only diff between the two
MeasurementSets, exactly as expected.

- **(2)** Both versions of `tConvert` can produce FITS-IDI files showing the
same diff. The diameter-unaware `tConvert` never writes the diameter to the
FITS-IDI file, and the diameter-aware `tConvert` converting a
MeasurementSet produced by the diameter-unaware `j2ms2` also
writes the default value of 0.0. Again this is expected.

- **(3)** This is the fully updated antenna diameter toolchain: diameter-aware
`j2ms2` and `tConvert`. The difference with the reference data set
includes the 100.0 m diameter being found in this FITS-IDI file.

- **(4)** The rather disjoint FITS-IDI files from a different experiment thrown
in show that the station offset used in experiment EG063 was different
than the one used in the reference data set.

Finally, it shows that the station mount from the MeasurementSet ("alt-az")
does not get translated to a known FITS-IDI enumerated value (see the [FITS-IDI standard](https://fits.gsfc.nasa.gov/registry/fitsidi.html)).
This may need looking into.

+ 113
- 0
compare-ms-idi.md

@@ -0,0 +1,113 @@
# `compare-ms-idi.py`

Use this script to verify `tConvert`'s operation when it comes to exporting
all MeasurementSet data to a (set of) FITS-IDI files or `j2ms2` operation by
comparing different MeasurementSets. This script collects
the integrated weights for all (baseline, source) combinations ("key"
hereafter), and counts how many times each integration is present.

If more than one data set is specified it will report:
- extra keys that are not common to all data sets, and in which data set
these occur
- for each key common across all given data sets, the tool will compare the integrated
weights and report differences. If those numbers are not equal some data
is missing in some of the data set(s) - it is reported which value was
found in which dataset.
- if a key has duplicated time stamps this means that the same data was
present in the data set more than once and these occurrences are reported
too
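
The multiway comparison listed above can be sketched as follows. This is a simplified illustration, not the script's code: the function name `multiway_diff` and the input layout (data set name mapped to `{key: (integrated_weight, count)}`) are assumptions, at least one data set must be given, and values are compared by exact equality, which a real tool may relax for floating-point weights.

```python
def multiway_diff(datasets):
    """Compare per-key values across any number of data sets.

    datasets -- {data set name: {key: (integrated_weight, count)}},
                a hypothetical layout with at least one data set.
    Returns (extra keys per data set, common keys whose values differ).
    """
    # Keys present in every data set are the only ones that can be compared.
    common = set.intersection(*(set(d) for d in datasets.values()))
    extra = {
        name: sorted(set(d) - common)
        for name, d in datasets.items()
        if set(d) - common
    }
    differing = {}
    for key in sorted(common):
        per_set = {name: d[key] for name, d in datasets.items()}
        if len(set(per_set.values())) > 1:
            differing[key] = per_set
    return extra, differing
```

Passing a single data set yields no extra keys and no differences, which matches the summary-only behaviour described for one input.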

Usage:
```bash
> compare-ms-idi.py --ms /path/to/input.ms \
    --idi /path/to/output.idi*
```

Note: the wildcards should not be escaped in this case.

Multiple `--idi` and/or `--ms` options are supported. The script effectively performs a multiway diff across _all_ datasets given on the command line. It is also possible to compare only MeasurementSets, or only FITS-IDI files - the script doesn't care, even when they're (partially) disjoint.

If only one data set is passed on the command line no diff will be computed but the found keys are displayed - effectively summarizing the data set for exposure time per baseline per source.

The script has `-h`/`--help`.


## Example output

**The uninteresting case**

This run compares two MeasurementSets that end up comparing equal, despite being created by two different versions of `j2ms2` - the changes in `j2ms2` have not affected the data in the produced MeasurementSet:

```bash
> compare-ms-idi.py --ms rsm02-prod-j2ms2.ms \
    --ms rsm02-antdiam-j2ms2.ms
Successful readonly open of default-locked table rsm02-prod-j2ms2.ms: 22 columns, 94752 rows
Successful readonly open of default-locked table rsm02-antdiam-j2ms2.ms: 22 columns, 94752 rows
Checked 2 data sets, 90 common keys
>
```

**This is a more interesting case - an instrumented one.**

For one experiment a subset of the data was converted to a second (much)
smaller MeasurementSet, which was subsequently converted to FITS-IDI as
well.

The run compared the full MeasurementSet, the partial MeasurementSet
and the FITS-IDI file produced from the partial MeasurementSet.

```bash
> compare-ms-idi.py --ms rsm02-dev.ms \
    --ms rsm02-prod-j2ms2.ms \
    --idi RSM02-PRODJ2MS2.IDI
```

It produces a lot of output, which can be summarized in three sections:

### The extra keys

The full data set had some stations coming in later and/or sources that were
only observed later than those present in the partial MeasurementSet. As
such there are keys that are only present in the full MeasurementSet.

These types of keys - those that cannot be compared - are listed first:
```
==== Problem report ====
MS: rsm02-dev.ms
Extra keys:
('T6T6', 'J1310+3220') found 89 times
('JbTr', 'J1310+3220') found 89 times
('McYs', 'J1310+3220') found 89 times
...
```

### The common keys that have different values

For keys whose integrated weight differs between any of the data sets, the
following is displayed:

```
('EfTr', 'J1427+2632') :
6656.00s wgt= 53246.27 3328 times in MS: rsm02-dev.ms
416.00s wgt= 3328.00 208 times in MS: rsm02-prod-j2ms2.ms
416.00s wgt= 3328.00 208 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1419+2706') :
2550.00s wgt= 40775.69 1275 times in MS: rsm02-dev.ms
152.00s wgt= 2432.00 76 times in MS: rsm02-prod-j2ms2.ms
152.00s wgt= 2432.00 76 times in IDI: RSM02-PRODJ2MS2.IDI*
('EfYs', 'J1427+2632') :
6448.00s wgt=103158.98 3224 times in MS: rsm02-dev.ms
208.00s wgt= 3328.00 104 times in MS: rsm02-prod-j2ms2.ms
208.00s wgt= 3328.00 104 times in IDI: RSM02-PRODJ2MS2.IDI*
```
This shows that the data from the partial MeasurementSet and the
corresponding FITS-IDI file are consistent, but that a significant amount of
data is missing compared to the full MeasurementSet.
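
To put a number on "a significant amount", the exposure times from the report above can be compared directly. The snippet below is only a convenience sketch: the values are copied from the example output, and the helper name `fraction_present` is hypothetical.

```python
# (seconds in the full MS, seconds in the partial MS), copied from the report
report = {
    ("EfTr", "J1427+2632"): (6656.00, 416.00),
    ("EfYs", "J1419+2706"): (2550.00, 152.00),
    ("EfYs", "J1427+2632"): (6448.00, 208.00),
}


def fraction_present(full_seconds, partial_seconds):
    """Fraction of the full exposure time that is present in the partial set."""
    return partial_seconds / full_seconds


for key, (full, partial) in report.items():
    print(key, f"{100 * fraction_present(full, partial):.2f}% of the full exposure")
```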

### The summary line

The tool always displays a one-line summary:
```
Checked 3 data sets, 90 common keys with 90 problems identified and 108
non-common keys in 1 formats
```
