collocate

Collocator.collocate(primary, secondary, max_interval=None, max_distance=None, bin_factor=1, magnitude_factor=10, tunnel_limit=None, start=None, end=None, leaf_size=40)

Find collocations between two xarray.Dataset objects

Collocations are two or more data points that are located close to each other in space and/or time.

Each xarray.Dataset must contain the variables time, lat and lon. If they are coordinates, they must be unique; otherwise, their coordinates must be unique, i.e. they cannot contain duplicated values. time must be a 1-dimensional array with a numpy.datetime64-like data type. lat and lon can be gridded, i.e. they can be multi-dimensional. However, they must always share the first dimension with time. lat must contain latitudes between -90 (south) and 90 (north) degrees, and lon must contain longitudes between -180 (west) and 180 (east) degrees. See below for examples.
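
Some datasets store longitudes in the 0 to 360 degree convention, which does not meet the range required above. The following is a minimal sketch of one way to wrap such values before building the dataset. The helper name wrap_longitude is hypothetical and not part of this library.

```python
def wrap_longitude(lon):
    """Map a longitude in degrees onto the interval [-180, 180).

    Hypothetical helper for illustration; not part of the library.
    """
    return (lon + 180.0) % 360.0 - 180.0


# Longitudes given in the 0..360 convention
lons_0_360 = [0.0, 90.0, 190.0, 359.0]
wrapped = [wrap_longitude(lon) for lon in lons_0_360]
print(wrapped)
```

Note that this sketch maps +180 to -180, which is still inside the allowed range.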

The collocation search is performed with a fast ball tree implementation from scikit-learn. The ball tree is cached and reused whenever the data points from primary or secondary have not changed.

If you want to find collocations between FileSet objects, use collocate_filesets instead.

Parameters
  • primary – A tuple of a string with the dataset name and an xarray.Dataset that fulfills the specifications from above. Can also be an xarray.Dataset only; the name is then automatically set to primary.

  • secondary – A tuple of a string with the dataset name and an xarray.Dataset that fulfills the specifications from above. Can also be an xarray.Dataset only; the name is then automatically set to secondary.

  • max_interval – Either a number as a time interval in seconds, a string containing a time with a unit (e.g. 100 minutes) or a timedelta object. This is the maximum time interval between two data points. If this is None, the data will be searched for spatial collocations only.

  • max_distance – Either a number as a length in kilometers or a string containing a length with a unit (e.g. 100 meters). This is the maximum distance between two data points to meet the collocation criteria. If this is None, the data will be searched for temporal collocations only. Either max_interval or max_distance must be given.

  • tunnel_limit – Maximum distance in kilometers at which to switch from the tunnel to the haversine distance metric. By default, this algorithm uses the tunnel metric, which simply transforms all latitudes and longitudes to 3D Cartesian space and calculates their Euclidean distance. This is faster than the haversine metric but produces an error that grows with larger distances. When searching for distances exceeding this limit (max_distance is greater than this parameter), the haversine metric is used, which is more accurate but takes more time. Default is 1000 kilometers.

  • magnitude_factor – Since building new trees is expensive, this algorithm tries to reuse the last tree when possible (e.g. for data with a fixed grid). However, building the tree with the larger dataset and querying it with the smaller dataset is faster than vice versa. Which of the two strategies is followed can therefore affect the overall performance. This parameter is the factor by which one dataset must be larger than the other before an already-built ball tree is thrown away and rebuilt with the larger dataset.

  • leaf_size – The size of one leaf in the ball tree. The larger the leaf size, the faster the tree is built but the slower it is queried. The optimal leaf size is dataset-dependent. Default is 40.

  • bin_factor – When using a temporal criterion via max_interval, the data will be temporally binned to speed up the search. The bin size is bin_factor * max_interval. The best bin factor may be dataset-dependent, so you can use this parameter to fine-tune the performance.

  • start – Start date. Only data from this date onwards are collocated. Can be either a datetime object or a string (“YYYY-MM-DD hh:mm:ss”). Year, month and day are required. Hours, minutes and seconds are optional. If not given, it is datetime.min by default.

  • end – End date. Same format as “start”. If not given, it is datetime.max per default.
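
The trade-off behind tunnel_limit can be illustrated with a small, self-contained sketch that compares the two distance metrics. This is not the library's implementation: the function names are hypothetical and the mean Earth radius of 6371 km is an assumption of this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def to_cartesian(lat, lon):
    """Convert latitude/longitude in degrees to 3D Cartesian coordinates in km."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (
        EARTH_RADIUS_KM * math.cos(lat_r) * math.cos(lon_r),
        EARTH_RADIUS_KM * math.cos(lat_r) * math.sin(lon_r),
        EARTH_RADIUS_KM * math.sin(lat_r),
    )


def tunnel_distance(lat1, lon1, lat2, lon2):
    """Straight-line (chord) distance through the Earth in km."""
    return math.dist(to_cartesian(lat1, lon1), to_cartesian(lat2, lon2))


def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance along the surface in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Compare both metrics for a small and a large separation along the equator
for dlon in (1.0, 90.0):
    tunnel = tunnel_distance(0.0, 0.0, 0.0, dlon)
    haversine = haversine_distance(0.0, 0.0, 0.0, dlon)
    rel_error = (haversine - tunnel) / haversine
    print(f"{dlon:5.1f} deg: tunnel={tunnel:9.2f} km, "
          f"haversine={haversine:9.2f} km, relative error={rel_error:.2e}")
```

For a separation of one degree (about 111 km) the two metrics are almost identical, while for 90 degrees the tunnel distance is roughly 10 % shorter than the great-circle distance. The tunnel distance is always the shorter of the two, because it is the chord of the great-circle arc.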

Returns

None if no collocations were found. Otherwise, an xarray.Dataset with the collocated data in compact form. It consists of three groups (groups of variables containing / in their name): the primary, the secondary and the Collocations group. If you passed primary or secondary with your own names, those names will be used in the output. The Collocations group contains information about the found collocations. Collocations/pairs is a 2xN array where N is the number of found collocations. It contains the indices of the primary and secondary data points that are collocated. The indices refer to the data points stored in the primary or secondary group. Collocations/interval and Collocations/distance are the intervals and distances between the collocated points in seconds and kilometers, respectively. Collocations in compact form are efficient to save to disk, but they can be complicated to use directly. Consider applying collapse() or expand() to them.
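
To make the compact form more concrete, the following sketch shows how a 2xN pairs array can be used to look up matching data points. It uses plain Python lists and made-up values rather than an actual returned dataset, so all variable names and numbers here are hypothetical.

```python
# Hypothetical data points of the primary and secondary group
primary_lat = [10.0, 20.0, 30.0]
secondary_lat = [10.1, 19.9, 50.0, 30.2]

# Hypothetical pairs array of shape 2xN (here N = 3):
# the first row indexes the primary, the second row the secondary group
pairs = [
    [0, 1, 2],
    [0, 1, 3],
]

primary_idx, secondary_idx = pairs

# Each column of pairs describes one collocation
matched = [
    (primary_lat[p], secondary_lat[s])
    for p, s in zip(primary_idx, secondary_idx)
]
print(matched)
```

The secondary point at index 2 does not appear in any column of pairs, so it has no collocation partner in this made-up example.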

Examples