pymove.preprocessing package

Submodules

pymove.preprocessing.compression module

Compression operations.

compress_segment_stop_to_point

pymove.preprocessing.compression.compress_segment_stop_to_point(move_data: pandas.core.frame.DataFrame, label_segment: str = 'segment_stop', label_stop: str = 'stop', point_mean: str = 'default', drop_moves: bool = False, label_id: str = 'id', dist_radius: float = 30, time_radius: float = 900, inplace: bool = False) → pandas.core.frame.DataFrame[source]

Compress the trajectories using the stop points in the dataframe.

Compresses each segment to a single point by setting lat_mean and lon_mean for that segment.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_segment (String, optional) – The label of the column containing the ids of the formed segments, i.e. the new split id, by default SEGMENT_STOP
  • label_stop (String, optional) – The name of the column that indicates if a point is a stop, by default STOP
  • point_mean (String, optional) – Indicates whether the mean points should be calculated using centroids or the point that repeats the most, by default ‘default’
  • drop_moves (Boolean, optional) – If set to true, the moving points will be dropped from the dataframe, by default False
  • label_id (String, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • dist_radius (Double, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. The first step in this function is segmenting the trajectory. The segments are used to find the stop points. The dist_radius defines the distance used in the segmentation, by default 30
  • time_radius (Double, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. The time_radius is used to determine if a segment is a stop. If the user stayed in the segment for a time greater than time_radius, then the segment is a stop, by default 900

  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Data with 3 additional features: segment_stop, lat_mean and lon_mean, or None. segment_stop indicates the trajectory segment to which the point belongs. lat_mean and lon_mean depend on point_mean: if the default option is used, they are defined by the point that repeats most within the segment; if the centroid option is used, they are defined by the centroid of all points in the segment.

Return type:

DataFrame
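
To make the two point_mean options concrete, the following plain-Python sketch picks a representative point for one stop segment, either as the most repeated coordinate pair or as the centroid. The helper name representative_point, the tuple-based input and the option string 'centroid' are assumptions made for illustration; this is not pymove's implementation.

```python
from collections import Counter
from statistics import fmean


def representative_point(points, point_mean="default"):
    """Return one (lat, lon) pair that stands in for a stop segment.

    points: list of (lat, lon) tuples that belong to the same segment.
    point_mean: 'default' picks the most repeated point,
                'centroid' averages latitudes and longitudes.
    """
    if point_mean == "centroid":
        lats, lons = zip(*points)
        return (fmean(lats), fmean(lons))
    # most_common(1) returns a list holding one (point, count) pair
    return Counter(points).most_common(1)[0][0]


segment = [(39.90, 116.40), (39.90, 116.40), (39.92, 116.43)]
mode_point = representative_point(segment)
centroid_point = representative_point(segment, point_mean="centroid")
```

In this sketch the most repeated point always coincides with a recorded position, while the centroid may fall on a location that was never recorded.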

pymove.preprocessing.filters module

Filtering operations.

get_bbox_by_radius, by_bbox, by_datetime, by_label, by_id, by_tid, clean_consecutive_duplicates, clean_gps_jumps_by_distance, clean_gps_nearby_points_by_distances, clean_gps_nearby_points_by_speed, clean_gps_speed_max_radius, clean_trajectories_with_few_points, clean_trajectories_short_and_few_points, clean_id_by_time_max

pymove.preprocessing.filters.by_bbox(move_data: DataFrame, bbox: tuple[float, float, float, float], filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]

Filters points of the trajectories according to specified bounding box.

Parameters:
  • move_data (dataframe) – The input trajectories data
  • bbox (tuple) – Tuple of 4 elements, containing the minimum and maximum values of latitude and longitude of the bounding box.
  • filter_out (boolean, optional) – If set to false the function will return the trajectories points within the bounding box, and the points outside otherwise, by default False
  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Returns dataframe with trajectories points filtered by bounding box or None

Return type:

DataFrame
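
The effect of filter_out can be sketched in plain Python as follows. The function name filter_by_bbox, the list-of-dicts input, the inclusive boundaries and the bbox order (lat_min, lon_min, lat_max, lon_max) are all assumptions, since the text above does not state them; pymove itself works on dataframes.

```python
def filter_by_bbox(points, bbox, filter_out=False):
    """Keep points inside bbox, or the points outside it if filter_out is True.

    bbox is assumed to be ordered (lat_min, lon_min, lat_max, lon_max).
    """
    lat_min, lon_min, lat_max, lon_max = bbox

    def inside(p):
        return lat_min <= p["lat"] <= lat_max and lon_min <= p["lon"] <= lon_max

    # inside(p) != filter_out inverts the selection when filter_out is True
    return [p for p in points if inside(p) != filter_out]


points = [
    {"id": 1, "lat": 39.90, "lon": 116.40},
    {"id": 1, "lat": 39.95, "lon": 116.45},
    {"id": 2, "lat": 40.50, "lon": 117.00},
]
bbox = (39.80, 116.30, 40.00, 116.50)
within = filter_by_bbox(points, bbox)
outside = filter_by_bbox(points, bbox, filter_out=True)
```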

pymove.preprocessing.filters.by_datetime(move_data: DataFrame, start_datetime: str | None = None, end_datetime: str | None = None, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]

Filters trajectories points according to specified time range.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • start_datetime (str) – The start date and time (Datetime format) of the time range, by default None
  • end_datetime (str) – The end date and time (Datetime format) of the time range, by default None
  • filter_out (bool, optional) – If set to true, the function will return the points of the trajectories with timestamp outside the time range. The points within the time range will be returned if filter_out is False, by default False
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Returns dataframe with trajectories points filtered by time range or None

Return type:

DataFrame
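
A time-range filter of this kind might look like the sketch below. The name filter_by_datetime and the list-of-dicts input are hypothetical, and treating a bound left as None as open is an assumption rather than documented pymove behavior.

```python
from datetime import datetime


def filter_by_datetime(points, start=None, end=None, filter_out=False):
    """Keep points whose timestamp lies in [start, end].

    start and end are ISO-like strings; a bound left as None is treated
    as open. With filter_out=True the selection is inverted.
    """
    start_dt = datetime.fromisoformat(start) if start else datetime.min
    end_dt = datetime.fromisoformat(end) if end else datetime.max

    def in_range(p):
        return start_dt <= p["datetime"] <= end_dt

    return [p for p in points if in_range(p) != filter_out]


points = [
    {"id": 1, "datetime": datetime(2008, 10, 23, 5, 53)},
    {"id": 1, "datetime": datetime(2008, 10, 23, 10, 37)},
    {"id": 1, "datetime": datetime(2008, 10, 24, 1, 0)},
]
kept = filter_by_datetime(points, "2008-10-23 05:00:00", "2008-10-23 11:00:00")
dropped = filter_by_datetime(
    points, "2008-10-23 05:00:00", "2008-10-23 11:00:00", filter_out=True
)
```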

pymove.preprocessing.filters.by_id(move_data: DataFrame, id_: int | None = None, label_id: str = 'id', filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]

Filters trajectories points according to specified trajectory id.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • id_ (int) – Specifies the id used to filter the trajectories points
  • label_id (str, optional) – The label of the column which contains the id of the trajectories, by default TRAJ_ID
  • filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose id differs from id_. The points with the specified id will be returned if filter_out is False, by default False
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Returns dataframe with trajectories points filtered by id or None

Return type:

DataFrame

pymove.preprocessing.filters.by_label(move_data: DataFrame, value: Any, label_name: str, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]

Filters trajectories points according to specified value and column label.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • value (Any) – Specifies the value used to filter the trajectories points
  • label_name (str) – Specifies the label of the column used in the filtering
  • filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose value in label_name differs from value. The points that match value will be returned if filter_out is False, by default False
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Returns dataframe with trajectories points filtered by label or None

Return type:

DataFrame
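
Filtering by a column value reduces to an equality test that filter_out inverts. The sketch below uses a hypothetical filter_by_label helper over a list of dicts; filtering by id is the same operation with label_name set to the id column.

```python
def filter_by_label(rows, value, label_name, filter_out=False):
    """Keep rows where row[label_name] == value, or the rest if filter_out."""
    return [row for row in rows if (row[label_name] == value) != filter_out]


rows = [
    {"id": 1, "tid": "1_a", "lat": 39.90},
    {"id": 1, "tid": "1_b", "lat": 39.91},
    {"id": 5, "tid": "5_a", "lat": 40.01},
]
only_id_1 = filter_by_label(rows, 1, "id")
not_id_1 = filter_by_label(rows, 1, "id", filter_out=True)
```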

pymove.preprocessing.filters.by_tid(move_data: DataFrame, tid_: str | None = None, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]

Filters trajectories points according to a specified trajectory tid.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • tid_ (str) – Specifies the tid used to filter the trajectories points
  • label_tid (str, optional) – The label of the column in the user dataframe which contains the tid of the trajectories, by default None
  • filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose tid differs from tid_. The points with the specified tid will be returned if filter_out is False, by default False
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

Returns a dataframe with trajectories points filtered or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_consecutive_duplicates(move_data: DataFrame, subset: int | str | None = None, keep: str | bool = 'first', inplace: bool = False) → DataFrame | None[source]

Removes consecutive duplicate rows of the Dataframe.

Optionally, only certain columns can be considered.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • subset (Array of str, optional) – Specifies the column label or sequence of labels considered for identifying duplicates, by default None
  • keep ('first', 'last', optional) – If keep is set as first, all the duplicates except for the first occurrence will be dropped. On the other hand if set to last, all duplicates except for the last occurrence will be dropped. If set to False, all duplicates are dropped. by default ‘first’
  • inplace (boolean, optional) – if set to true the original dataframe will be altered, the duplicates will be dropped in place, otherwise a copy will be returned, by default False
Returns:

The filtered trajectories points without consecutive duplicates or None

Return type:

DataFrame
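
The three keep options differ only in which row of a run of consecutive duplicates survives. A minimal sketch, assuming rows are dicts and using a hypothetical clean_consecutive helper built on itertools.groupby (which groups adjacent equal keys, so non-adjacent repeats are left alone):

```python
from itertools import groupby


def clean_consecutive(rows, subset=None, keep="first"):
    """Collapse runs of consecutive rows that are equal on `subset`.

    subset: list of keys to compare, or None to compare every key.
    keep: 'first', 'last', or False (drop every row of a repeated run).
    """
    def key(row):
        cols = subset if subset is not None else sorted(row)
        return tuple(row[c] for c in cols)

    result = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if len(run) == 1 or keep == "first":
            result.append(run[0])
        elif keep == "last":
            result.append(run[-1])
        # keep=False: every row of a repeated run is dropped
    return result


rows = [
    {"lat": 39.90, "lon": 116.40, "t": 0},
    {"lat": 39.90, "lon": 116.40, "t": 5},
    {"lat": 39.95, "lon": 116.45, "t": 10},
    {"lat": 39.90, "lon": 116.40, "t": 15},
]
first = clean_consecutive(rows, subset=["lat", "lon"])
last = clean_consecutive(rows, subset=["lat", "lon"], keep="last")
none = clean_consecutive(rows, subset=["lat", "lon"], keep=False)
```

The last row repeats the coordinates of the first two but is not adjacent to them, so it is kept in all three cases.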

pymove.preprocessing.filters.clean_gps_jumps_by_distance(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', jump_coefficient: float = 3.0, threshold: float = 1, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes the trajectories points that are outliers from the dataframe.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • jump_coefficient (float, optional) – by default 3
  • threshold (float, optional) – Minimum value that the distance features must have in order to be considered outliers, by default 1
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories without the gps jumps or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_gps_nearby_points_by_distances(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', radius_area: float = 10.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes points from the trajectories with smaller distance from the point before.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • radius_area (float, optional) – Specifies the minimum distance a point must have to its previous point in order not to be dropped, by default 10
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories without the gps nearby points by distance or None

Return type:

DataFrame
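
As an illustration of the idea, the sketch below drops every point that lies no farther than radius_area from the point before it in the original sequence. It assumes distances in meters computed with the haversine formula on a spherical Earth and a strict "greater than" comparison; the function names are hypothetical and the code is not pymove's implementation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def drop_nearby_points(track, radius_area=10.0):
    """Keep a point only if it is farther than radius_area from its predecessor."""
    kept = [track[0]]
    for prev, point in zip(track, track[1:]):
        if haversine(*prev, *point) > radius_area:
            kept.append(point)
    return kept


track = [(39.9000, 116.4000), (39.90001, 116.40001), (39.9010, 116.4010)]
cleaned = drop_nearby_points(track)
```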

pymove.preprocessing.filters.clean_gps_nearby_points_by_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', speed_radius: float = 0.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes points from the trajectories with smaller speed of travel.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • speed_radius (float, optional) – Specifies the minimum speed a point must have from its previous point in order not to be dropped, by default 0
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories without the gps nearby points by speed or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_gps_speed_max_radius(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', speed_max: float = 50.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes trajectories points with higher speed.

Given any point p of the trajectory, the point will be removed if either of the following holds: the travel speed from the point before p to p is greater than the maximum speed between adjacent points set by the user, or the travel speed between p and the next point is greater than that maximum. When the cleaning is done, the function updates the time and distance features in the dataframe and calls itself again. It finishes processing when it can no longer find points that exceed the speed limit.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • speed_max (float, optional) – Indicates the maximum value a point speed_to_prev and speed_to_next should have, in order not to be dropped, by default 50
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories without the points that exceed the maximum speed, or None

Return type:

DataFrame
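
The repeat-until-clean rule described above can be sketched in a few lines. To stay short, this example assumes planar coordinates in meters and strictly increasing timestamps in seconds, given as (t, x, y) tuples, and it uses a loop instead of recursion; the names speed and clean_speed_max are hypothetical.

```python
from math import dist


def speed(a, b):
    """Speed in m/s between two (t, x, y) samples."""
    return dist(a[1:], b[1:]) / (b[0] - a[0])


def clean_speed_max(track, speed_max=50.0):
    """Repeatedly drop points whose speed to either neighbor exceeds speed_max."""
    while True:
        speeds = [speed(a, b) for a, b in zip(track, track[1:])]
        kept = []
        for i, point in enumerate(track):
            to_prev = speeds[i - 1] if i > 0 else 0.0
            to_next = speeds[i] if i < len(speeds) else 0.0
            if to_prev <= speed_max and to_next <= speed_max:
                kept.append(point)
        if len(kept) == len(track):  # nothing removed: the track is clean
            return kept
        track = kept  # speeds are recomputed on the next pass


track = [(0, 0, 0), (10, 100, 0), (20, 5000, 0), (30, 300, 0), (40, 400, 0)]
cleaned = clean_speed_max(track)
```

Under this rule the neighbors of a spike are removed together with it, because their speed to the spike also exceeds the limit; here only the first and last samples survive.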

pymove.preprocessing.filters.clean_id_by_time_max(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', time_max: float = 3600, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes the GPS points of ids whose time is greater than a user-defined limit.

Parameters:
  • move_data (dataframe) – The input data
  • label_id (str, optional) – The label of the column which contains the id of the trajectories, by default TRAJ_ID
  • time_max (float, optional) – Indicates the maximum time a set of points with the same id should have in order not to be dropped, by default 3600
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories with the maximum time.

Return type:

dataframe or None

pymove.preprocessing.filters.clean_trajectories_short_and_few_points(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'tid', min_trajectory_distance: float = 100, min_points_per_trajectory: int = 2, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes from the given dataframe the trajectories with too few points or too short a length.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – The label of the column which contains the tid of the trajectories, by default TID
  • min_trajectory_distance (float, optional) – Specifies the minimum length a trajectory must have in order not to be dropped, by default 100
  • min_points_per_trajectory (int, optional) – Specifies the minimum number of points a trajectory must have in order not to be dropped, by default 2
  • label_dtype (type, optional) – Represents column id type, by default np.float64.
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories with the minimum gps points and distance or None

Return type:

DataFrame

Notes

remove_tids_with_few_points must be performed before updating features.

pymove.preprocessing.filters.clean_trajectories_with_few_points(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_tid: str = 'tid', min_points_per_trajectory: int = 2, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Removes from the given dataframe the trajectories with too few points.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_tid (str, optional) – The label of the column which contains the tid of the trajectories, by default TID
  • min_points_per_trajectory (int, optional) – Specifies the minimum number of points a trajectory must have in order not to be dropped, by default 2
  • inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False
Returns:

The filtered trajectories without the minimum number of gps points or None

Return type:

DataFrame

Raises:

KeyError – If the label feature is not in the dataframe
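
Dropping short trajectories amounts to counting rows per tid and keeping the tids that reach the minimum. A sketch over a list of dicts, with a hypothetical drop_short_trajectories helper:

```python
from collections import Counter


def drop_short_trajectories(rows, label_tid="tid", min_points_per_trajectory=2):
    """Drop every row whose trajectory has fewer than the minimum points."""
    if any(label_tid not in row for row in rows):
        raise KeyError(f"{label_tid} not in data")
    counts = Counter(row[label_tid] for row in rows)
    return [r for r in rows if counts[r[label_tid]] >= min_points_per_trajectory]


rows = [
    {"tid": "a", "lat": 39.90},
    {"tid": "a", "lat": 39.91},
    {"tid": "a", "lat": 39.92},
    {"tid": "b", "lat": 40.10},
]
kept = drop_short_trajectories(rows)
```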

pymove.preprocessing.filters.get_bbox_by_radius(coordinates: tuple[float, float], radius: float = 1000) → tuple[float, float, float, float][source]

Defines minimum and maximum coordinates, given a distance radius from a point.

Parameters:
  • coordinates (tuple (lat, lon)) – The coordinates of the point
  • radius (float, optional) – The distance radius around the point, by default 1000
Returns:

coordinates min and max of the bbox

Return type:

array

References

https://mathmesquita.me/2017/01/16/filtrando-localizacao-em-um-raio.html
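
One common way to derive such a box is to convert the radius into an angular distance on a spherical Earth. The sketch below assumes the radius is in meters, a mean Earth radius of 6,371 km, and an output order of (lat_min, lon_min, lat_max, lon_max); none of these are confirmed by the text above, and the name bbox_around is hypothetical.

```python
from math import asin, cos, degrees, radians, sin

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)


def bbox_around(coordinates, radius=1000):
    """Return (lat_min, lon_min, lat_max, lon_max) around a (lat, lon) point.

    Valid away from the poles and the antimeridian, where the box would
    need to wrap around.
    """
    lat, lon = coordinates
    angular = radius / EARTH_RADIUS_M  # angular radius in radians
    dlat = degrees(angular)
    # Meridians converge toward the poles, so the longitude span widens
    dlon = degrees(asin(sin(angular) / cos(radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


bbox = bbox_around((39.90, 116.40), radius=1000)
```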

pymove.preprocessing.segmentation module

Segmentation operations.

bbox_split, by_dist_time_speed, by_max_dist, by_max_time, by_max_speed

pymove.preprocessing.segmentation.bbox_split(bbox: tuple[int, int, int, int], number_grids: int) → DataFrame[source]

Splits the bounding box in N grids of the same size.

Parameters:
  • bbox (tuple) – Tuple of 4 elements, containing the minimum and maximum values of latitude and longitude of the bounding box.
  • number_grids (int) – Determines the number of grids to split the bounding box.
Returns:

Returns the latitude and longitude coordinates of the grids after the split.

Return type:

DataFrame
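
How the box is divided is not specified above. As one possible reading, the sketch below cuts the box into number_grids equal strips along the longitude axis and returns them as tuples rather than a dataframe. The strip direction, the bbox order (lat_min, lon_min, lat_max, lon_max) and the name split_bbox are assumptions.

```python
def split_bbox(bbox, number_grids):
    """Split bbox into number_grids equal strips along the longitude axis.

    bbox is assumed to be ordered (lat_min, lon_min, lat_max, lon_max).
    """
    lat_min, lon_min, lat_max, lon_max = bbox
    step = (lon_max - lon_min) / number_grids
    return [
        (lat_min, lon_min + i * step, lat_max, lon_min + (i + 1) * step)
        for i in range(number_grids)
    ]


grids = split_bbox((39.0, 116.0, 40.0, 118.0), number_grids=4)
```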

pymove.preprocessing.segmentation.by_dist_time_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_dist_between_adj_points: float = 3000, max_time_between_adj_points: float = 900, max_speed_between_adj_points: float = 50.0, drop_single_points: bool = True, label_new_tid: str = 'tid_part', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Splits the trajectories into segments based on distance, time and speed.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • max_dist_between_adj_points (float, optional) – Specify the maximum distance a point should have from the previous point, in order not to be dropped, by default 3000
  • max_time_between_adj_points (float, optional) – Specify the maximum travel time between two adjacent points, by default 900
  • max_speed_between_adj_points (float, optional) – Specify the maximum speed of travel between two adjacent points, by default 50
  • drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
  • label_new_tid (str, optional) – The label of the column containing the ids of the formed segments, i.e. the new split id, by default TID_PART
  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Time, distance and speed features must be updated after split.
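
The splitting rule can be pictured as a running segment counter that increases whenever the step from the previous point exceeds any of the three thresholds. The sketch below assumes a single trajectory given as (t, x, y) tuples with planar coordinates in meters and time in seconds, and it omits the removal of single-point segments; segment_track is a hypothetical name.

```python
from math import dist


def segment_track(track, max_dist=3000, max_time=900, max_speed=50.0):
    """Assign a segment number to each (t, x, y) sample of one trajectory.

    A new segment starts whenever the step from the previous sample
    exceeds any of the three thresholds.
    """
    labels = [0] if track else []
    for prev, cur in zip(track, track[1:]):
        d = dist(prev[1:], cur[1:])
        dt = cur[0] - prev[0]
        v = d / dt if dt > 0 else float("inf")
        split = d > max_dist or dt > max_time or v > max_speed
        labels.append(labels[-1] + int(split))
    return labels


track = [(0, 0, 0), (60, 600, 0), (120, 1200, 0), (1500, 1300, 0), (1560, 5000, 0)]
labels = segment_track(track)
```

Here the fourth sample opens a new segment because of the long time gap, and the fifth because of the long jump.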

pymove.preprocessing.segmentation.by_max_dist(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_dist_between_adj_points: float = 3000, drop_single_points: bool = True, label_new_tid: str = 'tid_dist', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Segments the trajectories based on distance.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • max_dist_between_adj_points (float, optional) – Specify the maximum dist between two adjacent points, by default 3000
  • drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
  • label_new_tid (str, optional) – The label of the column containing the ids of the formed segments, i.e. the new split id, by default TID_DIST
  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.

pymove.preprocessing.segmentation.by_max_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_speed_between_adj_points: float = 50.0, drop_single_points: bool = True, label_new_tid: str = 'tid_speed', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Splits the trajectories into segments based on a maximum speed.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • max_speed_between_adj_points (float, optional) – Specify the maximum speed between two adjacent points, by default 50
  • drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
  • label_new_tid (str, optional) – The label of the column containing the ids of the formed segments, i.e. the new split id, by default TID_SPEED
  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.

pymove.preprocessing.segmentation.by_max_time(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_time_between_adj_points: float = 900.0, drop_single_points: bool = True, label_new_tid: str = 'tid_time', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Splits the trajectories into segments based on a maximum time.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • max_time_between_adj_points (float, optional) – Specify the maximum time between two adjacent points, by default 900
  • drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
  • label_new_tid (str, optional) – The label of the column containing the ids of the formed segments, i.e. the new split id, by default TID_TIME
  • inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.
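
To show how drop_single_points interacts with the split, the sketch below labels rows from several ids by time gap and then removes segments that are left with a single point. It assumes rows are dicts with 'id' and 't' (seconds), already sorted by id and time; split_by_max_time is a hypothetical name and the segment numbering is illustrative only.

```python
from collections import Counter


def split_by_max_time(rows, max_time=900.0, drop_single_points=True):
    """Label rows with a segment id that changes on large time gaps.

    rows: list of dicts with 'id' and 't' (seconds), sorted by id and time.
    """
    labeled, segment = [], 0
    for i, row in enumerate(rows):
        if i > 0:
            prev = rows[i - 1]
            # A new id or a gap above max_time opens a new segment
            if row["id"] != prev["id"] or row["t"] - prev["t"] > max_time:
                segment += 1
        labeled.append({**row, "tid_time": segment})
    if drop_single_points:
        sizes = Counter(r["tid_time"] for r in labeled)
        labeled = [r for r in labeled if sizes[r["tid_time"]] > 1]
    return labeled


rows = [
    {"id": 1, "t": 0},
    {"id": 1, "t": 300},
    {"id": 1, "t": 2000},
    {"id": 2, "t": 2100},
    {"id": 2, "t": 2400},
]
segments = split_by_max_time(rows)
```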

pymove.preprocessing.stay_point_detection module

Stop point detection operations.

create_or_update_move_stop_by_dist_time, create_or_update_move_and_stop_by_radius

pymove.preprocessing.stay_point_detection.create_or_update_move_and_stop_by_radius(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', radius: float = 0, target_label: str = 'dist_to_prev', new_label: str = 'situation', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Finds the stops and moves points of the dataframe.

If these points already exist, they will be updated.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • radius (float, optional) – The radius value is used to determine if a segment is a stop. If the value of the point in target_label is greater than radius, the segment is a stop, otherwise it’s a move, by default 0
  • target_label (String, optional) – The feature used to calculate the stay points, by default DIST_TO_PREV
  • new_label (String, optional) – The name of the column that indicates if a point is a stop or a move, by default SITUATION
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with 2 additional features: segment_stop and new_label. segment_stop indicates the trajectory segment to which the point belongs; new_label indicates if the point represents a stop or a moving point.

Return type:

DataFrame

pymove.preprocessing.stay_point_detection.create_or_update_move_stop_by_dist_time(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', dist_radius: float = 30, time_radius: float = 900, label_id: str = 'id', new_label: str = 'segment_stop', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]

Determines the stops and moves points of the dataframe.

If these points already exist, they will be updated.

Parameters:
  • move_data (dataframe) – The input trajectory data
  • dist_radius (float, optional) – The first step in this function is segmenting the trajectory. The segments are used to find the stop points. The dist_radius defines the distance used in the segmentation, by default 30
  • time_radius (float, optional) – The time_radius is used to determine if a segment is a stop. If the user stayed in the segment for a time greater than time_radius, then the segment is a stop, by default 900
  • label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
  • new_label (str, optional) – The name of the column that indicates if a point is a stop or a move, by default SEGMENT_STOP
  • inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False
Returns:

DataFrame with 2 additional features: segment_stop and stop. segment_stop indicates the trajectory segment to which the point belongs; stop indicates if the point represents a stop.

Return type:

DataFrame
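
The two steps named in the parameter descriptions (segment by dist_radius, then flag segments that last longer than time_radius) can be sketched as follows. The example assumes a single trajectory of (t, x, y) tuples with planar coordinates in meters and time in seconds; label_stops is a hypothetical name, and the code is an illustration rather than pymove's implementation.

```python
from math import dist


def label_stops(track, dist_radius=30, time_radius=900):
    """Return (segment_stop, stop) lists for a list of (t, x, y) samples."""
    # Step 1: start a new segment whenever a step is longer than dist_radius.
    segment_stop = [0] if track else []
    for prev, cur in zip(track, track[1:]):
        jumped = dist(prev[1:], cur[1:]) > dist_radius
        segment_stop.append(segment_stop[-1] + int(jumped))

    # Step 2: a segment is a stop if the time spent in it exceeds time_radius.
    first_t, last_t = {}, {}
    for (t, _, _), seg in zip(track, segment_stop):
        first_t.setdefault(seg, t)
        last_t[seg] = t
    stop = [last_t[s] - first_t[s] > time_radius for s in segment_stop]
    return segment_stop, stop


track = [(0, 0, 0), (600, 5, 0), (1200, 8, 3), (1260, 500, 0), (1320, 1000, 0)]
segment_stop, stop = label_stops(track)
```

The first three samples stay within a few meters of each other for 1200 seconds, so their segment is flagged as a stop; the remaining samples each form a segment of zero duration.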

Module contents

Contains functions to preprocess the dataframes for manipulation.

compression, filters, segmentation, stay_point_detection