pymove.preprocessing package¶

Submodules¶

pymove.preprocessing.compression module¶

Compression operations.

compress_segment_stop_to_point

pymove.preprocessing.compression.compress_segment_stop_to_point(move_data: pandas.core.frame.DataFrame, label_segment: str = 'segment_stop', label_stop: str = 'stop', point_mean: str = 'default', drop_moves: bool = False, label_id: str = 'id', dist_radius: float = 30, time_radius: float = 900, inplace: bool = False) → pandas.core.frame.DataFrame[source]¶

Compress the trajectories using the stop points in the dataframe.

Compresses each segment to a single point by setting lat_mean and lon_mean for each segment.

Parameters:

move_data (dataframe) – The input trajectory data
label_segment (String, optional) – The label of the column containing the ids of the formed segments. This is the new split id, by default SEGMENT_STOP
label_stop (String, optional) – The name of the column that indicates if a point is a stop, by default STOP
point_mean (String, optional) – Indicates whether the mean points should be calculated using centroids or the point that repeats the most, by default ‘default’
drop_moves (Boolean, optional) – If set to true, the moving points will be dropped from the dataframe, by default False
label_id (String, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. Indicates the label of the id column in the user dataframe, by default TRAJ_ID
dist_radius (Double, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. The first step in this function is segmenting the trajectory, and the segments are used to find the stop points. dist_radius defines the distance used in the segmentation, by default 30
time_radius (Double, optional) – Used to create the stay points used in the compression. If the dataset already has the stop move, this parameter should be ignored. time_radius is used to determine if a segment is a stop: if the user stayed in the segment for a time greater than time_radius, then the segment is a stop, by default 900
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Data with 3 additional features: segment_stop, lat_mean and lon_mean, or None. segment_stop indicates the trajectory segment to which the point belongs. lat_mean and lon_mean depend on the point_mean option:

if the default option is used, lat_mean and lon_mean are defined by the point that repeats most within the segment. If the centroid option is used, lat_mean and lon_mean are defined by the centroid of all points in the segment.

Return type:

DataFrame
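To illustrate the centroid option described above, here is a minimal plain-Python sketch that attaches the mean latitude and longitude of each segment to its points. It does not call PyMove and is not its implementation; the helper name `centroid_per_segment` and the list-of-dicts layout are assumptions made for this example.

```python
from collections import defaultdict


def centroid_per_segment(points):
    """Attach lat_mean/lon_mean (the segment centroid) to every point.

    `points` is a list of dicts with 'segment_stop', 'lat' and 'lon' keys,
    a hypothetical stand-in for the dataframe columns.
    """
    groups = defaultdict(list)
    for p in points:
        groups[p['segment_stop']].append(p)

    result = []
    for p in points:
        members = groups[p['segment_stop']]
        lat_mean = sum(m['lat'] for m in members) / len(members)
        lon_mean = sum(m['lon'] for m in members) / len(members)
        # Build new dicts so the input stays untouched (like inplace=False).
        result.append({**p, 'lat_mean': lat_mean, 'lon_mean': lon_mean})
    return result


points = [
    {'segment_stop': 1, 'lat': 10.0, 'lon': 20.0},
    {'segment_stop': 1, 'lat': 12.0, 'lon': 22.0},
    {'segment_stop': 2, 'lat': 30.0, 'lon': 40.0},
]
compressed = centroid_per_segment(points)
```

The default option would instead pick the most frequent point of each segment; that variant is not shown here.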

pymove.preprocessing.filters module¶

Filtering operations.

get_bbox_by_radius, by_bbox, by_datetime, by_label, by_id, by_tid, clean_consecutive_duplicates, clean_gps_jumps_by_distance, clean_gps_nearby_points_by_distances, clean_gps_nearby_points_by_speed, clean_gps_speed_max_radius, clean_trajectories_with_few_points, clean_trajectories_short_and_few_points, clean_id_by_time_max

pymove.preprocessing.filters.by_bbox(move_data: DataFrame, bbox: tuple[float, float, float, float], filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]¶

Filters points of the trajectories according to specified bounding box.

Parameters:

move_data (dataframe) – The input trajectories data
bbox (tuple) – Tuple of 4 elements, containing the minimum and maximum values of latitude and longitude of the bounding box.
filter_out (boolean, optional) – If set to false the function will return the trajectories points within the bounding box, and the points outside otherwise, by default False
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Returns dataframe with trajectories points filtered by bounding box or None

Return type:

DataFrame
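The effect of filter_out can be sketched in plain Python as follows. This is an illustration only, not PyMove code: the helper `filter_by_bbox` is hypothetical, and the element order (lat_min, lon_min, lat_max, lon_max) is an assumption that should be checked against the library before use.

```python
def filter_by_bbox(points, bbox, filter_out=False):
    """Keep points inside `bbox`, or outside it when filter_out is True.

    Assumes bbox = (lat_min, lon_min, lat_max, lon_max) and inclusive bounds.
    """
    lat_min, lon_min, lat_max, lon_max = bbox

    def inside(p):
        return lat_min <= p['lat'] <= lat_max and lon_min <= p['lon'] <= lon_max

    # inside(p) != filter_out keeps inside points by default and
    # outside points when filter_out is True.
    return [p for p in points if inside(p) != filter_out]


points = [
    {'lat': 1.0, 'lon': 1.0},
    {'lat': 5.0, 'lon': 5.0},
    {'lat': 1.5, 'lon': 9.0},
]
bbox = (0.0, 0.0, 2.0, 2.0)
kept = filter_by_bbox(points, bbox)
dropped = filter_by_bbox(points, bbox, filter_out=True)
```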

pymove.preprocessing.filters.by_datetime(move_data: DataFrame, start_datetime: str | None = None, end_datetime: str | None = None, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]¶

Filters trajectories points according to specified time range.

Parameters:

move_data (dataframe) – The input trajectory data
start_datetime (str) – The start date and time (Datetime format) of the time range, by default None
end_datetime (str) – The end date and time (Datetime format) of the time range, by default None
filter_out (bool, optional) – If set to true, the function will return the points of the trajectories with timestamp outside the time range. The points within the time range will be returned if filter_out is False, by default False
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Returns dataframe with trajectories points filtered by time range or None

Return type:

DataFrame
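A time-range filter with optional bounds can be sketched as below. The helper `filter_by_datetime` is hypothetical, the bounds are assumed to be inclusive, and a missing bound is assumed to mean "unbounded"; PyMove's own behaviour may differ on these points.

```python
from datetime import datetime


def filter_by_datetime(points, start=None, end=None, filter_out=False):
    """Keep points whose 'datetime' lies in [start, end] (assumed inclusive)."""
    # A missing bound is treated as unbounded on that side (an assumption).
    start_dt = datetime.fromisoformat(start) if start else datetime.min
    end_dt = datetime.fromisoformat(end) if end else datetime.max

    def in_range(p):
        return start_dt <= p['datetime'] <= end_dt

    return [p for p in points if in_range(p) != filter_out]


points = [
    {'id': 1, 'datetime': datetime(2008, 10, 23, 5, 53, 5)},
    {'id': 1, 'datetime': datetime(2008, 10, 23, 10, 0, 0)},
    {'id': 1, 'datetime': datetime(2008, 10, 24, 1, 0, 0)},
]
inside = filter_by_datetime(points, '2008-10-23 06:00:00', '2008-10-23 23:59:59')
outside = filter_by_datetime(
    points, '2008-10-23 06:00:00', '2008-10-23 23:59:59', filter_out=True
)
```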

pymove.preprocessing.filters.by_id(move_data: DataFrame, id_: int | None = None, label_id: str = 'id', filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]¶

Filters trajectories points according to specified trajectory id.

Parameters:

move_data (dataframe) – The input trajectory data
id_ (int) – Specifies the number of the id used to filter the trajectories points
label_id (str, optional) – The label of the column which contains the id of the trajectories, by default TRAJ_ID
filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose id is different from the specified id. The points with the specified id will be returned if filter_out is False, by default False
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Returns dataframe with trajectories points filtered by id or None

Return type:

DataFrame

pymove.preprocessing.filters.by_label(move_data: DataFrame, value: Any, label_name: str, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]¶

Filters trajectories points according to specified value and column label.

Parameters:

move_data (dataframe) – The input trajectory data
value (Any) – Specifies the value used to filter the trajectories points
label_name (str) – Specifies the label of the column used in the filtering
filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose value in the label_name column is different from the specified value. The points with the specified value will be returned if filter_out is False, by default False
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Returns dataframe with trajectories points filtered by label or None

Return type:

DataFrame

pymove.preprocessing.filters.by_tid(move_data: DataFrame, tid_: str | None = None, filter_out: bool = False, inplace: bool = False) → DataFrame | None[source]¶

Filters trajectories points according to a specified trajectory tid.

Parameters:

move_data (dataframe) – The input trajectory data
tid_ (str) – Specifies the number of the tid used to filter the trajectories points
label_tid (str, optional) – The label of the column in the user dataframe which contains the tid of the trajectories, by default None
filter_out (bool, optional) – If set to true, the function will return the points of the trajectories whose tid is different from the specified tid. The points with the specified tid will be returned if filter_out is False, by default False
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

Returns a dataframe with trajectories points filtered or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_consecutive_duplicates(move_data: DataFrame, subset: int | str | None = None, keep: str | bool = 'first', inplace: bool = False) → DataFrame | None[source]¶

Removes consecutive duplicate rows of the Dataframe.

Optionally, only certain columns can be considered.

Parameters:

move_data (dataframe) – The input trajectory data
subset (Array of str, optional) – Specifies the column label or sequence of labels considered for identifying duplicates, by default None
keep ('first', 'last', optional) – If keep is set to first, all the duplicates except for the first occurrence will be dropped. If set to last, all duplicates except for the last occurrence will be dropped. If set to False, all duplicates are dropped, by default ‘first’
inplace (boolean, optional) – if set to true the original dataframe will be altered, the duplicates will be dropped in place, otherwise a copy will be returned, by default False

Returns:

The filtered trajectories points without consecutive duplicates or None

Return type:

DataFrame
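The three keep modes can be illustrated with a small plain-Python sketch. It groups rows into runs of consecutive duplicates and then keeps the first row, the last row, or none of each duplicated run. The helper `drop_consecutive_duplicates` is hypothetical and only mirrors the behaviour described above; it is not PyMove's implementation.

```python
def drop_consecutive_duplicates(rows, subset=None, keep='first'):
    """Drop consecutive duplicates; keep may be 'first', 'last' or False."""

    def key(row):
        # Compare only the subset columns, or every column if none is given.
        cols = subset if subset is not None else sorted(row)
        return tuple(row[c] for c in cols)

    # Group consecutive rows that share the same key into runs.
    runs = []
    for row in rows:
        if runs and key(runs[-1][-1]) == key(row):
            runs[-1].append(row)
        else:
            runs.append([row])

    result = []
    for run in runs:
        if len(run) == 1 or keep == 'first':
            result.append(run[0])
        elif keep == 'last':
            result.append(run[-1])
        # keep=False: every row of a duplicated run is dropped.
    return result


rows = [
    {'lat': 1, 'lon': 1, 't': 0},
    {'lat': 1, 'lon': 1, 't': 1},
    {'lat': 2, 'lon': 2, 't': 2},
    {'lat': 1, 'lon': 1, 't': 3},
]
first = drop_consecutive_duplicates(rows, subset=['lat', 'lon'])
last = drop_consecutive_duplicates(rows, subset=['lat', 'lon'], keep='last')
none = drop_consecutive_duplicates(rows, subset=['lat', 'lon'], keep=False)
```

Note that the row at t=3 survives in every mode: it repeats an earlier position, but not the position of the row immediately before it.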

pymove.preprocessing.filters.clean_gps_jumps_by_distance(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', jump_coefficient: float = 3.0, threshold: float = 1, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Removes the trajectories points that are outliers from the dataframe.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
jump_coefficient (float, optional) – by default 3
threshold (float, optional) – Minimum value that the distance features must have in order to be considered outliers, by default 1
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories without the gps jumps or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_gps_nearby_points_by_distances(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', radius_area: float = 10.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Removes points from the trajectories that are too close to the previous point.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
radius_area (float, optional) – Specifies the minimum distance a point must have to its previous point in order not to be dropped, by default 10
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories without the gps nearby points by distance or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_gps_nearby_points_by_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', speed_radius: float = 0.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Removes points from the trajectories whose speed of travel from the previous point is too low.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
speed_radius (float, optional) – Specifies the minimum speed a point must have from its previous point in order not to be dropped, by default 0
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories without the gps nearby points by speed or None

Return type:

DataFrame

pymove.preprocessing.filters.clean_gps_speed_max_radius(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', speed_max: float = 50.0, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Removes trajectories points with higher speed.

Given any point p of the trajectory, the point will be removed if either of the following happens: the travel speed from the point before p to p is greater than the maximum speed between adjacent points set by the user, or the travel speed between point p and the next point is greater than that value. When the cleaning is done, the function updates the time and distance features in the dataframe and calls itself again. The function finishes processing when it can no longer find points exceeding the speed limit.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
speed_max (float, optional) – Indicates the maximum value a point's speed_to_prev and speed_to_next should have in order not to be dropped, by default 50
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories without the points that exceed the maximum speed, or None

Return type:

DataFrame
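The repeat-until-clean behaviour described for clean_gps_speed_max_radius can be sketched as follows. This is a simplified illustration, not PyMove's implementation: positions are one-dimensional, timestamps are assumed to be strictly increasing, and the helper `clean_by_max_speed` is hypothetical.

```python
def clean_by_max_speed(points, speed_max):
    """Drop points next to a too-fast hop, repeating until none remain.

    `points` is a list of (time_s, position_m) tuples sorted by strictly
    increasing time; 1-D positions keep the sketch short.
    """
    points = list(points)
    while True:
        # Speed of each hop between adjacent points.
        speeds = [
            abs(x1 - x0) / (t1 - t0)
            for (t0, x0), (t1, x1) in zip(points, points[1:])
        ]
        # A point is flagged when the hop into it or out of it is too fast.
        flagged = set()
        for i, speed in enumerate(speeds):
            if speed > speed_max:
                flagged.update((i, i + 1))
        if not flagged:
            return points
        # Removing points changes the hops, so the speeds are recomputed.
        points = [p for i, p in enumerate(points) if i not in flagged]


track = [(0, 0.0), (10, 100.0), (20, 10000.0), (30, 300.0), (40, 400.0)]
cleaned = clean_by_max_speed(track, speed_max=50.0)
```

Under this rule the neighbours of an outlier are removed along with it, because both hops around the outlier exceed the limit.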

pymove.preprocessing.filters.clean_id_by_time_max(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', time_max: float = 3600, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Clears GPS points with time by ID greater than a user-defined limit.

Parameters:

move_data (dataframe) – The input data.
label_id (str, optional) – The label of the column which contains the id of the trajectories, by default TRAJ_ID
time_max (float, optional) – Indicates the maximum value time a set of points with the same id should have in order not to be dropped, by default 3600
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories with the maximum time.

Return type:

dataframe or None

pymove.preprocessing.filters.clean_trajectories_short_and_few_points(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'tid', min_trajectory_distance: float = 100, min_points_per_trajectory: int = 2, label_dtype: Callable = <class 'numpy.float64'>, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Eliminates from the given dataframe trajectories with fewer points and shorter length.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – The label of the column which contains the tid of the trajectories, by default TID
min_trajectory_distance (float, optional) – Specifies the minimum length a trajectory must have in order not to be dropped, by default 100
min_points_per_trajectory (int, optional) – Specifies the minimum number of points a trajectory must have in order not to be dropped, by default 2
label_dtype (type, optional) – Represents column id type, by default np.float64.
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories with the minimum gps points and distance or None

Return type:

DataFrame

Notes

remove_tids_with_few_points must be performed before updating features.

pymove.preprocessing.filters.clean_trajectories_with_few_points(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_tid: str = 'tid', min_points_per_trajectory: int = 2, inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Removes trajectories with too few points from the given dataframe.

Parameters:

move_data (dataframe) – The input trajectory data
label_tid (str, optional) – The label of the column which contains the tid of the trajectories, by default TID
min_points_per_trajectory (int, optional) – Specifies the minimum number of points a trajectory must have in order not to be dropped, by default 2
inplace (boolean, optional) – if set to true the operation is done in place, the original dataframe will be altered and None is returned, by default False

Returns:

The filtered trajectories, keeping only those with at least the minimum number of gps points, or None

Return type:

DataFrame

Raises:

`KeyError` – If the label feature is not in the dataframe
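The counting rule can be sketched in plain Python as below. The helper `drop_short_trajectories` is hypothetical and only illustrates the documented behaviour, including the KeyError raised when the tid label is missing.

```python
from collections import Counter


def drop_short_trajectories(points, label_tid='tid', min_points_per_trajectory=2):
    """Keep only points whose trajectory has enough points."""
    if any(label_tid not in p for p in points):
        raise KeyError(label_tid)
    # Number of points recorded for each trajectory id.
    counts = Counter(p[label_tid] for p in points)
    return [p for p in points if counts[p[label_tid]] >= min_points_per_trajectory]


points = [
    {'tid': 'a', 'lat': 1.0, 'lon': 1.0},
    {'tid': 'a', 'lat': 1.1, 'lon': 1.1},
    {'tid': 'b', 'lat': 5.0, 'lon': 5.0},
]
filtered = drop_short_trajectories(points)
```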

pymove.preprocessing.filters.get_bbox_by_radius(coordinates: tuple[float, float], radius: float = 1000) → tuple[float, float, float, float][source]¶

Defines minimum and maximum coordinates, given a distance radius from a point.

Parameters:

coordinates (tuple (lat, lon)) – The coordinates of the point
radius (float, optional) – The distance radius from the point, by default 1000

Returns:

coordinates min and max of the bbox

Return type:

array

References

https://mathmesquita.me/2017/01/16/filtrando-localizacao-em-um-raio.html
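One common way to derive such a bounding box is a small-radius spherical approximation, sketched below. It is not necessarily the formula PyMove uses; the helper `bbox_around`, the Earth radius constant, the radius unit (metres) and the output order (lat_min, lon_min, lat_max, lon_max) are all assumptions of this example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an assumption of this sketch


def bbox_around(lat, lon, radius_m=1000.0):
    """Return (lat_min, lon_min, lat_max, lon_max) in degrees."""
    # Angular distance covered by the radius along a meridian.
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    # Parallels shrink with cos(latitude), so the same distance spans
    # more degrees of longitude away from the equator.
    dlon = math.degrees(
        radius_m / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    )
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


lat_min, lon_min, lat_max, lon_max = bbox_around(-3.7, -38.5, 1000.0)
```

The approximation breaks down near the poles, where cos(latitude) approaches zero.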

pymove.preprocessing.segmentation module¶

Segmentation operations.

bbox_split, by_dist_time_speed, by_max_dist, by_max_time, by_max_speed

pymove.preprocessing.segmentation.bbox_split(bbox: tuple[int, int, int, int], number_grids: int) → DataFrame[source]¶

Splits the bounding box in N grids of the same size.

Parameters:

bbox (tuple) – Tuple of 4 elements, containing the minimum and maximum values of latitude and longitude of the bounding box.
number_grids (int) – Determines the number of grids to split the bounding box.

Returns:

Returns the latitude and longitude coordinates of the grids after the split.

Return type:

DataFrame
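As a rough illustration of splitting a bounding box into equal grids, the sketch below cuts the box into vertical strips of equal longitude width. The split direction, the element order (lat_min, lon_min, lat_max, lon_max) and the helper `split_bbox` are assumptions; this page does not state how PyMove lays out the grids.

```python
def split_bbox(bbox, number_grids):
    """Split bbox into `number_grids` strips of equal longitude width."""
    lat_min, lon_min, lat_max, lon_max = bbox
    step = (lon_max - lon_min) / number_grids
    return [
        (lat_min, lon_min + step * i, lat_max, lon_min + step * (i + 1))
        for i in range(number_grids)
    ]


grids = split_bbox((0.0, 0.0, 2.0, 4.0), 4)
```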

pymove.preprocessing.segmentation.by_dist_time_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_dist_between_adj_points: float = 3000, max_time_between_adj_points: float = 900, max_speed_between_adj_points: float = 50.0, drop_single_points: bool = True, label_new_tid: str = 'tid_part', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Splits the trajectories into segments based on distance, time and speed.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
max_dist_between_adj_points (float, optional) – Specifies the maximum distance a point should have from the previous point, in order not to be dropped, by default 3000
max_time_between_adj_points (float, optional) – Specifies the maximum travel time between two adjacent points, by default 900
max_speed_between_adj_points (float, optional) – Specifies the maximum speed of travel between two adjacent points, by default 50
drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
label_new_tid (str, optional) – The label of the column containing the ids of the formed segments. This is the new split id, by default TID_PART
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs

Return type:

DataFrame

Note

Time, distance and speed features must be updated after split.
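The splitting rule can be sketched for a single trajectory as follows: a new segment starts at any point whose distance, time or speed relative to the previous point exceeds its threshold. The helper `segment_ids` and the strict `>` comparison are assumptions of this sketch, and the drop_single_points step is left out.

```python
def segment_ids(hops, max_dist=3000, max_time=900, max_speed=50.0):
    """Assign a segment id to each point of one trajectory.

    `hops` holds one (dist_to_prev, time_to_prev, speed_to_prev) tuple per
    point; the first point has no predecessor, so its entry is None.
    """
    ids = []
    current = 0
    for hop in hops:
        if hop is not None:
            dist, time, speed = hop
            # Exceeding any threshold starts a new segment at this point.
            if dist > max_dist or time > max_time or speed > max_speed:
                current += 1
        ids.append(current)
    return ids


hops = [
    None,
    (100, 60, 1.7),
    (5000, 60, 83.3),   # too far and too fast: new segment
    (200, 60, 3.3),
    (300, 1200, 0.25),  # too long a gap: new segment
]
tid_part = segment_ids(hops)
```

Because each segment's first point no longer follows its original predecessor, the per-point features have to be recomputed afterwards, which is what the note above refers to.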

pymove.preprocessing.segmentation.by_max_dist(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_dist_between_adj_points: float = 3000, drop_single_points: bool = True, label_new_tid: str = 'tid_dist', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Segments the trajectories based on distance.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
max_dist_between_adj_points (float, optional) – Specifies the maximum distance between two adjacent points, by default 3000
drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
label_new_tid (str, optional) – The label of the column containing the ids of the formed segments. This is the new split id, by default TID_DIST
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.

pymove.preprocessing.segmentation.by_max_speed(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_speed_between_adj_points: float = 50.0, drop_single_points: bool = True, label_new_tid: str = 'tid_speed', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Splits the trajectories into segments based on a maximum speed.

Parameters:

move_data (dataframe) – The input trajectory data.
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
max_speed_between_adj_points (float, optional) – Specifies the maximum speed between two adjacent points, by default 50
drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
label_new_tid (str, optional) – The label of the column containing the ids of the formed segments. This is the new split id, by default TID_SPEED
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.

pymove.preprocessing.segmentation.by_max_time(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', label_id: str = 'id', max_time_between_adj_points: float = 900.0, drop_single_points: bool = True, label_new_tid: str = 'tid_time', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Splits the trajectories into segments based on a maximum time.

Parameters:

move_data (dataframe) – The input trajectory data
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
max_time_between_adj_points (float, optional) – Specifies the maximum time between two adjacent points, by default 900
drop_single_points (boolean, optional) – If set to True, drops the trajectories with only one point, by default True
label_new_tid (str, optional) – The label of the column containing the ids of the formed segments. This is the new split id, by default TID_TIME
inplace (boolean, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with the additional feature label_new_tid, which indicates the trajectory segment to which the point belongs.

Return type:

DataFrame

Note

Speed features must be updated after split.

pymove.preprocessing.stay_point_detection module¶

Stop point detection operations.

create_or_update_move_stop_by_dist_time, create_or_update_move_and_stop_by_radius

pymove.preprocessing.stay_point_detection.create_or_update_move_and_stop_by_radius(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', radius: float = 0, target_label: str = 'dist_to_prev', new_label: str = 'situation', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Finds the stop and move points of the dataframe.

If these points already exist, they will be updated.

Parameters:

move_data (dataframe) – The input trajectory data
radius (float, optional) – The radius value is used to determine if a segment is a stop. If the value of the point in target_label is greater than radius, the segment is a stop, otherwise it’s a move, by default 0
target_label (String, optional) – The feature used to calculate the stay points, by default DIST_TO_PREV
new_label (String, optional) – The name of the column that indicates if a point is a stop or a move, by default SITUATION
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with 2 additional features: segment_stop and new_label. segment_stop indicates the trajectory segment to which the point belongs. new_label indicates if the point represents a stop or a moving point.

Return type:

DataFrame

pymove.preprocessing.stay_point_detection.create_or_update_move_stop_by_dist_time(move_data: 'PandasMoveDataFrame' | 'DaskMoveDataFrame', dist_radius: float = 30, time_radius: float = 900, label_id: str = 'id', new_label: str = 'segment_stop', inplace: bool = False) → 'PandasMoveDataFrame' | 'DaskMoveDataFrame' | None[source]¶

Determines the stop and move points of the dataframe.

If these points already exist, they will be updated.

Parameters:

move_data (dataframe) – The input trajectory data
dist_radius (float, optional) – The first step in this function is segmenting the trajectory, and the segments are used to find the stop points. dist_radius defines the distance used in the segmentation, by default 30
time_radius (float, optional) – time_radius is used to determine if a segment is a stop: if the user stayed in the segment for a time greater than time_radius, then the segment is a stop, by default 900
label_id (str, optional) – Indicates the label of the id column in the user dataframe, by default TRAJ_ID
new_label (str, optional) – The name of the column that indicates if a point is a stop or a move, by default SEGMENT_STOP
inplace (bool, optional) – if set to true the original dataframe will be altered to contain the result of the filtering, otherwise a copy will be returned, by default False

Returns:

DataFrame with 2 additional features: segment_stop and stop. segment_stop indicates the trajectory segment to which the point belongs. stop indicates if the point represents a stop.

Return type:

DataFrame
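The time_radius rule can be sketched as below, assuming the distance-based segmentation has already assigned a segment_stop id to every point. The helper `mark_stops` is hypothetical, times are plain seconds, and the stay time is taken as the span between the first and last point of the segment, which may differ from how PyMove accumulates it.

```python
def mark_stops(points, time_radius=900):
    """Flag every point of a segment whose stay time exceeds time_radius.

    `points` is a time-sorted list of dicts with a 'segment_stop' id and a
    'time' value in seconds.
    """
    first_seen = {}
    last_seen = {}
    for p in points:
        seg = p['segment_stop']
        first_seen.setdefault(seg, p['time'])
        last_seen[seg] = p['time']

    def is_stop(seg):
        return (last_seen[seg] - first_seen[seg]) > time_radius

    return [{**p, 'stop': is_stop(p['segment_stop'])} for p in points]


points = [
    {'segment_stop': 1, 'time': 0},
    {'segment_stop': 1, 'time': 600},
    {'segment_stop': 1, 'time': 1200},
    {'segment_stop': 2, 'time': 1300},
    {'segment_stop': 2, 'time': 1400},
]
marked = mark_stops(points)
```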

Module contents¶

Contains functions to preprocess the dataframes for manipulation.

compression, filters, segmentation, stay_point_detection