02 - Exploring Preprocessing¶
Data preprocessing is a set of activities performed to prepare data for future analysis and data mining activities.
Load data from file¶
The dataset used in this tutorial is GeoLife GPS Trajectories, available at https://www.microsoft.com/en-us/download/details.aspx?id=52367
from pymove import read_csv
df_move = read_csv('geolife_sample.csv')
df_move.show_trajectories_info()
df_move.head()
====================== INFORMATION ABOUT DATASET ======================
Number of Points: 217653
Number of IDs objects: 2
Start Date:2008-10-23 05:53:05 End Date:2009-03-19 05:46:37
Bounding Box:(22.147577, 113.548843, 41.132062, 121.156224)
=======================================================================
| | lat | lon | datetime | id |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
Filtering¶
The filters module provides functions to perform different types of data filtering.
Importing the module:
from pymove import filters
A bounding box (usually shortened to bbox) is an area defined by two latitudes and two longitudes, given here in the order (lat_min, lon_min, lat_max, lon_max). The by_bbox function keeps the trajectory points that fall inside a chosen bounding box.
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
filt_df = filters.by_bbox(df_move, bbox)
filt_df.head()
| | lat | lon | datetime | id |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
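As a rough illustration of the idea (plain Python, not PyMove's implementation), a bounding-box filter keeps the points whose coordinates fall inside both ranges. The tuple order follows the bbox used above; the helper name `in_bbox` and the inclusive bounds are assumptions made for this sketch.

```python
def in_bbox(lat, lon, bbox):
    """Return True when (lat, lon) lies inside bbox (bounds assumed inclusive)."""
    lat_min, lon_min, lat_max, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


bbox = (22.147577, 113.548843, 41.132062, 121.156224)
points = [
    (39.984094, 116.319236),  # first row of the sample, inside the box
    (10.0, 116.319236),       # hypothetical point south of the box
]
inside = [p for p in points if in_bbox(p[0], p[1], bbox)]
print(inside)
```

Because the chosen bbox equals the bounding box of the whole sample, the filtered result above is identical to the input.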
The by_datetime function filters trajectory points by time, keeping those that fall between the start_datetime and end_datetime parameters.
filters.by_datetime(df_move, start_datetime="2009-03-19 05:45:37", end_datetime="2009-03-19 05:46:17")
| | lat | lon | datetime | id |
---|---|---|---|---|
217643 | 40.000205 | 116.327173 | 2009-03-19 05:45:37 | 5 |
217644 | 40.000128 | 116.327171 | 2009-03-19 05:45:42 | 5 |
217645 | 40.000069 | 116.327179 | 2009-03-19 05:45:47 | 5 |
217646 | 40.000001 | 116.327219 | 2009-03-19 05:45:52 | 5 |
217647 | 39.999919 | 116.327211 | 2009-03-19 05:45:57 | 5 |
217648 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 5 |
217649 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 5 |
217650 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 5 |
217651 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 5 |
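The output above includes both the start and the end timestamp, which suggests an inclusive range. A minimal plain-Python sketch of such a filter follows; the helper `filter_by_datetime` and the row dictionaries are hypothetical and only mirror the parameter names used above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def filter_by_datetime(rows, start_datetime, end_datetime):
    """Keep rows whose 'datetime' lies in [start_datetime, end_datetime]."""
    start = datetime.strptime(start_datetime, FMT)
    end = datetime.strptime(end_datetime, FMT)
    return [
        row for row in rows
        if start <= datetime.strptime(row["datetime"], FMT) <= end
    ]


rows = [
    {"datetime": "2009-03-19 05:45:32", "id": 5},  # hypothetical, before the window
    {"datetime": "2009-03-19 05:45:37", "id": 5},
    {"datetime": "2009-03-19 05:46:17", "id": 5},
    {"datetime": "2009-03-19 05:46:37", "id": 5},  # after the window
]
kept = filter_by_datetime(rows, "2009-03-19 05:45:37", "2009-03-19 05:46:17")
print([row["datetime"] for row in kept])
```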
The by_label function filters trajectory points according to a specified value and column label, set by value and label_name respectively.
filters.by_label(df_move, value=116.327219, label_name="lon").head()
| | lat | lon | datetime | id |
---|---|---|---|---|
3066 | 39.979160 | 116.327219 | 2008-10-24 06:34:27 | 1 |
13911 | 39.975424 | 116.327219 | 2008-10-26 08:18:06 | 1 |
16396 | 39.980411 | 116.327219 | 2008-10-27 00:30:47 | 1 |
33935 | 39.975832 | 116.327219 | 2008-11-05 11:04:04 | 1 |
41636 | 39.976990 | 116.327219 | 2008-11-07 10:34:41 | 1 |
The by_id function filters trajectory points according to the selected trajectory id.
filters.by_id(df_move, id_=5).head()
| | lat | lon | datetime | id |
---|---|---|---|---|
108607 | 40.004155 | 116.321337 | 2008-10-24 04:12:30 | 5 |
108608 | 40.003834 | 116.321462 | 2008-10-24 04:12:35 | 5 |
108609 | 40.003783 | 116.321431 | 2008-10-24 04:12:40 | 5 |
108610 | 40.003690 | 116.321429 | 2008-10-24 04:12:45 | 5 |
108611 | 40.003589 | 116.321427 | 2008-10-24 04:12:50 | 5 |
A tid is the result of concatenating the id of a trajectory with its date and hour; for example, id 1 at 2008-10-23 05:53:05 gives the tid 12008102305. The by_tid function filters trajectory points according to the tid specified by the tid_ parameter.
df_move.generate_tid_based_on_id_datetime()
filters.by_tid(df_move, "12008102305").head()
| | lat | lon | datetime | id | tid |
---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 |
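The tid values in the table can be reproduced by joining the id with the timestamp truncated to the hour. The helper `make_tid` below is hypothetical and only illustrates that format as inferred from the output; it is not PyMove's code.

```python
from datetime import datetime


def make_tid(id_, dt):
    """Concatenate the object id with the timestamp formatted as YYYYMMDDHH."""
    return f"{id_}{dt:%Y%m%d%H}"


print(make_tid(1, datetime(2008, 10, 23, 5, 53, 5)))   # 12008102305
print(make_tid(5, datetime(2009, 3, 19, 5, 46, 37)))   # 52009031905
```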
The clean_consecutive_duplicates function removes consecutive duplicate rows from the DataFrame. Optionally, only certain columns can be considered; these are set with the subset parameter. In this example, only the lat column is considered.
filtered_df = filters.clean_consecutive_duplicates(df_move, subset=["lat"])
len(filtered_df)
196142
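To make "consecutive" concrete, here is a small plain-Python sketch that drops a row only when its subset values repeat those of the row immediately before it. The function name, the sample rows, and the choice to keep the first row of each run are assumptions for illustration.

```python
def drop_consecutive_duplicates(rows, subset):
    """Drop a row when its `subset` values equal those of the previous row."""
    kept = []
    previous_key = object()  # sentinel that never equals a real key
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key != previous_key:
            kept.append(row)
        previous_key = key
    return kept


# Hypothetical rows: the second repeats the lat of the first.
rows = [
    {"lat": 1.0, "lon": 1.0},
    {"lat": 1.0, "lon": 2.0},
    {"lat": 2.0, "lon": 2.0},
    {"lat": 1.0, "lon": 3.0},  # same lat as row 0, but not consecutive
]
result = drop_consecutive_duplicates(rows, subset=["lat"])
print([row["lon"] for row in result])  # [1.0, 2.0, 3.0]
```

Note that the last row survives even though its lat value appeared earlier, because the repeat is not adjacent.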
The clean_gps_jumps_by_distance function removes from the DataFrame the trajectory points that are outliers.
filters.clean_gps_jumps_by_distance(df_move)
| | id | lat | lon | datetime | tid | dist_to_prev | dist_to_next | dist_prev_to_next |
---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | 13.690153 | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 7.403788 | 20.223428 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 1.821083 | 5.888579 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 2.889671 | 1.873356 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 66.555997 | 68.727260 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.291709 | 12.214590 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 6.241949 | 10.400206 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 8.462920 | 14.628012 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 4.713399 | 6.713456 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | NaN | NaN |
217270 rows × 8 columns
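The dist_to_prev, dist_to_next and dist_prev_to_next columns in the output hold distances between a point and its neighbours. The sketch below assumes they are haversine distances in metres with a mean Earth radius of 6371 km; this is consistent with the values shown but is not stated in the text.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed for this sketch)


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Rows 0 and 1 of the sample; the table above lists about 13.69 for this step.
d = haversine(39.984094, 116.319236, 39.984198, 116.319322)
print(round(d, 2))
```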
The clean_gps_nearby_points_by_distances function removes a point from a trajectory when the distance between it and the point before it is smaller than the radius_area parameter.
filters.clean_gps_nearby_points_by_distances(df_move, radius_area=10)
| | id | lat | lon | datetime | tid | dist_to_prev | dist_to_next | dist_prev_to_next |
---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | 13.690153 | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 7.403788 | 20.223428 |
5 | 1 | 39.984710 | 116.319865 | 2008-10-23 05:53:23 | 12008102305 | 66.555997 | 6.162987 | 60.622358 |
14 | 1 | 39.984959 | 116.319969 | 2008-10-23 05:54:03 | 12008102305 | 40.672170 | 11.324767 | 51.291054 |
15 | 1 | 39.985036 | 116.320056 | 2008-10-23 05:54:04 | 12008102305 | 11.324767 | 32.842422 | 24.923216 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
217563 | 5 | 40.001185 | 116.321791 | 2009-03-19 05:39:02 | 52009031905 | 11.604029 | 6.915583 | 17.245027 |
217637 | 5 | 40.000759 | 116.327088 | 2009-03-19 05:45:07 | 52009031905 | 28.946922 | 18.331999 | 47.148573 |
217638 | 5 | 40.000595 | 116.327066 | 2009-03-19 05:45:12 | 52009031905 | 18.331999 | 9.926875 | 27.905967 |
217641 | 5 | 40.000368 | 116.327072 | 2009-03-19 05:45:27 | 52009031905 | 10.877438 | 8.887992 | 19.705708 |
217643 | 5 | 40.000205 | 116.327173 | 2009-03-19 05:45:37 | 52009031905 | 11.406650 | 8.563704 | 19.107146 |
79969 rows × 8 columns
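The rule can be checked against the first rows of the sample: with radius_area=10, rows 2, 3 and 4 have a dist_to_prev below 10 and disappear, while rows 0, 1 and 5 remain. The sketch below applies the threshold once to precomputed distances; whether the library recomputes distances after each removal is not covered here, and the helper name is hypothetical.

```python
def keep_distant_points(dist_to_prev, radius_area):
    """Return the positions to keep: the first point (no predecessor) and
    every point farther than radius_area from the point before it."""
    return [
        i for i, dist in enumerate(dist_to_prev)
        if dist is None or dist > radius_area
    ]


# dist_to_prev for rows 0..5 of the sample (None marks the first point).
dist_to_prev = [None, 13.690153, 7.403788, 1.821083, 2.889671, 66.555997]
kept = keep_distant_points(dist_to_prev, radius_area=10)
print(kept)  # [0, 1, 5]
```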
The clean_gps_nearby_points_by_speed function removes a point from a trajectory when the speed of travel between it and the point before it is smaller than the value set by the speed_radius parameter.
filters.clean_gps_nearby_points_by_speed(df_move, speed_radius=40.0)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev |
---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN |
149 | 1 | 39.977648 | 116.326925 | 2008-10-23 10:33:00 | 12008102310 | 1470.641291 | 7.0 | 210.091613 |
560 | 1 | 40.009802 | 116.313247 | 2008-10-23 10:56:54 | 12008102310 | 47.020950 | 1.0 | 47.020950 |
561 | 1 | 40.009262 | 116.312948 | 2008-10-23 10:56:55 | 12008102310 | 65.222058 | 1.0 | 65.222058 |
1369 | 1 | 39.990659 | 116.326345 | 2008-10-24 00:04:29 | 12008102400 | 40.942759 | 1.0 | 40.942759 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
216382 | 5 | 40.000185 | 116.327286 | 2009-02-28 03:52:45 | 52009022803 | 333.656648 | 5.0 | 66.731330 |
217458 | 5 | 39.999918 | 116.320057 | 2009-03-19 04:36:02 | 52009031904 | 556.947064 | 5.0 | 111.389413 |
217459 | 5 | 39.999077 | 116.317156 | 2009-03-19 04:36:07 | 52009031904 | 264.212540 | 5.0 | 52.842508 |
217463 | 5 | 40.001122 | 116.320879 | 2009-03-19 04:40:52 | 52009031904 | 267.350055 | 5.0 | 53.470011 |
217476 | 5 | 40.005903 | 116.318669 | 2009-03-19 04:49:47 | 52009031904 | 436.405009 | 5.0 | 87.281002 |
281 rows × 8 columns
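In the output, speed_to_prev equals dist_to_prev divided by time_to_prev (for example 1470.641291 / 7.0 for row 149). A short sketch of that computation and of the threshold check follows; the units (metres and seconds) and the helper name are assumptions.

```python
def speed_to_prev(dist_to_prev, time_to_prev):
    """Speed over one step: distance divided by elapsed time."""
    return dist_to_prev / time_to_prev


# (dist_to_prev, time_to_prev) for rows 149 and 560 of the output above,
# plus row 2 of the original sample, which moves slowly.
steps = [(1470.641291, 7.0), (47.020950, 1.0), (7.403788, 5.0)]
speeds = [speed_to_prev(dist, time) for dist, time in steps]

# Only steps at or above the threshold survive the filter.
fast = [speed for speed in speeds if speed >= 40.0]
print(len(fast))  # 2
```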
The clean_gps_speed_max_radius function recursively removes trajectory points whose speed is higher than the value set by the user.
filters.clean_gps_speed_max_radius(df_move)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev |
---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 |
217304 rows × 8 columns
The clean_trajectories_with_few_points function removes from the given DataFrame the trajectories that have fewer points than specified by the min_points_per_trajectory parameter.
filters.clean_trajectories_with_few_points(df_move)
| | lat | lon | datetime | id | tid |
---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 |
... | ... | ... | ... | ... | ... |
217648 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 5 | 52009031905 |
217649 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 5 | 52009031905 |
217650 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 5 | 52009031905 |
217651 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 5 | 52009031905 |
217652 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 5 | 52009031905 |
217649 rows × 5 columns
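The idea can be sketched in plain Python by counting points per trajectory and keeping only the trajectories that reach the minimum. Grouping by the tid column, the helper name, and the sample rows are assumptions made for this illustration.

```python
from collections import Counter


def drop_short_trajectories(rows, min_points_per_trajectory, label="tid"):
    """Keep rows whose trajectory has at least min_points_per_trajectory points."""
    counts = Counter(row[label] for row in rows)
    return [
        row for row in rows
        if counts[row[label]] >= min_points_per_trajectory
    ]


# Hypothetical data: one trajectory with three points, one with a single point.
rows = [
    {"tid": "12008102305"},
    {"tid": "12008102305"},
    {"tid": "12008102305"},
    {"tid": "12008102310"},
]
kept = drop_short_trajectories(rows, min_points_per_trajectory=2)
print(len(kept))  # 3
```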
Segmentation¶
The segmentation module is used to segment trajectories based on different parameters.
Importing the module:
from pymove import segmentation
The bbox_split function splits the bounding box into grids of the same size. The number of grids is defined by the number_grids parameter.
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
segmentation.bbox_split(bbox, number_grids=4)
| | lat_min | lon_min | lat_max | lon_max |
---|---|---|---|---|
0 | 22.147577 | 113.548843 | 41.132062 | 115.450688 |
1 | 22.147577 | 115.450688 | 41.132062 | 117.352533 |
2 | 22.147577 | 117.352533 | 41.132062 | 119.254379 |
3 | 22.147577 | 119.254379 | 41.132062 | 121.156224 |
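In the output above, every grid keeps the full latitude range and the longitude range is cut into equal strips. A sketch that follows the same pattern is shown below; the helper name is hypothetical and the longitude-only split is inferred from this one output, not from the library's documentation.

```python
def split_bbox_by_longitude(bbox, number_grids):
    """Cut a (lat_min, lon_min, lat_max, lon_max) box into equal longitude strips."""
    lat_min, lon_min, lat_max, lon_max = bbox
    step = (lon_max - lon_min) / number_grids
    return [
        (lat_min, lon_min + i * step, lat_max, lon_min + (i + 1) * step)
        for i in range(number_grids)
    ]


bbox = (22.147577, 113.548843, 41.132062, 121.156224)
grids = split_bbox_by_longitude(bbox, number_grids=4)
for grid in grids:
    print(tuple(round(value, 6) for value in grid))
```

Each strip is about 1.9018 degrees of longitude wide, and the upper bound of one strip is the lower bound of the next.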
The by_dist_time_speed function segments the trajectories into clusters based on distance, time and speed.
segmentation.by_dist_time_speed(
df_move,
max_dist_between_adj_points=5000,
max_time_between_adj_points=800,
max_speed_between_adj_points=60.0
)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev | tid_part |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN | 1 |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 | 1 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 | 1 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 | 1 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 | 515 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 | 515 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 | 515 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 | 515 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 | 515 |
217653 rows × 9 columns
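One way to picture this kind of segmentation is a running segment counter that increases whenever the step to the previous point exceeds any of the three thresholds. The sketch below handles a single object and numbers segments from 1, as the tid_part column does; the exact rule the library applies is an assumption here, and the helper name is hypothetical.

```python
def segment_ids(steps, max_dist, max_time, max_speed):
    """Assign a segment number to each point.

    steps holds (dist_to_prev, time_to_prev, speed_to_prev) per point,
    with None for the first point, which has no predecessor.
    """
    ids = []
    current = 1
    for step in steps:
        if step is not None:
            dist, time, speed = step
            # Any threshold exceeded starts a new segment at this point.
            if dist > max_dist or time > max_time or speed > max_speed:
                current += 1
        ids.append(current)
    return ids


steps = [
    None,
    (13.690153, 1.0, 13.690153),
    (7.403788, 5.0, 1.480758),
    (1470.641291, 7.0, 210.091613),  # speed above 60 starts a new segment
    (20.0, 900.0, 0.022222),         # hypothetical long pause starts another
    (4.713399, 20.0, 0.235670),
]
ids = segment_ids(steps, max_dist=5000, max_time=800, max_speed=60.0)
print(ids)  # [1, 1, 1, 2, 3, 3]
```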
The by_max_dist function segments the trajectories into clusters based on distance.
segmentation.by_max_dist(df_move, max_dist_between_adj_points=4000)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev | tid_dist |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN | 1 |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 | 1 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 | 1 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 | 1 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 | 20 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 | 20 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 | 20 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 | 20 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 | 20 |
217653 rows × 9 columns
The by_max_time function segments the trajectories into clusters based on time.
segmentation.by_max_time(df_move, max_time_between_adj_points=1000)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev | tid_time |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN | 1 |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 | 1 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 | 1 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 | 1 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 | 353 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 | 353 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 | 353 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 | 353 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 | 353 |
217653 rows × 9 columns
The by_max_speed function segments the trajectories into clusters based on speed.
segmentation.by_max_speed(df_move, max_speed_between_adj_points=70.0)
| | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev | tid_speed |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN | 1 |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 | 1 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 | 1 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 | 1 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 | 86 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 | 86 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 | 86 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 | 86 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 | 86 |
217653 rows × 9 columns
Stay point detection¶
A stay point is a location where a moving object has stayed for a while within a certain distance threshold. A stay point could stand for different places, such as a restaurant, a school or a workplace.
Importing the module:
from pymove import stay_point_detection
The create_or_update_move_stop_by_dist_time function creates or updates the stay points of the trajectories, based on distance and time metrics.
stay_point_detection.create_or_update_move_stop_by_dist_time(df_move, dist_radius=40, time_radius=1000)
| | segment_stop | id | lat | lon | datetime | tid | dist_to_prev | time_to_prev | speed_to_prev | stop |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | NaN | NaN | False |
1 | 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 1.0 | 13.690153 | False |
2 | 1 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 5.0 | 1.480758 | False |
3 | 1 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 5.0 | 0.364217 | False |
4 | 1 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 5.0 | 0.577934 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 3512 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.0 | 1.439771 | False |
217649 | 3512 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 5.0 | 1.058342 | False |
217650 | 3512 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 5.0 | 1.248390 | False |
217651 | 3512 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 5.0 | 1.692584 | False |
217652 | 3512 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | 20.0 | 0.235670 | False |
217653 rows × 10 columns
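A simplified way to think about distance-and-time stay point detection: group consecutive points whose step distance stays within dist_radius, then mark a group as a stop when the time spent inside it exceeds time_radius. The sketch below follows that reading; it is an assumption about the general technique, not a description of PyMove's internals, and the helper name and sample values are hypothetical.

```python
def mark_stops(steps, dist_radius, time_radius):
    """Return one boolean per point, True when the point belongs to a stop.

    steps holds (dist_to_prev, time_to_prev) per point; the first point
    has (None, None) because it has no predecessor.
    """
    # 1. A step longer than dist_radius starts a new segment.
    segments = []
    for i, (dist, _time) in enumerate(steps):
        if not segments or dist is None or dist > dist_radius:
            segments.append([i])
        else:
            segments[-1].append(i)

    # 2. A segment is a stop when the time spent inside it exceeds time_radius.
    stop = [False] * len(steps)
    for segment in segments:
        duration = sum(steps[i][1] for i in segment[1:])
        if duration > time_radius:
            for i in segment:
                stop[i] = True
    return stop


# Hypothetical steps: 1200 s spent within a few metres, then a 500 m move.
steps = [(None, None), (5.0, 600.0), (3.0, 600.0), (500.0, 30.0), (8.0, 5.0)]
flags = mark_stops(steps, dist_radius=40, time_radius=1000)
print(flags)  # [True, True, True, False, False]
```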
The create_or_update_move_and_stop_by_radius function creates or updates the stay points of the trajectories, based on distance.
stay_point_detection.create_or_update_move_and_stop_by_radius(df_move, radius=2)
| | id | lat | lon | datetime | tid | dist_to_prev | dist_to_next | dist_prev_to_next | situation |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 12008102305 | NaN | 13.690153 | NaN | nan |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 12008102305 | 13.690153 | 7.403788 | 20.223428 | move |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 12008102305 | 7.403788 | 1.821083 | 5.888579 | move |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 12008102305 | 1.821083 | 2.889671 | 1.873356 | stop |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 12008102305 | 2.889671 | 66.555997 | 68.727260 | move |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 52009031905 | 7.198855 | 5.291709 | 12.214590 | move |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 52009031905 | 5.291709 | 6.241949 | 10.400206 | move |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 52009031905 | 6.241949 | 8.462920 | 14.628012 | move |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 52009031905 | 8.462920 | 4.713399 | 6.713456 | move |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 52009031905 | 4.713399 | NaN | NaN | move |
217653 rows × 9 columns
Compression¶
Importing the module:
from pymove import compression
The function below reduces the size of the trajectory, using the stop points to perform the compression.
df_compressed = compression.compress_segment_stop_to_point(df_move)
len(df_move), len(df_compressed)
(217653, 65620)
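To illustrate why compression shrinks the data, the sketch below collapses each run of stop points that share a segment_stop value into a single point, while moving points pass through unchanged. Using the mean position as the representative point is an assumption made for this example, as are the helper name and the sample values.

```python
from itertools import groupby
from statistics import mean


def compress_stops(points):
    """Replace each consecutive stop segment with one averaged point."""
    compressed = []
    for _, group in groupby(points, key=lambda p: p["segment_stop"]):
        group = list(group)
        if group[0]["stop"]:
            # Collapse the whole stop segment into its mean position.
            compressed.append({
                "lat": mean(p["lat"] for p in group),
                "lon": mean(p["lon"] for p in group),
                "segment_stop": group[0]["segment_stop"],
                "stop": True,
            })
        else:
            compressed.extend(group)
    return compressed


# Hypothetical points: a three-point stop followed by two moving points.
points = [
    {"lat": 40.0000, "lon": 116.3200, "segment_stop": 1, "stop": True},
    {"lat": 40.0002, "lon": 116.3202, "segment_stop": 1, "stop": True},
    {"lat": 40.0004, "lon": 116.3204, "segment_stop": 1, "stop": True},
    {"lat": 40.0100, "lon": 116.3300, "segment_stop": 2, "stop": False},
    {"lat": 40.0200, "lon": 116.3400, "segment_stop": 2, "stop": False},
]
compressed = compress_stops(points)
print(len(points), len(compressed))  # 5 3
```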