02 - Exploring Preprocessing

Data preprocessing is a set of activities performed to prepare data for future analysis and data mining activities.

Load data from file

The dataset used in this tutorial is GeoLife GPS Trajectories, available at https://www.microsoft.com/en-us/download/details.aspx?id=52367

from pymove import read_csv
df_move = read_csv('geolife_sample.csv')
df_move.show_trajectories_info()
df_move.head()
====================== INFORMATION ABOUT DATASET ======================

Number of Points: 217653

Number of IDs objects: 2

Start Date:2008-10-23 05:53:05     End Date:2009-03-19 05:46:37

Bounding Box:(22.147577, 113.548843, 41.132062, 121.156224)


=======================================================================
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1

Filtering

The filters module provides functions to perform different types of data filtering.

Importing the module:

from pymove import filters

A bounding box (usually shortened to bbox) is an area defined by two longitudes and two latitudes. In this tutorial it is written as the tuple (lat_min, lon_min, lat_max, lon_max). The by_bbox function filters trajectory points according to a chosen bounding box.

bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
filt_df = filters.by_bbox(df_move, bbox)
filt_df.head()
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1
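
To make the idea of a bounding-box filter concrete, here is a minimal plain-Python sketch. It is not PyMove's implementation: the helper names in_bbox and filter_by_bbox are hypothetical, and treating the box edges as inclusive is an assumption.

```python
# Bounding box in the order used in this tutorial:
# (lat_min, lon_min, lat_max, lon_max).
def in_bbox(lat, lon, bbox):
    """Return True when (lat, lon) lies inside bbox (edges assumed inclusive)."""
    lat_min, lon_min, lat_max, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def filter_by_bbox(points, bbox):
    """Keep only the (lat, lon) pairs that fall inside bbox."""
    return [p for p in points if in_bbox(p[0], p[1], bbox)]


bbox = (22.147577, 113.548843, 41.132062, 121.156224)
points = [
    (39.984094, 116.319236),  # first row of the sample, inside the box
    (48.856600, 2.352200),    # hypothetical point far outside the box
]
inside = filter_by_bbox(points, bbox)
print(inside)
```

Because the bbox used in the tutorial is the bounding box of the whole sample, filtering with it keeps every point, which is why the output above starts with the same rows as the original data.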

The by_datetime function filters trajectory points by time, keeping the points recorded between start_datetime and end_datetime.

filters.by_datetime(
    df_move,
    start_datetime="2009-03-19 05:45:37",
    end_datetime="2009-03-19 05:46:17"
)
lat lon datetime id
217643 40.000205 116.327173 2009-03-19 05:45:37 5
217644 40.000128 116.327171 2009-03-19 05:45:42 5
217645 40.000069 116.327179 2009-03-19 05:45:47 5
217646 40.000001 116.327219 2009-03-19 05:45:52 5
217647 39.999919 116.327211 2009-03-19 05:45:57 5
217648 39.999896 116.327290 2009-03-19 05:46:02 5
217649 39.999899 116.327352 2009-03-19 05:46:07 5
217650 39.999945 116.327394 2009-03-19 05:46:12 5
217651 40.000015 116.327433 2009-03-19 05:46:17 5
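
The output above includes both the start and the end timestamp, so the window appears to be inclusive on both sides. A minimal sketch of such a filter, using only the standard library, could look as follows. The function filter_by_datetime and the row layout are hypothetical, not PyMove's code.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def filter_by_datetime(rows, start_datetime, end_datetime):
    """Keep rows whose 'datetime' lies in [start_datetime, end_datetime]."""
    start = datetime.strptime(start_datetime, FMT)
    end = datetime.strptime(end_datetime, FMT)
    return [
        r for r in rows
        if start <= datetime.strptime(r["datetime"], FMT) <= end
    ]


rows = [
    {"datetime": "2009-03-19 05:45:32"},  # hypothetical point before the window
    {"datetime": "2009-03-19 05:45:37"},  # equal to the start bound
    {"datetime": "2009-03-19 05:46:17"},  # equal to the end bound
    {"datetime": "2009-03-19 05:46:37"},  # after the window
]
kept = filter_by_datetime(rows, "2009-03-19 05:45:37", "2009-03-19 05:46:17")
print([r["datetime"] for r in kept])
```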

The by_label function filters trajectory points by the value of a column: value sets the value to match and label_name sets the column label.

filters.by_label(df_move, value=116.327219, label_name="lon").head()
lat lon datetime id
3066 39.979160 116.327219 2008-10-24 06:34:27 1
13911 39.975424 116.327219 2008-10-26 08:18:06 1
16396 39.980411 116.327219 2008-10-27 00:30:47 1
33935 39.975832 116.327219 2008-11-05 11:04:04 1
41636 39.976990 116.327219 2008-11-07 10:34:41 1

The by_id function filters trajectory points according to the selected trajectory id.

filters.by_id(df_move, id_=5).head()
lat lon datetime id
108607 40.004155 116.321337 2008-10-24 04:12:30 5
108608 40.003834 116.321462 2008-10-24 04:12:35 5
108609 40.003783 116.321431 2008-10-24 04:12:40 5
108610 40.003690 116.321429 2008-10-24 04:12:45 5
108611 40.003589 116.321427 2008-10-24 04:12:50 5

A tid is the concatenation of the id and the date of a trajectory. In the output below, the tid 12008102305 combines the id 1 with the date and hour 2008-10-23 05. The by_tid function filters trajectory points according to the tid specified by the tid_ parameter.

df_move.generate_tid_based_on_id_datetime()
filters.by_tid(df_move, "12008102305").head()
lat lon datetime id tid
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305
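
The tid values shown in this tutorial are consistent with the id followed by the timestamp formatted as year, month, day and hour. The sketch below reproduces that pattern. The helper make_tid is hypothetical and only mirrors the values seen in the output, it is not taken from PyMove.

```python
from datetime import datetime


def make_tid(id_, dt):
    """Concatenate the id with the date and hour of the point (YYYYMMDDHH)."""
    return f"{id_}{dt.strftime('%Y%m%d%H')}"


tid_first = make_tid(1, datetime(2008, 10, 23, 5, 53, 5))
tid_last = make_tid(5, datetime(2009, 3, 19, 5, 46, 37))
print(tid_first, tid_last)
```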

The clean_consecutive_duplicates function removes consecutive duplicate rows from the DataFrame. Optionally, only certain columns are considered when comparing rows; these are set by the subset parameter. In this example only the lat column is considered.

filtered_df = filters.clean_consecutive_duplicates(df_move, subset=["lat"])
len(filtered_df)
196142
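
A consecutive duplicate is a row that repeats the row directly before it, which is different from dropping every repeated value in the column. The following sketch shows the distinction on hypothetical data. Keeping the first row of each run is an assumption, and the function here is a plain-Python stand-in rather than PyMove's implementation.

```python
def drop_consecutive_duplicates(rows, subset):
    """Drop a row when its subset columns equal those of the previous row."""
    kept = []
    prev_key = None
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key != prev_key:
            kept.append(row)
        prev_key = key
    return kept


# Hypothetical rows: the second row repeats the lat of the first one.
rows = [
    {"lat": 39.9840, "lon": 116.3192},
    {"lat": 39.9840, "lon": 116.3193},  # consecutive duplicate on lat, dropped
    {"lat": 39.9842, "lon": 116.3194},
    {"lat": 39.9840, "lon": 116.3195},  # same lat as row 0 but not consecutive, kept
]
cleaned = drop_consecutive_duplicates(rows, subset=["lat"])
print(len(rows), len(cleaned))
```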

The clean_gps_jumps_by_distance function removes from the DataFrame the trajectory points that are outliers, judged by their distances to the neighboring points.

filters.clean_gps_jumps_by_distance(df_move)
id lat lon datetime tid dist_to_prev dist_to_next dist_prev_to_next
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN 13.690153 NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 7.403788 20.223428
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 1.821083 5.888579
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 2.889671 1.873356
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 66.555997 68.727260
... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.291709 12.214590
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 6.241949 10.400206
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 8.462920 14.628012
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 4.713399 6.713456
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 NaN NaN

217270 rows × 8 columns
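
The dist_to_prev values in the output above look consistent with great-circle distances in meters between consecutive points. The sketch below computes such a distance with the haversine formula. The Earth radius of 6,371 km is an assumed constant, and this is an illustration of the distance feature, not a statement of how PyMove computes it.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed constant)


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Rows 0 and 1 of the sample; the table reports dist_to_prev = 13.690153 for row 1.
d = haversine(39.984094, 116.319236, 39.984198, 116.319322)
print(round(d, 2))
```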

The clean_gps_nearby_points_by_distances function removes a point from a trajectory when its distance to the previous point is smaller than the radius_area parameter.

filters.clean_gps_nearby_points_by_distances(df_move, radius_area=10)
id lat lon datetime tid dist_to_prev dist_to_next dist_prev_to_next
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN 13.690153 NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 7.403788 20.223428
5 1 39.984710 116.319865 2008-10-23 05:53:23 12008102305 66.555997 6.162987 60.622358
14 1 39.984959 116.319969 2008-10-23 05:54:03 12008102305 40.672170 11.324767 51.291054
15 1 39.985036 116.320056 2008-10-23 05:54:04 12008102305 11.324767 32.842422 24.923216
... ... ... ... ... ... ... ... ...
217563 5 40.001185 116.321791 2009-03-19 05:39:02 52009031905 11.604029 6.915583 17.245027
217637 5 40.000759 116.327088 2009-03-19 05:45:07 52009031905 28.946922 18.331999 47.148573
217638 5 40.000595 116.327066 2009-03-19 05:45:12 52009031905 18.331999 9.926875 27.905967
217641 5 40.000368 116.327072 2009-03-19 05:45:27 52009031905 10.877438 8.887992 19.705708
217643 5 40.000205 116.327173 2009-03-19 05:45:37 52009031905 11.406650 8.563704 19.107146

79969 rows × 8 columns
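
The rule can be checked against the first rows of the sample. The sketch below applies it to the dist_to_prev values of rows 0 to 5 and keeps the same rows as the output above (0, 1 and 5). The function name is hypothetical, and keeping rows with a missing distance or a distance exactly equal to the radius is an assumption.

```python
from math import isnan, nan


def drop_nearby_points(rows, radius_area):
    """Keep rows whose dist_to_prev is missing or at least radius_area."""
    return [
        r for r in rows
        if isnan(r["dist_to_prev"]) or r["dist_to_prev"] >= radius_area
    ]


# index and dist_to_prev of the first six rows of the sample
rows = [
    {"index": 0, "dist_to_prev": nan},
    {"index": 1, "dist_to_prev": 13.690153},
    {"index": 2, "dist_to_prev": 7.403788},
    {"index": 3, "dist_to_prev": 1.821083},
    {"index": 4, "dist_to_prev": 2.889671},
    {"index": 5, "dist_to_prev": 66.555997},
]
kept_indexes = [r["index"] for r in drop_nearby_points(rows, radius_area=10)]
print(kept_indexes)
```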

The clean_gps_nearby_points_by_speed function removes a point from a trajectory when the speed of travel between it and the previous point is smaller than the value set by the speed_radius parameter.

filters.clean_gps_nearby_points_by_speed(df_move, speed_radius=40.0)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN
149 1 39.977648 116.326925 2008-10-23 10:33:00 12008102310 1470.641291 7.0 210.091613
560 1 40.009802 116.313247 2008-10-23 10:56:54 12008102310 47.020950 1.0 47.020950
561 1 40.009262 116.312948 2008-10-23 10:56:55 12008102310 65.222058 1.0 65.222058
1369 1 39.990659 116.326345 2008-10-24 00:04:29 12008102400 40.942759 1.0 40.942759
... ... ... ... ... ... ... ... ...
216382 5 40.000185 116.327286 2009-02-28 03:52:45 52009022803 333.656648 5.0 66.731330
217458 5 39.999918 116.320057 2009-03-19 04:36:02 52009031904 556.947064 5.0 111.389413
217459 5 39.999077 116.317156 2009-03-19 04:36:07 52009031904 264.212540 5.0 52.842508
217463 5 40.001122 116.320879 2009-03-19 04:40:52 52009031904 267.350055 5.0 53.470011
217476 5 40.005903 116.318669 2009-03-19 04:49:47 52009031904 436.405009 5.0 87.281002

281 rows × 8 columns
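
In the tables of this tutorial, speed_to_prev equals dist_to_prev divided by time_to_prev. The sketch below recomputes the speed for a few rows taken from the outputs and applies the threshold. The function names are hypothetical, and the handling of the first point of a trajectory (which has no previous point) is left out for brevity.

```python
def speed_to_prev(dist_to_prev, time_to_prev):
    """Speed between a point and the previous one (distance over time)."""
    return dist_to_prev / time_to_prev


def drop_slow_points(rows, speed_radius):
    """Keep rows whose speed to the previous point is at least speed_radius."""
    return [
        r for r in rows
        if speed_to_prev(r["dist_to_prev"], r["time_to_prev"]) >= speed_radius
    ]


# (index, dist_to_prev, time_to_prev) values copied from the tables above
rows = [
    {"index": 1, "dist_to_prev": 13.690153, "time_to_prev": 1.0},
    {"index": 2, "dist_to_prev": 7.403788, "time_to_prev": 5.0},
    {"index": 149, "dist_to_prev": 1470.641291, "time_to_prev": 7.0},
    {"index": 560, "dist_to_prev": 47.020950, "time_to_prev": 1.0},
]
fast_indexes = [r["index"] for r in drop_slow_points(rows, speed_radius=40.0)]
print(fast_indexes)
```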

The clean_gps_speed_max_radius function recursively removes trajectory points whose speed is higher than a value set by the user. Here it is called with its default parameters.

filters.clean_gps_speed_max_radius(df_move)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934
... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670

217304 rows × 8 columns

The clean_trajectories_with_few_points function removes from the given DataFrame the trajectories that have fewer points than the number specified by the min_points_per_trajectory parameter.

filters.clean_trajectories_with_few_points(df_move)
lat lon datetime id tid
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305
... ... ... ... ... ...
217648 39.999896 116.327290 2009-03-19 05:46:02 5 52009031905
217649 39.999899 116.327352 2009-03-19 05:46:07 5 52009031905
217650 39.999945 116.327394 2009-03-19 05:46:12 5 52009031905
217651 40.000015 116.327433 2009-03-19 05:46:17 5 52009031905
217652 39.999978 116.327460 2009-03-19 05:46:37 5 52009031905

217649 rows × 5 columns
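
The idea is to count the points of each trajectory and drop the trajectories that fall below the threshold. A minimal sketch on hypothetical data follows. Grouping by the tid column and the threshold value of 2 are assumptions made for the illustration, not PyMove's defaults.

```python
from collections import Counter


def drop_short_trajectories(rows, min_points_per_trajectory):
    """Drop every row that belongs to a tid with too few points."""
    counts = Counter(r["tid"] for r in rows)
    return [r for r in rows if counts[r["tid"]] >= min_points_per_trajectory]


# Hypothetical tids: "A" has three points, "B" has only one.
rows = [
    {"tid": "A", "lat": 39.9840},
    {"tid": "A", "lat": 39.9841},
    {"tid": "B", "lat": 40.0001},
    {"tid": "A", "lat": 39.9842},
]
remaining = drop_short_trajectories(rows, min_points_per_trajectory=2)
print(len(rows), len(remaining))
```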

Segmentation

The segmentation module is used to segment trajectories based on different parameters.

Importing the module:

from pymove import segmentation

The bbox_split function splits the bounding box into grids of the same size. The number of grids is set by the number_grids parameter.

bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
segmentation.bbox_split(bbox, number_grids=4)
lat_min lon_min lat_max lon_max
0 22.147577 113.548843 41.132062 115.450688
1 22.147577 115.450688 41.132062 117.352533
2 22.147577 117.352533 41.132062 119.254379
3 22.147577 119.254379 41.132062 121.156224
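
In the output above the latitude range stays the same in every grid and only the longitude range is divided, so the split produces vertical strips of equal width. The sketch below reproduces that behavior. The function split_bbox is a hypothetical stand-in written from the output, not PyMove's code.

```python
def split_bbox(bbox, number_grids):
    """Split a (lat_min, lon_min, lat_max, lon_max) box into equal longitude strips."""
    lat_min, lon_min, lat_max, lon_max = bbox
    step = (lon_max - lon_min) / number_grids
    return [
        (lat_min, lon_min + i * step, lat_max, lon_min + (i + 1) * step)
        for i in range(number_grids)
    ]


bbox = (22.147577, 113.548843, 41.132062, 121.156224)
grids = split_bbox(bbox, number_grids=4)
for grid in grids:
    print(grid)
```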

The by_dist_time_speed function segments the trajectories into clusters based on distance, time and speed.

segmentation.by_dist_time_speed(
    df_move,
    max_dist_between_adj_points=5000,
    max_time_between_adj_points=800,
    max_speed_between_adj_points=60.0
)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev tid_part
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN 1
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153 1
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758 1
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217 1
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934 1
... ... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771 515
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342 515
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390 515
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584 515
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670 515

217653 rows × 9 columns
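
One common way to implement this kind of segmentation is to walk through the points in order and start a new segment whenever the gap to the previous point exceeds any of the thresholds. The sketch below follows that approach on hypothetical data. It is an assumption about the general technique, not a description of PyMove's internals, and the function name is made up for the example.

```python
def assign_segments(rows, max_dist, max_time, max_speed):
    """Return one segment number per row, starting a new one on each large gap."""
    segment = 1
    labels = []
    for i, r in enumerate(rows):
        exceeds = (
            r["dist_to_prev"] > max_dist
            or r["time_to_prev"] > max_time
            or r["speed_to_prev"] > max_speed
        )
        if i > 0 and exceeds:
            segment += 1
        labels.append(segment)
    return labels


# Hypothetical gaps to the previous point (meters, seconds, meters per second).
rows = [
    {"dist_to_prev": 0.0, "time_to_prev": 0.0, "speed_to_prev": 0.0},
    {"dist_to_prev": 13.7, "time_to_prev": 1.0, "speed_to_prev": 13.7},
    {"dist_to_prev": 20.0, "time_to_prev": 900.0, "speed_to_prev": 0.02},  # time gap
    {"dist_to_prev": 7.4, "time_to_prev": 5.0, "speed_to_prev": 1.48},
    {"dist_to_prev": 6000.0, "time_to_prev": 120.0, "speed_to_prev": 50.0},  # distance gap
]
labels = assign_segments(rows, max_dist=5000, max_time=800, max_speed=60.0)
print(labels)
```

Segmenting by a single criterion is the special case in which only one of the three thresholds is checked.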

The by_max_dist function segments the trajectories into clusters based on distance.

segmentation.by_max_dist(df_move, max_dist_between_adj_points=4000)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev tid_dist
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN 1
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153 1
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758 1
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217 1
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934 1
... ... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771 20
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342 20
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390 20
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584 20
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670 20

217653 rows × 9 columns

The by_max_time function segments the trajectories into clusters based on time.

segmentation.by_max_time(df_move, max_time_between_adj_points=1000)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev tid_time
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN 1
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153 1
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758 1
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217 1
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934 1
... ... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771 353
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342 353
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390 353
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584 353
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670 353

217653 rows × 9 columns

The by_max_speed function segments the trajectories into clusters based on speed.

segmentation.by_max_speed(df_move, max_speed_between_adj_points=70.0)
id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev tid_speed
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN 1
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153 1
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758 1
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217 1
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934 1
... ... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771 86
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342 86
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390 86
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584 86
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670 86

217653 rows × 9 columns

Stay point detection

A stay point is a location where a moving object has stayed for a while within a certain distance threshold. A stay point can stand for different places, such as a restaurant, a school or a workplace.

Importing the module:

from pymove import stay_point_detection

The create_or_update_move_stop_by_dist_time function creates or updates the stay points of the trajectories, based on distance and time metrics.

stay_point_detection.create_or_update_move_stop_by_dist_time(df_move, dist_radius=40, time_radius=1000)
segment_stop id lat lon datetime tid dist_to_prev time_to_prev speed_to_prev stop
0 1 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN NaN NaN False
1 1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 1.0 13.690153 False
2 1 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 5.0 1.480758 False
3 1 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 5.0 0.364217 False
4 1 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 5.0 0.577934 False
... ... ... ... ... ... ... ... ... ... ...
217648 3512 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.0 1.439771 False
217649 3512 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 5.0 1.058342 False
217650 3512 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 5.0 1.248390 False
217651 3512 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 5.0 1.692584 False
217652 3512 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 20.0 0.235670 False

217653 rows × 10 columns
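
One way to combine the two metrics is to first group the points into segments in which consecutive points stay close together, and then mark a segment as a stop when the time spent in it exceeds the time threshold. The sketch below shows only the second step on hypothetical data, with the segment labels already assigned. This is an assumed reading of the approach, and the function name and values are invented for the illustration.

```python
from collections import defaultdict


def flag_stops(rows, time_radius):
    """Mark every row of a segment as a stop when the segment lasts longer than time_radius."""
    total_time = defaultdict(float)
    for r in rows:
        total_time[r["segment_stop"]] += r["time_to_prev"]
    return [total_time[r["segment_stop"]] > time_radius for r in rows]


# Hypothetical segments: segment 1 lasts 15 s, segment 2 lasts 1200 s.
rows = [
    {"segment_stop": 1, "time_to_prev": 5.0},
    {"segment_stop": 1, "time_to_prev": 10.0},
    {"segment_stop": 2, "time_to_prev": 600.0},
    {"segment_stop": 2, "time_to_prev": 600.0},
]
stops = flag_stops(rows, time_radius=1000)
print(stops)
```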

The create_or_update_move_and_stop_by_radius function creates or updates the stay points of the trajectories, based on distance.

stay_point_detection.create_or_update_move_and_stop_by_radius(df_move, radius=2)
id lat lon datetime tid dist_to_prev dist_to_next dist_prev_to_next situation
0 1 39.984094 116.319236 2008-10-23 05:53:05 12008102305 NaN 13.690153 NaN nan
1 1 39.984198 116.319322 2008-10-23 05:53:06 12008102305 13.690153 7.403788 20.223428 move
2 1 39.984224 116.319402 2008-10-23 05:53:11 12008102305 7.403788 1.821083 5.888579 move
3 1 39.984211 116.319389 2008-10-23 05:53:16 12008102305 1.821083 2.889671 1.873356 stop
4 1 39.984217 116.319422 2008-10-23 05:53:21 12008102305 2.889671 66.555997 68.727260 move
... ... ... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 52009031905 7.198855 5.291709 12.214590 move
217649 5 39.999899 116.327352 2009-03-19 05:46:07 52009031905 5.291709 6.241949 10.400206 move
217650 5 39.999945 116.327394 2009-03-19 05:46:12 52009031905 6.241949 8.462920 14.628012 move
217651 5 40.000015 116.327433 2009-03-19 05:46:17 52009031905 8.462920 4.713399 6.713456 move
217652 5 39.999978 116.327460 2009-03-19 05:46:37 52009031905 4.713399 NaN NaN move

217653 rows × 9 columns
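
The situation column in the output above matches a simple rule: a point is labelled stop when its dist_to_prev is smaller than the radius, and move otherwise. The sketch below applies that rule to the first five rows of the sample. The helper label_situation is hypothetical, and the comparison being strict is an assumption.

```python
from math import isnan, nan


def label_situation(dist_to_prev, radius):
    """Label a point as 'stop' or 'move' from its distance to the previous point."""
    if isnan(dist_to_prev):
        return "nan"  # mirrors the label shown for the first point in the output
    return "stop" if dist_to_prev < radius else "move"


# dist_to_prev of rows 0 to 4 of the sample
dists = [nan, 13.690153, 7.403788, 1.821083, 2.889671]
situations = [label_situation(d, radius=2) for d in dists]
print(situations)
```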

Compression

Importing the module:

from pymove import compression

The function below reduces the size of the trajectory by using the stop points to perform the compression.

df_compressed = compression.compress_segment_stop_to_point(df_move)
len(df_move), len(df_compressed)
(217653, 65620)
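
A simple way to picture this kind of compression is to replace each run of consecutive stop points with a single representative point while leaving the moving points untouched. The sketch below uses the centroid of the run as the representative. Both the choice of the centroid and the function name are assumptions made for the illustration, and the data is hypothetical.

```python
from itertools import groupby


def compress_stops(rows):
    """Collapse each run of consecutive stop points into one centroid point."""
    compressed = []
    for is_stop, run in groupby(rows, key=lambda r: r["stop"]):
        run = list(run)
        if is_stop:
            lat = sum(r["lat"] for r in run) / len(run)
            lon = sum(r["lon"] for r in run) / len(run)
            compressed.append({"lat": lat, "lon": lon, "stop": True})
        else:
            compressed.extend(run)
    return compressed


# Hypothetical trajectory: one moving point, three stop points, one moving point.
rows = [
    {"lat": 39.9840, "lon": 116.3190, "stop": False},
    {"lat": 39.9850, "lon": 116.3200, "stop": True},
    {"lat": 39.9852, "lon": 116.3202, "stop": True},
    {"lat": 39.9854, "lon": 116.3204, "stop": True},
    {"lat": 39.9900, "lon": 116.3250, "stop": False},
]
compressed = compress_stops(rows)
print(len(rows), len(compressed))
```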