01 - Exploring MoveDataFrame

To work with Pymove you need to import the data into our data structure: MoveDataFrame!

MoveDataFrame is an abstraction that instantiates a new data structure that manipulates the structure the user wants. This is done using the Factory Method design pattern. This structure allows the interface to be implemented using different representations and libraries that manipulate the data.

We have an interface that delimits the scope that new implementing classes should have. We currently have two concrete classes that implement this interface: PandasMoveDataFrame and DaskMoveDataFrame (under construction), which use Pandas and Dask respectively for data manipulation.

It works like this: The user instantiating a MoveDataFrame provides a flag telling which library they want to use for manipulating this data.

Now that we understand the concept and data structure of PyMove, hands on!


MoveDataFrame

A MoveDataFrame must contain the columns: - lat: represents the latitude of the point. - lon: represents the longitude of the point. - datetime: represents the date and time of the point.

In addition, the user can enter several other columns as trajectory id. If the id is not entered, the points are supposed to belong to the same path.


Creating a MoveDataFrame

A MoveDataFrame can be created by passing a Pandas DataFrame, a list, dict or even reading a file. Look:

import pymove as pm
from pymove import MoveDataFrame

From a list

list_data = [
    [39.984094, 116.319236, '2008-10-23 05:53:05', 1],
    [39.984198, 116.319322, '2008-10-23 05:53:06', 1],
    [39.984224, 116.319402, '2008-10-23 05:53:11', 1],
    [39.984224, 116.319402, '2008-10-23 05:53:11', 1],
    [39.984224, 116.319402, '2008-10-23 05:53:11', 1],
    [39.984224, 116.319402, '2008-10-23 05:53:11', 1]
]
move_df = MoveDataFrame(data=list_data, latitude="lat", longitude="lon", datetime="datetime", traj_id="id")
move_df.head()
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984224 116.319402 2008-10-23 05:53:11 1
4 39.984224 116.319402 2008-10-23 05:53:11 1

From a dict

dict_data = {
    'lat': [39.984198, 39.984224, 39.984094],
    'lon': [116.319402, 116.319322, 116.319402],
    'datetime': ['2008-10-23 05:53:11', '2008-10-23 05:53:06', '2008-10-23 05:53:06']
}

move_df = MoveDataFrame(data=dict_data, latitude="lat", longitude="lon", datetime="datetime", traj_id="id")
move_df.head()
lat lon datetime
0 39.984198 116.319402 2008-10-23 05:53:11
1 39.984224 116.319322 2008-10-23 05:53:06
2 39.984094 116.319402 2008-10-23 05:53:06

From a DataFrame Pandas

import pandas as pd

df = pd.read_csv('geolife_sample.csv', parse_dates=['datetime'])
move_df = MoveDataFrame(data=df, latitude="lat", longitude="lon", datetime="datetime")

move_df.head()
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1

From a file

move_df = pm.read_csv('geolife_sample.csv')
move_df.head()
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1

Cool, huh? The default flag is Pandas. Look that:

type(move_df)
pymove.core.pandas.PandasMoveDataFrame

Let’s try creating one with Dask!

move_df = pm.read_csv('geolife_sample.csv', type_='dask')
move_df.head()
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1
type(move_df)
pymove.core.dask.DaskMoveDataFrame

What’s in MoveDataFrame?

The MoveDataFrame stores the following information:

orig_df = pm.read_csv('geolife_sample.csv')
move_df = orig_df.copy()

1. The kind of data he was instantiated

move_df.get_type()
'pandas'
move_df.columns
Index(['lat', 'lon', 'datetime', 'id'], dtype='object')
move_df.dtypes
lat                float64
lon                float64
datetime    datetime64[ns]
id                   int64
dtype: object

In addition to these attributes, we have some functions that allow us to:

1. View trajectory information

move_df.show_trajectories_info()
====================== INFORMATION ABOUT DATASET ======================

Number of Points: 217653

Number of IDs objects: 2

Start Date:2008-10-23 05:53:05     End Date:2009-03-19 05:46:37

Bounding Box:(22.147577, 113.548843, 41.132062, 121.156224)


=======================================================================

2. View the number of users

move_df.get_users_number()
1

3. Transform our data to

a. Numpy

move_df.to_numpy()
array([[39.984094, 116.319236, Timestamp('2008-10-23 05:53:05'), 1],
       [39.984198, 116.319322, Timestamp('2008-10-23 05:53:06'), 1],
       [39.984224, 116.319402, Timestamp('2008-10-23 05:53:11'), 1],
       ...,
       [39.999945, 116.327394, Timestamp('2009-03-19 05:46:12'), 5],
       [40.000015, 116.327433, Timestamp('2009-03-19 05:46:17'), 5],
       [39.999978, 116.32746, Timestamp('2009-03-19 05:46:37'), 5]],
      dtype=object)

b. Dicts

dict_data = move_df.to_dict()
dict_data.keys()
dict_keys(['lat', 'lon', 'datetime', 'id'])

c. DataFrames

df = move_df.to_data_frame()
print(type(move_df))
print(type(df))
df
<class 'pymove.core.pandas.PandasMoveDataFrame'>
<class 'pandas.core.frame.DataFrame'>
lat lon datetime id
0 39.984094 116.319236 2008-10-23 05:53:05 1
1 39.984198 116.319322 2008-10-23 05:53:06 1
2 39.984224 116.319402 2008-10-23 05:53:11 1
3 39.984211 116.319389 2008-10-23 05:53:16 1
4 39.984217 116.319422 2008-10-23 05:53:21 1
... ... ... ... ...
217648 39.999896 116.327290 2009-03-19 05:46:02 5
217649 39.999899 116.327352 2009-03-19 05:46:07 5
217650 39.999945 116.327394 2009-03-19 05:46:12 5
217651 40.000015 116.327433 2009-03-19 05:46:17 5
217652 39.999978 116.327460 2009-03-19 05:46:37 5

217653 rows × 4 columns

4. And even switch from a Pandas to Dask and back again!

new_move = move_df.convert_to('dask')
print(type(new_move))
move_df = new_move.convert_to('pandas')
print(type(move_df))
<class 'pymove.core.dask.DaskMoveDataFrame'>
<class 'pymove.core.pandas.PandasMoveDataFrame'>

5. You can also write files with

move_df.write_file('move_df_write_file.txt')
move_df.to_csv('move_data.csv')

6. Create a virtual grid

move_df.to_grid(8)
lon_min_x: 113.548843
lat_min_y: 22.147577
grid_size_lat_y: 263013
grid_size_lon_x: 105394
cell_size_by_degree: 7.218082747158498e-05

7. View the information of the last MoveDataFrame operation: operation name, operation time and memory use

move_df.last_operation
{'name': 'to_grid', 'time in seconds': 0.010914802551269531, 'memory': '0.0 B'}

8. Get data bound box

move_df.get_bbox()
(22.147577, 113.548843, 41.132062, 121.156224)

9. Create new columns:

a. tid: trajectory id based on Id and datetime

move_df.generate_tid_based_on_id_datetime()
move_df.head()
lat lon datetime id tid
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305

b. date: extract date on datetime

move_df.generate_date_features()
move_df.head()
lat lon datetime id tid date
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305 2008-10-23
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305 2008-10-23
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305 2008-10-23
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305 2008-10-23
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305 2008-10-23

c. hour: extract hour on datetime

move_df.generate_hour_features()
move_df.head()
lat lon datetime id tid date hour
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305 2008-10-23 5
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305 2008-10-23 5
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305 2008-10-23 5
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305 2008-10-23 5
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305 2008-10-23 5

d. day: day of the week from datatime.

move_df.generate_day_of_the_week_features()
move_df.head()
lat lon datetime id tid date hour day
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305 2008-10-23 5 Thursday
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305 2008-10-23 5 Thursday
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305 2008-10-23 5 Thursday
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305 2008-10-23 5 Thursday
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305 2008-10-23 5 Thursday

e. period: time of day or period from datatime.

move_df.generate_time_of_day_features()
move_df.head()
lat lon datetime id tid date hour day period
0 39.984094 116.319236 2008-10-23 05:53:05 1 12008102305 2008-10-23 5 Thursday Early morning
1 39.984198 116.319322 2008-10-23 05:53:06 1 12008102305 2008-10-23 5 Thursday Early morning
2 39.984224 116.319402 2008-10-23 05:53:11 1 12008102305 2008-10-23 5 Thursday Early morning
3 39.984211 116.319389 2008-10-23 05:53:16 1 12008102305 2008-10-23 5 Thursday Early morning
4 39.984217 116.319422 2008-10-23 05:53:21 1 12008102305 2008-10-23 5 Thursday Early morning

f. dist_to_prev, time_to_prev, speed_to_prev: create features of distance, time and speed to an GPS point P (lat, lon).

move_df = orig_df.copy()
move_df.generate_dist_time_speed_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime dist_to_prev time_to_prev speed_to_prev
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN NaN NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 13.690153 1.0 13.690153
2 1 39.984224 116.319402 2008-10-23 05:53:11 7.403788 5.0 1.480758
3 1 39.984211 116.319389 2008-10-23 05:53:16 1.821083 5.0 0.364217
4 1 39.984217 116.319422 2008-10-23 05:53:21 2.889671 5.0 0.577934

g. dist_to_prev, dist_to_next, dist_prev_to_next : three distance in meters to an GPS point P (lat, lon).

move_df = orig_df.copy()
move_df.generate_dist_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime dist_to_prev dist_to_next dist_prev_to_next
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN 13.690153 NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 13.690153 7.403788 20.223428
2 1 39.984224 116.319402 2008-10-23 05:53:11 7.403788 1.821083 5.888579
3 1 39.984211 116.319389 2008-10-23 05:53:16 1.821083 2.889671 1.873356
4 1 39.984217 116.319422 2008-10-23 05:53:21 2.889671 66.555997 68.727260

h. time_to_prev, time_to_next, time_prev_to_next : three time in seconds to an GPS point P (lat, lon).

move_df = orig_df.copy()
move_df.generate_time_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime time_to_prev time_to_next time_prev_to_next
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN 1.0 NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 1.0 5.0 6.0
2 1 39.984224 116.319402 2008-10-23 05:53:11 5.0 5.0 10.0
3 1 39.984211 116.319389 2008-10-23 05:53:16 5.0 5.0 10.0
4 1 39.984217 116.319422 2008-10-23 05:53:21 5.0 2.0 7.0

i. speed_to_prev, speed_to_next, speed_prev_to_next : three speed in meters by seconds to an GPS point P (lat, lon).

move_df = orig_df.copy()
move_df.generate_speed_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime speed_to_prev speed_to_next speed_prev_to_next
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN 13.690153 NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 13.690153 1.480758 3.515657
2 1 39.984224 116.319402 2008-10-23 05:53:11 1.480758 0.364217 0.922487
3 1 39.984211 116.319389 2008-10-23 05:53:16 0.364217 0.577934 0.471075
4 1 39.984217 116.319422 2008-10-23 05:53:21 0.577934 33.277998 9.920810

j. dist_to_prev, time_to_prev, speed_to_prev : distance, time and speed from previous ponint

move_df = orig_df.copy()
move_df.generate_dist_time_speed_features(inplace=False)
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime dist_to_prev time_to_prev speed_to_prev
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN NaN NaN
1 1 39.984198 116.319322 2008-10-23 05:53:06 13.690153 1.0 13.690153
2 1 39.984224 116.319402 2008-10-23 05:53:11 7.403788 5.0 1.480758
3 1 39.984211 116.319389 2008-10-23 05:53:16 1.821083 5.0 0.364217
4 1 39.984217 116.319422 2008-10-23 05:53:21 2.889671 5.0 0.577934
... ... ... ... ... ... ... ...
217648 5 39.999896 116.327290 2009-03-19 05:46:02 7.198855 5.0 1.439771
217649 5 39.999899 116.327352 2009-03-19 05:46:07 5.291709 5.0 1.058342
217650 5 39.999945 116.327394 2009-03-19 05:46:12 6.241949 5.0 1.248390
217651 5 40.000015 116.327433 2009-03-19 05:46:17 8.462920 5.0 1.692584
217652 5 39.999978 116.327460 2009-03-19 05:46:37 4.713399 20.0 0.235670

217653 rows × 7 columns

K. situation: column with move and stop points by radius.

move_df = orig_df.copy()
move_df.generate_move_and_stop_by_radius()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id lat lon datetime dist_to_prev dist_to_next dist_prev_to_next situation
0 1 39.984094 116.319236 2008-10-23 05:53:05 NaN 13.690153 NaN nan
1 1 39.984198 116.319322 2008-10-23 05:53:06 13.690153 7.403788 20.223428 move
2 1 39.984224 116.319402 2008-10-23 05:53:11 7.403788 1.821083 5.888579 move
3 1 39.984211 116.319389 2008-10-23 05:53:16 1.821083 2.889671 1.873356 move
4 1 39.984217 116.319422 2008-10-23 05:53:21 2.889671 66.555997 68.727260 move

9. Get time difference between max and min datetime in trajectory data.

move_df.time_interval()
Timedelta('146 days 23:53:32')

And that’s it! See upcoming notebooks to learn more about what PyMove can do!