01 - Exploring MoveDataFrame¶
To work with Pymove you need to import the data into our data structure: MoveDataFrame!
MoveDataFrame is an abstraction that instantiates a new data structure that manipulates the structure the user wants. This is done using the Factory Method design pattern. This structure allows the interface to be implemented using different representations and libraries that manipulate the data.
We have an interface that delimits the scope that new implementing classes should have. We currently have two concrete classes that implement this interface: PandasMoveDataFrame and DaskMoveDataFrame (under construction), which use Pandas and Dask respectively for data manipulation.
It works like this: The user instantiating a MoveDataFrame provides a flag telling which library they want to use for manipulating this data.
Now that we understand the concept and data structure of PyMove, hands on!
MoveDataFrame¶
A MoveDataFrame must contain the columns: - lat
: represents the
latitude of the point. - lon
: represents the longitude of the point.
- datetime
: represents the date and time of the point.
In addition, the user can enter several other columns as trajectory id. If the id is not entered, the points are supposed to belong to the same path.
Creating a MoveDataFrame¶
A MoveDataFrame can be created by passing a Pandas DataFrame, a list, dict or even reading a file. Look:
import pymove as pm
from pymove import MoveDataFrame
From a list¶
list_data = [
[39.984094, 116.319236, '2008-10-23 05:53:05', 1],
[39.984198, 116.319322, '2008-10-23 05:53:06', 1],
[39.984224, 116.319402, '2008-10-23 05:53:11', 1],
[39.984224, 116.319402, '2008-10-23 05:53:11', 1],
[39.984224, 116.319402, '2008-10-23 05:53:11', 1],
[39.984224, 116.319402, '2008-10-23 05:53:11', 1]
]
move_df = MoveDataFrame(data=list_data, latitude="lat", longitude="lon", datetime="datetime", traj_id="id")
move_df.head()
lat | lon | datetime | id | |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
4 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
From a dict¶
dict_data = {
'lat': [39.984198, 39.984224, 39.984094],
'lon': [116.319402, 116.319322, 116.319402],
'datetime': ['2008-10-23 05:53:11', '2008-10-23 05:53:06', '2008-10-23 05:53:06']
}
move_df = MoveDataFrame(data=dict_data, latitude="lat", longitude="lon", datetime="datetime", traj_id="id")
move_df.head()
lat | lon | datetime | |
---|---|---|---|
0 | 39.984198 | 116.319402 | 2008-10-23 05:53:11 |
1 | 39.984224 | 116.319322 | 2008-10-23 05:53:06 |
2 | 39.984094 | 116.319402 | 2008-10-23 05:53:06 |
From a DataFrame Pandas¶
import pandas as pd
df = pd.read_csv('geolife_sample.csv', parse_dates=['datetime'])
move_df = MoveDataFrame(data=df, latitude="lat", longitude="lon", datetime="datetime")
move_df.head()
lat | lon | datetime | id | |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
From a file¶
move_df = pm.read_csv('geolife_sample.csv')
move_df.head()
lat | lon | datetime | id | |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
Cool, huh? The default flag is Pandas. Look that:
type(move_df)
pymove.core.pandas.PandasMoveDataFrame
Let’s try creating one with Dask!
move_df = pm.read_csv('geolife_sample.csv', type_='dask')
move_df.head()
lat | lon | datetime | id | |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
type(move_df)
pymove.core.dask.DaskMoveDataFrame
What’s in MoveDataFrame?¶
The MoveDataFrame stores the following information:
orig_df = pm.read_csv('geolife_sample.csv')
move_df = orig_df.copy()
1. The kind of data he was instantiated¶
move_df.get_type()
'pandas'
move_df.columns
Index(['lat', 'lon', 'datetime', 'id'], dtype='object')
move_df.dtypes
lat float64
lon float64
datetime datetime64[ns]
id int64
dtype: object
In addition to these attributes, we have some functions that allow us to:
1. View trajectory information¶
move_df.show_trajectories_info()
====================== INFORMATION ABOUT DATASET ======================
Number of Points: 217653
Number of IDs objects: 2
Start Date:2008-10-23 05:53:05 End Date:2009-03-19 05:46:37
Bounding Box:(22.147577, 113.548843, 41.132062, 121.156224)
=======================================================================
3. Transform our data to¶
a. Numpy¶
move_df.to_numpy()
array([[39.984094, 116.319236, Timestamp('2008-10-23 05:53:05'), 1],
[39.984198, 116.319322, Timestamp('2008-10-23 05:53:06'), 1],
[39.984224, 116.319402, Timestamp('2008-10-23 05:53:11'), 1],
...,
[39.999945, 116.327394, Timestamp('2009-03-19 05:46:12'), 5],
[40.000015, 116.327433, Timestamp('2009-03-19 05:46:17'), 5],
[39.999978, 116.32746, Timestamp('2009-03-19 05:46:37'), 5]],
dtype=object)
b. Dicts¶
dict_data = move_df.to_dict()
dict_data.keys()
dict_keys(['lat', 'lon', 'datetime', 'id'])
c. DataFrames¶
df = move_df.to_data_frame()
print(type(move_df))
print(type(df))
df
<class 'pymove.core.pandas.PandasMoveDataFrame'>
<class 'pandas.core.frame.DataFrame'>
lat | lon | datetime | id | |
---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
... | ... | ... | ... | ... |
217648 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 5 |
217649 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 5 |
217650 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 5 |
217651 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 5 |
217652 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 5 |
217653 rows × 4 columns
4. And even switch from a Pandas to Dask and back again!¶
new_move = move_df.convert_to('dask')
print(type(new_move))
move_df = new_move.convert_to('pandas')
print(type(move_df))
<class 'pymove.core.dask.DaskMoveDataFrame'>
<class 'pymove.core.pandas.PandasMoveDataFrame'>
5. You can also write files with¶
move_df.write_file('move_df_write_file.txt')
move_df.to_csv('move_data.csv')
6. Create a virtual grid¶
move_df.to_grid(8)
lon_min_x: 113.548843
lat_min_y: 22.147577
grid_size_lat_y: 263013
grid_size_lon_x: 105394
cell_size_by_degree: 7.218082747158498e-05
7. View the information of the last MoveDataFrame operation: operation name, operation time and memory use¶
move_df.last_operation
{'name': 'to_grid', 'time in seconds': 0.010914802551269531, 'memory': '0.0 B'}
9. Create new columns:¶
a. tid
: trajectory id based on Id and datetime¶
move_df.generate_tid_based_on_id_datetime()
move_df.head()
lat | lon | datetime | id | tid | |
---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 |
b. date
: extract date on datetime¶
move_df.generate_date_features()
move_df.head()
lat | lon | datetime | id | tid | date | |
---|---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 | 2008-10-23 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 | 2008-10-23 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 | 2008-10-23 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 | 2008-10-23 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 | 2008-10-23 |
c. hour
: extract hour on datetime¶
move_df.generate_hour_features()
move_df.head()
lat | lon | datetime | id | tid | date | hour | |
---|---|---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 | 2008-10-23 | 5 |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 | 2008-10-23 | 5 |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 | 2008-10-23 | 5 |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 | 2008-10-23 | 5 |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 | 2008-10-23 | 5 |
d. day
: day of the week from datatime.¶
move_df.generate_day_of_the_week_features()
move_df.head()
lat | lon | datetime | id | tid | date | hour | day | |
---|---|---|---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday |
e. period
: time of day or period from datatime.¶
move_df.generate_time_of_day_features()
move_df.head()
lat | lon | datetime | id | tid | date | hour | day | period | |
---|---|---|---|---|---|---|---|---|---|
0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday | Early morning |
1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday | Early morning |
2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday | Early morning |
3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday | Early morning |
4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 | 12008102305 | 2008-10-23 | 5 | Thursday | Early morning |
f. dist_to_prev
, time_to_prev
, speed_to_prev
: create features of distance, time and speed to an GPS point P (lat, lon).¶
move_df = orig_df.copy()
move_df.generate_dist_time_speed_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | dist_to_prev | time_to_prev | speed_to_prev | |
---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | NaN | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 13.690153 | 1.0 | 13.690153 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 7.403788 | 5.0 | 1.480758 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1.821083 | 5.0 | 0.364217 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 2.889671 | 5.0 | 0.577934 |
g. dist_to_prev
, dist_to_next
, dist_prev_to_next
: three distance in meters to an GPS point P (lat, lon).¶
move_df = orig_df.copy()
move_df.generate_dist_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | dist_to_prev | dist_to_next | dist_prev_to_next | |
---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | 13.690153 | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 13.690153 | 7.403788 | 20.223428 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 7.403788 | 1.821083 | 5.888579 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1.821083 | 2.889671 | 1.873356 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 2.889671 | 66.555997 | 68.727260 |
h. time_to_prev
, time_to_next
, time_prev_to_next
: three time in seconds to an GPS point P (lat, lon).¶
move_df = orig_df.copy()
move_df.generate_time_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | time_to_prev | time_to_next | time_prev_to_next | |
---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | 1.0 | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1.0 | 5.0 | 6.0 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 5.0 | 5.0 | 10.0 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 5.0 | 5.0 | 10.0 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 5.0 | 2.0 | 7.0 |
i. speed_to_prev
, speed_to_next
, speed_prev_to_next
: three speed in meters by seconds to an GPS point P (lat, lon).¶
move_df = orig_df.copy()
move_df.generate_speed_features()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | speed_to_prev | speed_to_next | speed_prev_to_next | |
---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | 13.690153 | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 13.690153 | 1.480758 | 3.515657 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1.480758 | 0.364217 | 0.922487 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 0.364217 | 0.577934 | 0.471075 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 0.577934 | 33.277998 | 9.920810 |
j. dist_to_prev
, time_to_prev
, speed_to_prev
: distance, time and speed from previous ponint¶
move_df = orig_df.copy()
move_df.generate_dist_time_speed_features(inplace=False)
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | dist_to_prev | time_to_prev | speed_to_prev | |
---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | NaN | NaN |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 13.690153 | 1.0 | 13.690153 |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 7.403788 | 5.0 | 1.480758 |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1.821083 | 5.0 | 0.364217 |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 2.889671 | 5.0 | 0.577934 |
... | ... | ... | ... | ... | ... | ... | ... |
217648 | 5 | 39.999896 | 116.327290 | 2009-03-19 05:46:02 | 7.198855 | 5.0 | 1.439771 |
217649 | 5 | 39.999899 | 116.327352 | 2009-03-19 05:46:07 | 5.291709 | 5.0 | 1.058342 |
217650 | 5 | 39.999945 | 116.327394 | 2009-03-19 05:46:12 | 6.241949 | 5.0 | 1.248390 |
217651 | 5 | 40.000015 | 116.327433 | 2009-03-19 05:46:17 | 8.462920 | 5.0 | 1.692584 |
217652 | 5 | 39.999978 | 116.327460 | 2009-03-19 05:46:37 | 4.713399 | 20.0 | 0.235670 |
217653 rows × 7 columns
K. situation
: column with move and stop points by radius.¶
move_df = orig_df.copy()
move_df.generate_move_and_stop_by_radius()
move_df.head()
VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))
id | lat | lon | datetime | dist_to_prev | dist_to_next | dist_prev_to_next | situation | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | NaN | 13.690153 | NaN | nan |
1 | 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 13.690153 | 7.403788 | 20.223428 | move |
2 | 1 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 7.403788 | 1.821083 | 5.888579 | move |
3 | 1 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1.821083 | 2.889671 | 1.873356 | move |
4 | 1 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 2.889671 | 66.555997 | 68.727260 | move |
9. Get time difference between max and min datetime in trajectory data.¶
move_df.time_interval()
Timedelta('146 days 23:53:32')