In this post we will see how to efficiently perform data bulk load in a spatial PostGIS database running in a Linux environment, coming the data from an airborne LiDAR survey.
Spatial databases can be used to store data collected in remote sensing Light Detection and Ranging[1] (LiDAR) surveys. LiDAR technology has many applications is forestry, where it is used in the generation Digital Terrain Models and Digital Surface Models[2], estimation of forest fuel variables…
In an airborne LiDAR survey, a LiDAR sensor emits laser pulses towards ground. Laser pulses are reflected when reaching obstacles as trees or the ground and, after a time elapse, echoes are received by the sensor, which records elapsed time and echo intensity. In addition, this kind of sensors have a kinematic Global Positioning System[3] (GPS) and an Inertial Measurement Unit[4] (IMU) that continuously registers the spatial position of the aircraft. Thus, high precision geo-referenced elevation and laser reflection intensity data is collected.
LiDAR surveys produce high volume of data. For instance, as of 2010 a commercial airborne LiDAR survey provided 12 points per square meter, which sums 120,000 points for an area as small as a hectare.
Thus, the problem of efficiently loading LiDAR data in a spatial database arises. The obvious approach, using INSERT queries, will work, but at the expense of consuming a lot of time. In sections 1, 2 and 3 we will see how to achieve efficient bulk data load in PostGIS by using the command COPY[5]. In section 4 some benchmarking is done, comparing the perfomance of data loading with COPY and INSERT.
1. Create the table.
The first task is creating the table that will store the LiDAR data. In our example, this table will be given the name ‘lidar_data’ and will belong to a database called ‘db_lidar’. The information stored in this table is the recorded pulse echo intensity, the elevation (z) and the 2D point coords.
CREATE TABLE lidar_data (
intensity FLOAT NOT NULL,
z FLOAT NOT NULL
);
SELECT AddGeometryColumn('', 'lidar_data', 'the_geom', -1, 'POINT', 2);
Note that geographic information is stored without spatial reference system (SRS) id.
2. Create the input CSV file.
The second task is creating a input CSV file. This file contains the data to be inserted into the table and will be directly read by the COPY command. In the present example, this file will be named ‘lidar_data.csv’ and should look like shown below:
"intensity";"z";"the_geom" "101.5";"601.200";"010100000000000080e514224100000040d9405241" "191.4";"604.820";"0101000000000000c0e514224100000020d9405241" "287.1""600.970";"010100000000000060e514224100000080d8405241" "275.5";"600.920";"010100000000000040e514224100000020d8405241" "63.8";"601.480";"010100000000000040e5142241000000c0d7405241" ...
The point information (coords. x and y) are stored in the column ‘the_geom’ using its Well-Known Binary (WKB) [note 1] representation. The character used as column separator is semicolon (ASCII code 59) and the one used as quote is double quote (ASCII code 34).
Since the COPY command is picky about the input data it receives, data formatting must be taken with special care. In particular, column order in the CSV file must exactly match column order given in the CREATE TABLE and SELECT AddGeometry statements (section 1).
3. Perform LiDAR data load.
The third task is divided in two steps. First step, create a text file named, for example, ‘bulk-load.sql’ containing the following line:
COPY lidar_data
FROM '/path/to/lidar_data.csv'
USING DELIMITERS ';' CSV HEADER QUOTE AS '"';
Second step, as postgres user – actually as database superuser – [note 2] use the psql command to execute the query stored in the file ‘bulk-load.sql’:
postgres@myComputer$ psql --dbname db_lidar -f /path/to/bulk-load.sql
Note that we have passed the COPY command the characters used as column delimiter and as quote. On the other hand, HEADER tells the COPY command to ignore the first line of the CSV file, which contains the column names.
In order to speed up data insertion, the shown table has no index. If needed, an index can be added after data insertion. Adding a GiST index[6, 7] is very recommendable.
$ CREATE INDEX lidar_data_idx ON lidar_data USING GIST (the_geom); $ VACUUM ANALYZE lidar_data;
4. Some benchmarking
By using the COPY command, the data stored in a CSV file counting ~2,108,000 lines – or points with its z and intensity attributes – was inserted into the table in ~20.5 seconds. PostgreSQL and PostGIS versions were respectively, 8.3.8 and 1.3.5, and hardware main features were:
- processor: 2 x Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
- cache size: 3,072 kilobytes
- bogomips: 5,852.47
- RAM size: 3,597,972 kilobytes
In the same system, loading just 50,000 randomly picked points of the previous test by executing 50,000 INSERT queries in a single transaction took ~18.5 seconds. Apparently, in both tests only one of two cores the processor had was used.
| number of points | method | time (s) | points per second |
| 50000 | INSERT[*] | 18.5 | 2703 |
| 2108000 | COPY | 20.5 | 102829 |
| [*] all inserts were done in the same transaction | |||
As show in the previous table, the COPY command yields ~102,829 inserted points per second, while the same ratio for the INSERT queries is ~2,703.
notes
[note 1] In a previous post of this blog it was seen how to get the WKB representation of a point using the JTS Topology Suite.
[note 2] If not executed as postgres user, you will likely get an error message like this: “ERROR: must be superuser to COPY to or from a file”; this is because of security concerns. If not possible, an alternative can be psql’s command \copy.
references
[1] “Light Detection and Ranging”, Wikipedia entry: http://en.wikipedia.org/wiki/LIDAR
[2] “Digital Elevation Model”, Wikipedia entry: http://en.wikipedia.org/wiki/Digital_elevation_model
[3] “Global Positioning System”, Wikipedia entry: http://en.wikipedia.org/wiki/Global_Positioning_System
[4] “Inertial Measurement Unit”, Wikipedia entry: http://en.wikipedia.org/wiki/Inertial_measurement_unit
[5] PostgreSQL COPY command reference: PostgreSQL: Documentation: Manuals: PostgreSQL 8.3: COPY
[6] “Building Indexes” (from “Using PostGIS: Data Management and Queries”): http://postgis.refractions.net/docs/ch04.html#id2638705
[7] “Spatial Indexing” (from “Introduction to PostGIS”): http://revenant.ca/www/postgis/workshop/indexing.html
Hello,
Thank you very much for this interesting proposal. Maybe should you add the time to generate the Excel file in your benchmarking.
Maybe a stupid question : why no SRS is associated to the LIDAR data ?
Laurent
The LiDAR data I was given were georeferenced. However, I was unable to perform the bulk data load with associated SRS.
Suggestions are welcome…
Very interesting post, we’ve been working on something similar over the last couple of months. Hardware does make a big difference, we have a dedicated server and using your example on this we’re getting a ~2.7 speedup.
We’re doing some detailed work on creating a spatial data framework using PostGIS for LiDAR data. Using pg_bulkload and our optimisations we can nearly double the points per second over COPY.
Our real goal is to create a system for the storage, access, processing and visualisation of TBs of LiDAR (and imagery). So we really need the spatial index. Our process from file to a usable DB table (including file processing and index creation) is 39,900 points per second (more information is here (http://goo.gl/Ro5S5).
Some examples
http://www.youtube.com/watch?v=WQlLMHV4gno&feature=player_embedded
http://www.youtube.com/watch?v=GQ6a_-78LNI&feature=player_embedded
Very interesting. I hope I can get time for reading your paper.
ps: I have done some editting in your comment, mainly improved formatting (put the benchmark data in tables). I hope you don’t care.