High Performance in Greenplum Database
Too often I come across situations where performance is an issue when it doesn't need to be. There are a few basic rules I always apply to make tuning a managed and controlled process. This ensures that I work towards a goal and that my steps lead to a measurable result.
The first rule is "take it one step at a time". I used to be too enthusiastic, spotting multiple opportunities to improve performance at once, but that didn't always bring the expected gain. Changing one thing at a time feels slower, but it leads to better results and often takes less time overall.
Let's say it is the load performance that is not up to expectation. First, let's get the facts straight:
- how many source files are there,
- where are they located,
- what data is in the source files,
- how many ETL servers are there (number of servers and their OS),
- what is the connectivity between the ETL servers and the Greenplum cluster (1GbE, 10GbE, number of NICs),
- how many Greenplum file distribution processes (i.e. gpfdist processes) are configured,
- what is the definition of the external table in Greenplum,
- what is the definition of the target table (distribution key, partitioning, ...),
- is any ETL being done in the load process (i.e. in the SELECT clause of the INSERT statement),
- what is the size of the Greenplum cluster (number of servers, number of primary segments per segment server),
- how much data is already in the database,
- what is the expectation?
These questions can be answered quickly and together give a good picture of the environment. Based on these facts, you can first determine whether the expectation is realistic and, if so, measure by how much the actual loading time is off.
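A couple of these facts can be pulled straight from the database with catalog queries. A minimal sketch (the table name my_fact_table is a placeholder; gp_segment_configuration is Greenplum's catalog of segment instances):

    -- count the primary segments and the segment hosts in the cluster
    SELECT count(*) AS primary_segments,
           count(DISTINCT hostname) AS segment_hosts
    FROM   gp_segment_configuration
    WHERE  role = 'p' AND content >= 0;  -- content -1 is the master

    -- how much data is already in the target table (name is a placeholder)
    SELECT pg_size_pretty(pg_total_relation_size('my_fact_table'));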
Some straightforward guidelines, based on a non-trivial loading environment (i.e. multiple large files with at least a billion (10^9) rows):
- split your source files over the available ETL servers (a single server is fine too);
- if an ETL server has multiple NICs, set up one gpfdist process per NIC, balancing the load across all available subnets;
- ensure the external table definition points at each NIC on the ETL server for its own set of files (for example, with 1 ETL server and 2 NICs: load the even-numbered files over NIC1 and the odd-numbered files over NIC2); see the external table sketch after this list;
- in the external table, use format TEXT rather than CSV where possible, as TEXT is read in blocks while CSV is processed line by line;
- for large, fact-like tables, always try to use an append-only table instead of a heap table; loading is much faster because there is no overhead for supporting UPDATEs and DELETEs; see the table sketch after this list;
- avoid indexes in general; they are an important tuning component in traditional OLTP (non-MPP) databases, but not crucial in Greenplum's MPP database, as data is already distributed over the segments and may additionally be partitioned; (partition or full) table scans are ok in Greenplum!
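To make the NIC balancing and file format concrete, here is a minimal external table sketch. The host names etl1-nic1 and etl1-nic2, the ports, the file patterns, and the columns are all assumptions for illustration; one gpfdist process would be started per NIC on the ETL server (e.g. gpfdist -d /data/load -p 8080) before the table is read:

    -- one gpfdist URL per NIC; TEXT format is read in blocks
    CREATE EXTERNAL TABLE ext_orders (
        order_id   bigint,
        order_date date,
        amount     numeric(12,2)
    )
    LOCATION (
        'gpfdist://etl1-nic1:8080/orders_even_*.txt',  -- even-numbered files over NIC1
        'gpfdist://etl1-nic2:8081/orders_odd_*.txt'    -- odd-numbered files over NIC2
    )
    FORMAT 'TEXT' (DELIMITER '|');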
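And a matching sketch of an append-only target table plus the load itself. The distribution key, the monthly range partitioning, and all names are illustrative assumptions; appendonly=true is the storage option that avoids the heap overhead mentioned above:

    -- append-only fact table, distributed and partitioned, no indexes
    CREATE TABLE orders (
        order_id   bigint,
        order_date date,
        amount     numeric(12,2)
    )
    WITH (appendonly=true)
    DISTRIBUTED BY (order_id)
    PARTITION BY RANGE (order_date)
    (
        START (date '2012-01-01') INCLUSIVE
        END   (date '2013-01-01') EXCLUSIVE
        EVERY (interval '1 month')
    );

    -- the load: any ETL in the SELECT clause runs in parallel on the segments
    INSERT INTO orders
    SELECT order_id, order_date, amount
    FROM   ext_orders;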
Now you have set yourself up for a fast load. I would be very surprised if these steps do not improve your performance.
Ed