Postgres and JSON

Objective

Lately I have been doing some research on Internet security, and one of the interesting topics is SSL certificates. To analyze them, I first downloaded the raw data of a port 443 scan from Censys. It comes in JSON format.

Getting started

First, I processed the file with a script that extracts the IP address from the rest of the data and skips all error records. In PHP it looks like this (most of the PHP code is omitted because it does not matter for the aspect reviewed here):

while ($s = fgets($fin)) {
    // skip error records like this:
    // {"ip":"33.44.55.66","timestamp":"2017-11-15T18:38:30+01:00","data":{"http":{}},"error":"..."}
    if (strpos($s, '"http":{}},"error":') !== false) continue;
    // split the JSON: pull the IP out as a separate field, ignore the date and
    // prepare everything else for insertion into the DB as jsonb
    // {"ip":"22.33.44.55","timestamp":"2017-11-15T18:38:30+01:00","data":{"http":{"response":{"..."}}}
    $s = str_replace('\u0000', '', $s); // remove \u0000 because the jsonb datatype in PostgreSQL does not support it
    $ip = extract_ip($s);
    $data = extract_data($s);
    fputs($fout, $ip . "\t" . $data);
}

Okay, the next step is to prepare the table and load the data:

CREATE TABLE public.zgrabtmp (ip inet NOT NULL, jdata jsonb) 
WITH (OIDS = FALSE) TABLESPACE pg_default;
COPY zgrabtmp FROM '/path/to/processed.file' QUOTE e'\x01' DELIMITER e'\t' CSV;

The import takes a while, because the source file is hundreds of gigabytes in size.
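
Once the COPY finishes, a quick check like the following (just an illustration) shows how many rows actually landed and how much disk space the table takes:

-- row count and total on-disk size of the freshly loaded table
SELECT count(*) AS row_count,
       pg_size_pretty(pg_total_relation_size('zgrabtmp')) AS total_size
FROM zgrabtmp;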

Problem

After the import we have around 40 million records in the table. Now we can check, for example, how many certificates have version 2:

SELECT count(*) from zgrabtmp where 
to_number(jdata->'data'->'http'->'response'->'request'->'tls_handshake'->
'server_certificates'->'certificate'->'parsed'->>'version','9') = 2;
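
Even before timing it, it is clear this cannot be fast: there is no index on that expression, so plain EXPLAIN (no ANALYZE needed, that would run the whole query again) shows that the only available plan is a sequential scan which parses every jsonb document just to read one nested field:

EXPLAIN
SELECT count(*) from zgrabtmp where
to_number(jdata->'data'->'http'->'response'->'request'->'tls_handshake'->
'server_certificates'->'certificate'->'parsed'->>'version','9') = 2;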


On my most powerful desktop, with a 6-core Broadwell CPU, hardware RAID and 128 GB of RAM, and with tuned PostgreSQL settings, it takes 35 minutes. Far too slow for real queries, I think. Of course, we could create an index for each JSON field we query, for example:

CREATE INDEX serveridxgin ON public.zgrabtmp USING gin
    ((((((jdata->'data'::text)->'http'::text)->'response'::text)->'headers'::text)->'server'::text))
    TABLESPACE pg_default;

But in practice too many indexes would be required, and they would take too much space to store.
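
The space part is easy to check with the built-in size functions, for the single expression index created above and for all indexes on the table:

-- size of one expression index and of all indexes on the table
SELECT pg_size_pretty(pg_relation_size('serveridxgin'));
SELECT pg_size_pretty(pg_indexes_size('zgrabtmp'));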

Solution


On the other hand, it is possible to create ordinary tables for all the required data and fill them by extracting fields from the JSON. For example, I created a table for certificates:

CREATE TABLE public.certs
(
    ip inet NOT NULL,
    validfrom date,
    validto date,
    version integer,
    distribution character varying(2000),
...
    CONSTRAINT certs_pkey PRIMARY KEY (ip)
)

And fill it:

insert into certs select ip, 
to_date(jdata->'data'->'http'->...->'validity'->>'start','YYYY-MM-DD'),
to_date(jdata->'data'->'http'->...->'validity'->>'end','YYYY-MM-DD'),
to_number(jdata->'data'->'http'->...->'parsed'->>'version','9'),
trim(both '["|"]' from jdata->'data'->...->'extensions'->>'crl_distribution_points')
...;

Yes, this query will take quite some time, but now counting the version 2 certificates takes only a few dozen seconds:

select count(*) from certs where version = 2;
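
If even that is not fast enough, a plain b-tree index on the extracted column (the index name here is just an example) makes such filters practically instant:

-- optional: index the extracted column for near-instant filtering
CREATE INDEX certs_version_idx ON certs (version);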

To store information about the hosts' software I made another table. As the join key it is possible to use the ip field, which is constrained (as the primary key) in both tables.
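
A minimal sketch of what this could look like (the software table and its columns are only an illustration, not the exact schema):

CREATE TABLE public.software
(
    ip inet NOT NULL,
    server character varying(200),
    powered_by character varying(200),
    CONSTRAINT software_pkey PRIMARY KEY (ip)
);

-- example join on ip: which server software presents version 2 certificates
select s.server, count(*)
from certs c
join software s on s.ip = c.ip
where c.version = 2
group by s.server
order by count(*) desc;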

Conclusion


Storing JSON data "as is" in the database lets you import it quickly and start mining it immediately. But if the data size is really big (for example, bigger than the available RAM), it works too slowly. So, if the JSON data is regular and has the same shape in every record, the better way is to use classic flat database tables.
