A collection of tutorials, code and tools to help you get better using Oracle
07 February 2012
Sometimes when plundering through legacy systems, you come across things that make just about zero sense. When you ask around, people say there must be a reason, some explanation for the madness, and often there is, but often there isn't either.
I recently came across such an example, and it sounds like something right out of the Daily WTF. The application had a table like the following:
create sequence t_master_seq start with 100000000 increment by 1; -- generate 500K random rows create table t_master as select t_master_seq.nextval pk, lpad('a', 400, 'a') padding, dbms_random.string('U', 20) email, dbms_random.string('U', 20) surname, dbms_random.string('U', 20) forename, dbms_random.string('U', 20) ship_to_address_hash, dbms_random.string('U', 20) invoice_to_address_hash from dual connect by level <= 500000; -- quadruple the data, giving 4 rows of each key begin for i in 1..2 loop insert into t_master select t_master_seq.nextval, padding, email, surname, forename, ship_to_address_hash, invoice_to_address_hash from t_master; end loop; end; / create unique index t_master_pk on t_master(pk);
This table holds transactions, and I have filled each key field (email, forename, surname etc) with some random data. Then I ensure there are 4 rows that contain each key field by duplicating each row and replacing the PK value with a new unique one. Each email will appear in the table in 4 different transactions. Same goes for the other fields too.
There is nothing strange about that, but then there is a child table for each key field, with an index on the key field, along with carrying down the primary key:
-- child table 1 create table t_email as select pk, email from t_master; create unique index t_email_pk on t_email(pk); create index t_email_idx1 on t_email(email); -- child table 2 create table t_surname as select pk, surname from t_master; create unique index t_surname_pk on t_surname(pk); create index t_surname_idx1 on t_surname(surname); -- child table 3 create table t_forename as select pk, forename from t_master; create unique index t_forename_pk on t_forename(pk); create index t_forename_idx1 on t_forename(forename); -- child table 4 create table t_ship_to_address_hash as select pk, ship_to_address_hash from t_master; create unique index t_ship_to_address_hash_pk on t_ship_to_address_hash(pk); create index t_ship_to_address_hash_idx1 on t_ship_to_address_hash(ship_to_address_hash); -- child table 5 create table t_invoice_to_address_hash as select pk, invoice_to_address_hash from t_master; create unique index t_invoice_to_address_hash_pk on t_invoice_to_address_hash(pk); create index t_invoice_to_address_hash_idx1 on t_invoice_to_address_hash(invoice_to_address_hash);
At this point you may ask yourself, what is the point of all these child tables? All they do is hold pieces of the original table. The primary keys are there to prove uniqueness. Each row in each child table is just a piece of a corresponding row in the master table, with no additional information.
It turns out the data can be queried on any combination of the 5 key fields, so the problem is slightly tricky as the fields are not related. One index cannot efficiently solve the queries, and a bitmap index is out as the tables are updated in an OLTP fashion.
Without resorting to full text indexes, the best way to answer the application requirements is something like:
select * from t_master where email = :b1 union all select * t_master where forename = :b2 etc ...
Even with this tricky requirement, there is still no reason for these additional five tables. If you read this far, you are probably waiting for the exciting reason, so I resorted to asking the designer. It is a star schema, it was designed this way for performance, he told me
The problem is, is is not a star schema and the performance is worse than it should be, much worse. I don't think this would surprise many people, but it really hammers home the importance of benchmarking, and knowing how to prove one method is better than another. In this case, all we need is Autotrace to prove how much extra work this design causes.
Consider a query on just one of the fields, email. With the current design, to retrieve all the rows with a given email, the following query is required:
select m.* from t_master m, t_email e where e.email = 'BVVEDVZRCZHPHXEYDPHM' and e.pk = m.pk;
This takes 22 logical IOs, and on the real system, many of them will be from disk.
However, if an index was created on t_master.email, the query can be answered in 8 logical IOs.
select * from t_master m where m.email = 'BVVEDVZRCZHPHXEYDPHM';
That is almost a third of the work, and things get worse. The cache becomes less efficient due to all this duplicated data and the extra 5 primary key indexes. That doesn't even address the overhead of maintaining the indexes, and the application logic to split those rows up into many and all the extra disk space required.
Sorry, I cannot. There is no explanation I can find for the madness in this case, but I'm petty sure I can half the application response time with a few simple changes.