Looking For A Giga-Record Database

parityboy (Limp Gawd, joined Nov 13, 2010, 390 messages)
I'm in the process of planning an NNTP download robot for KDE 4.x. The plan is for it to have feature parity with NewsLeecher, which I've been using for quite a while. Unfortunately, NewsLeecher has performance issues.

What I'm planning will not be simply an NZB downloader, but a fully fledged NNTP client that stores headers locally and allows local full-text searching. However, some newsgroups can have header counts reaching into the hundreds of millions and sometimes further - I've seen a header count of over 3 billion for one particular newsgroup.

So what I'm looking for is an embeddable database capable of handling that many records, with stability and a decent level of performance. I've been looking at NoSQL databases, but I'm not sure whether key/value or document databases would be a better fit.

Or should I tie the project directly into Akonadi and use that to index the headers instead? Akonadi currently uses MySQL as a storage backend, and I do not know how this will perform on the average modern desktop with billion-record tables.

Any advice is appreciated. :)
 
MySQL is garbage. SQL Server has an Express edition and an embeddable edition. Plus, a developer's edition.

Why do you need billion-row tables? Why can't you partition?
 
Thank you for the response. :) I'm looking for the best solution for handling 1Bn+ records. I'm not sure whether an SQL or a NoSQL database is the answer. As I stated, the project will be written for Linux + KDE 4.x so Microsoft products are out of the question, since I would like the database to run in-process.

I know products like Newsman Pro can use an external database, and I may support that with a plug-in at some point in the future, but for now I'd like the database to be embedded.

What is your experience with NoSQL databases such as Berkeley DB and Tokyo Cabinet - how well do they perform when record counts run into the billions?

As a sample, this is what I believe I will be storing in each record.

Code:
Path:   news.astraweb.com!router1.astraweb.com!news.glorb.com!feeder.erje.net!feeder2.news.saunalahti.fi!reader1.news.saunalahti.fi!53ab2750!not-for-mail
From:  Sender <[email protected]>
Newsgroups:   alt.fan.robert-jordan
Subject:   Exciting times
Message-ID:   <[email protected]>
References:   <17811870-955f-420e [email protected]>
X-Newsreader:   Forte Free Agent 1.93/32.576 English (American)
MIME-Version:   1.0
Content-Type:   text/plain; charset="us-ascii"
Content-Transfer-Encoding:   7bit
Lines:   50
Date:   09:57 +0300
NNTP-Posting-Host:  xxx.xxx.xxx.xxx
X-Complaints-To:   [email protected]
X-Trace:   10:04 EEST)
NNTP-Posting-Date:   10:04 EEST
Organization:   Saunalahti Customer

Obviously I won't need all of these headers. I'm also wondering if a K/V database would be better suited, or perhaps something like CLucene? SQLite?

Opinions?
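
Just to make this concrete, here's a minimal sketch of the sort of embedded schema I have in mind if I go the SQLite route, driven through the sqlite3 C API. The file name, table, and columns are placeholders, not a finished design.

Code:
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3 *db = 0;
    if (sqlite3_open("headers.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // One row per header; only the fields the client actually searches/sorts on.
    const char *ddl =
        "CREATE TABLE IF NOT EXISTS nntp_header ("
        "  id         INTEGER PRIMARY KEY,"   // alias for SQLite's rowid
        "  newsgroup  TEXT NOT NULL,"
        "  message_id TEXT NOT NULL UNIQUE,"
        "  subject    TEXT NOT NULL,"
        "  poster     TEXT,"
        "  posted_at  INTEGER,"               // Unix timestamp parsed from the Date: header
        "  line_count INTEGER"
        ");"
        "CREATE INDEX IF NOT EXISTS idx_header_date ON nntp_header(posted_at);";

    char *err = 0;
    if (sqlite3_exec(db, ddl, 0, 0, &err) != SQLITE_OK) {
        std::fprintf(stderr, "schema failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}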
 
Berkeley DB is just a record manager. You're avoiding SQL and talking directly to the storage engine, which means it's up to you to write the code to implement the query you want to do. I'm not sure why you think it might be of benefit. Same goes for Tokyo Cabinet -- though I've used BDB and not TC.

It's not obvious to me that you don't need all the headers; if you don't need them, why did you say that's what you figure you'd be storing in each record? Regardless, what you're storing is not nearly as interesting as how you intend to query the data.

You still haven't explained why you need billion-record capacity; you haven't explained why you can't partition.
 
I'm not a DBA, so I know little of partitioning. :) Also, I need the database to run in-process - how many in-process databases support partitioning? Is a relational database the best option for storing these headers, or is a K/V database better?

Regarding the sample I posted, it was just a straight copy & paste. I don't think I need "X-Newsreader:" - can you see a use for it?

As for the billion-record capability - if you've been anywhere near binary newsgroups, you will know that many groups have header counts reaching into the hundreds of millions, with a few touching 1Bn+ and one currently at 5Bn+ last time I looked at it. Again, I'm not a DBA so I assumed 1 table per newsgroup would be the basic model. I know that MongoDB can query billions of records in mere seconds, but MongoDB cannot be used in-process.

What was your experience with BerkeleyDB? What was the record count, and how did it perform?
 
All record managers support partitioning. Either they support it through constructs: you tell the engine a store is partitioned, and it partitions it -- or they support it manually: you tell the engine you have two different stores, and you partition them yourself.

Having a table per newsgroup is a form of partitioning.
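
As a sketch of the manual flavour (assuming SQLite; the sanitize_group() helper and the naming scheme are entirely hypothetical), you derive a store name from the partition key and create it on demand:

Code:
#include <sqlite3.h>
#include <cctype>
#include <string>

// Turn "alt.binaries.foo" into a legal table name such as "hdr_alt_binaries_foo".
static std::string sanitize_group(const std::string &group)
{
    std::string name = "hdr_";
    for (std::string::size_type i = 0; i < group.size(); ++i) {
        char c = group[i];
        name += std::isalnum(static_cast<unsigned char>(c)) ? c : '_';
    }
    return name;
}

// Create the per-newsgroup partition if it doesn't exist yet.
static bool ensure_partition(sqlite3 *db, const std::string &group)
{
    std::string ddl =
        "CREATE TABLE IF NOT EXISTS " + sanitize_group(group) + " ("
        "  message_id TEXT PRIMARY KEY,"
        "  subject    TEXT NOT NULL,"
        "  posted_at  INTEGER"
        ");";
    return sqlite3_exec(db, ddl.c_str(), 0, 0, 0) == SQLITE_OK;
}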

I have no idea what your application will do, so I don't know which headers you might or might not want. Among the headers that you might want, I don't know what you'll use them for or how you'll query them.
 
I'm in the process of planning an NNTP download robot...
...
What I'm planning will not be simply an NZB downloader, but a fully fledged NNTP client that stores headers locally and allows local full-text searching.

I'm not sure how to clarify this further. The application will allow the user to pull the headers from the NNTP server and store them locally, search or browse for items, select items for download and place them in a queue, download the items from the NNTP server and decode them into binaries before saving the binaries on disk.

The application will have to store the NNTP headers locally in some kind of database in order to enable the search capabilities, which is why I'm looking for an in-process database for this application. The user will be able to search the headers by subject and sort them by date/subject/status/poster/newsgroup and perhaps other criteria.

I'm also thinking that the download queue would be in a separate table, and would contain references to the actual headers, rather than copies.
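
Roughly what I mean by references rather than copies - again just a sketch, assuming SQLite and the hypothetical nntp_header table from my earlier post:

Code:
// Sketch of a download queue that points at headers instead of copying them.
static const char *kQueueDdl =
    "CREATE TABLE IF NOT EXISTS download_queue ("
    "  queue_id  INTEGER PRIMARY KEY,"
    "  header_id INTEGER NOT NULL REFERENCES nntp_header(id),"
    "  status    INTEGER NOT NULL DEFAULT 0,"  // e.g. 0 = queued, 1 = downloading, 2 = done
    "  queued_at INTEGER NOT NULL"
    ");";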
 
MySQL is garbage. SQL Server has an Express edition and an embeddable edition. Plus, a developer's edition.

Why do you need billion-row tables? Why can't you partition?

Tell that to Twitter, PayPal, and Facebook. :)
 
Tell that to Twitter, PayPal, and Facebook. :)

I'm not much of a DBA, but I assume Mike is saying a table doesn't grow to a billion records on its own; it's partitioned by some means into smaller, more manageable chunks, either by a built-in mechanism or by creating multiple tables yourself.

For example, I have an application that records many gigs of data. I partition the binary files into small, few-hundred-megabyte chunks. So even though I have "gigs of data", for practical reasons my files are a few hundred megabytes each.
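
Something along these lines - purely a sketch; the 256 MB cap and the file naming are arbitrary choices, not what my application actually uses:

Code:
#include <cstdio>
#include <cstddef>

// Sketch: roll output over to a new chunk file once it passes a size cap.
class ChunkedWriter {
public:
    explicit ChunkedWriter(std::size_t maxBytes = 256u * 1024u * 1024u)
        : maxBytes_(maxBytes), written_(0), index_(0), fp_(0) {}

    ~ChunkedWriter() { if (fp_) std::fclose(fp_); }

    void write(const void *data, std::size_t len)
    {
        if (!fp_ || written_ + len > maxBytes_)
            rollover();                 // start capture_0000.bin, capture_0001.bin, ...
        if (!fp_)
            return;                     // open failed; drop the data in this sketch
        written_ += std::fwrite(data, 1, len, fp_);
    }

private:
    void rollover()
    {
        if (fp_) std::fclose(fp_);
        char name[32];
        std::sprintf(name, "capture_%04u.bin", index_++);
        fp_ = std::fopen(name, "wb");
        written_ = 0;
    }

    std::size_t maxBytes_, written_;
    unsigned index_;
    std::FILE *fp_;
};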
 
Tell that to Twitter, PayPal, and Facebook. :)

Those companies aren't using MySQL. They started with MySQL, then significantly enhanced and modified the code. If you have time to write your own database management system, MySQL is a great place to start. If you don't have that kind of time, you're better off with a package that gives better performance and manageability out of the box. Even after that effort, they don't use MySQL exclusively. In fact, I think PayPal is better known for using Oracle rather than MySQL.
 
Tell that to Twitter, PayPal, and Facebook. :)

Each of those groups basically has their own RDBMS, which borrows from the MySQL engine but is almost entirely custom in its operation.

They have grown to the point where their needs required customizations and different solutions.
Twitter, for example, has some DBA notoriety because they have deliberately denormalized a lot of their data. Does this mean we throw 3NF away?
 
I'm not much of a DBA, but I assume Mike is saying a table doesn't grow to a billion records on its own; it's partitioned by some means into smaller, more manageable chunks, either by a built-in mechanism or by creating multiple tables yourself.

For example, I have an application that records many gigs of data. I partition the binary files into small, few-hundred-megabyte chunks. So even though I have "gigs of data", for practical reasons my files are a few hundred megabytes each.

How does your application perform in terms of disk performance? Is it running on a single disk? I ask because the nature of my project means that 9 times out of 10, the end user will be running on a single-disk system with ~4GB RAM.

On a single-disk system, do multiple files perform better than a single big file?
 
Each of those groups basically has their own RDBMS, which borrows from the MySQL engine but is almost entirely custom in its operation.

They have grown to the point where their needs required customizations and different solutions.
Twitter, for example, has some DBA notoriety because they have deliberately denormalized a lot of their data. Does this mean we throw 3NF away?

Why on earth would you throw away 3NF? Twitter denormalizing their DB makes sense because of the way the program functions (relatively small inserts and billions of reads). Denormalizing something like SAP would be a disaster. It goes back to each of these companies specializing their DBs for their specific needs, highly transactional or highly reportable. Data warehouses have used denormalized databases for decades to make reporting on TBs of data manageable. Partitioning, specialized indexing, etc. all help to make this possible.
 
How does your application perform in terms of disk performance? Is it running on a single disk? I ask because the nature of my project means that 9 times out of 10, the end user will be running on a single-disk system with ~4GB RAM.

On a single-disk system, do multiple files perform better than a single big file?

Not to be a dick, but I think you really need to do some testing and get a better handle on what you are trying to accomplish. You have a good idea, but the implementation of it hasn't been fleshed out. I would do some testing with a few different engines that fit into your overall project and then start looking for ways to tweak it.

You're asking us to suggest a DB engine that will support billion-record tables, when it's not even a good idea to have a flat billion-record table that isn't partitioned in some manner. If you want to access billions of records quickly and you are not familiar with partitioning within a DB, you need to do more research. Get a baseline of how long searching a 3-billion-record table takes for a normal query in your program, then define a goal and start researching how to improve your query time.
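
Even something as crude as wrapping one representative query in a timer will give you that baseline. A sketch, assuming SQLite; the table name and search term are placeholders:

Code:
#include <sqlite3.h>
#include <ctime>
#include <cstdio>

// Sketch: time one representative subject search to get a baseline figure.
// Second resolution is plenty at this scale.
static void time_query(sqlite3 *db)
{
    const char *sql =
        "SELECT COUNT(*) FROM nntp_header WHERE subject LIKE '%linux%';";

    std::time_t start = std::time(0);
    sqlite3_exec(db, sql, 0, 0, 0);
    std::time_t stop = std::time(0);

    std::printf("query took about %ld seconds\n", (long)(stop - start));
}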
 
Why on earth would you throw away 3NF? Twitter denormalizing their DB makes sense because of the way the program functions (relatively small inserts and billions of reads). Denormalizing something like SAP would be a disaster. It goes back to each of these companies specializing their DBs for their specific needs, highly transactional or highly reportable. Data warehouses have used denormalized databases for decades to make reporting on TBs of data manageable. Partitioning, specialized indexing, etc. all help to make this possible.

Yes, I know.... I'm a staunch advocate for Codd and Kimball approaches.

I was trying to point out the folly of presuming a practice as "best" because specific large players adapted their specific implementation to it.
 
Not to be a dick, but I think you really need to do some testing and get a better handle on what you are trying to accomplish. [...] Get a baseline of how long searching a 3-billion-record table takes for a normal query in your program, then define a goal and start researching how to improve your query time.
Indeed, regardless of what technology boy uses, he's going to have to do I/O on consumer hardware. Let's say each one of those records is about 128 bytes. A billion records is 128 billion bytes, then.

Scanning such a table with a naive implementation is going to read all 128 billion bytes. On consumer hardware, assuming the file is perfectly sequential, you'll get 50 megabytes per second. That's 2560 seconds, then, to scan the whole table.

Say we make a columnar store, such that only the title is stored and the whole record isn't stored. Maybe the max title is 32 bytes; 32 billion bytes total, then. We're down to 640 seconds which is an improvement, but still unacceptable performance.

A less naive implementation would make an inverted index and search for keywords, discarding noise words, and so on. Maybe we set a 16-byte limit on "words" and build a B-Tree out of that. 4096 bytes per page, minus some overhead, gives us, say, 150 words per page. If we assume there are five words per subject, that's 5 billion words for the billion-row table. log[150](5 billion) is a tree height of about 5, so we will do 5 seeks to find the first instance of a word, and then do sequential reads from there.

That's much better, but unless the chosen DBMS implements a full-text search feature, it's a ton of work to implement. It's also a lot more storage; just the index would be 135 gigs, and that's only single words, so we've not yet implemented any proximity searching. The index constantly needs to be updated -- on each removal or insert -- and will get fragmented, so it'll quickly double in size. At that point, you'll want to compress it, which means you're looking at another hour of I/O on a consumer SATA drive.
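
For what it's worth, SQLite's FTS module will maintain roughly this kind of inverted index for you, if an embedded engine with full-text search enabled is acceptable. A sketch; the table name is a placeholder:

Code:
#include <sqlite3.h>

// Sketch: let SQLite's FTS module maintain the word index over subjects,
// rather than hand-rolling the B-Tree described above. Requires a SQLite
// build with the FTS extension enabled; names are placeholders.
static bool setup_subject_index(sqlite3 *db)
{
    const char *ddl =
        "CREATE VIRTUAL TABLE subject_fts USING fts4(subject);";
    return sqlite3_exec(db, ddl, 0, 0, 0) == SQLITE_OK;
}

// A keyword search then becomes:
//   SELECT rowid FROM subject_fts WHERE subject MATCH 'linux';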

Boy will learn a lot doing this project, but it's just not realistic for consumer hardware at this time.
 
How does your application perform in terms of disk performance? Is it running on a single disk? I ask because the nature of my project means that 9 times out of 10, the end user will be running on a single-disk system with ~4GB RAM.

On a single-disk system, do multiple files perform better than a single big file?

My application runs on an embedded system with a 500 MHz processor and 32-128 MB of RAM. Typically the files are read off a CompactFlash card or external hard disk. The CPU has a maximum transfer rate of 2.4 MB/s off the CF interface, as long as I read at least 4 KB at a time (due to FAT32).

Yes, this does make it a challenge to sort through 100 gigs of data, but I was able to pull it off with careful consideration of the hardware and file system after running some benchmarks. I'm not writing database records, though. It's a type of binary data on which I can perform some pre-scaling during the recording process, which allows me to read different scales to get different zoom levels on the data without having to read the entire full-resolution set. This makes it feasible to show the results in a few seconds rather than forever.

Some things you need to consider are the file system and the type of media you are storing it on. For example, the file system has limits on maximum file size, etc. Files on a FAT32 file system are a linked list of clusters, so smaller file sizes perform better for me. When I need to seek backwards, I believe it has to go to the beginning of the file and seek forward, which can be a major performance issue.

Also, because I'm writing out multiple scales of data during the recording, if I'm not careful this fragments the disk almost perfectly (if I'm writing out files round-robin in small pieces). That's something you should consider. For example, you may want to allocate contiguous space for a file before it needs it. Of course, if you are on solid-state media, disk fragmentation isn't that big of a deal, but when you record to a traditional disk drive it can cause your program to grind to a halt (especially if you were to fragment a file completely and seek backwards).
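
On the OP's target (desktop Linux), one way to reserve that space up front is posix_fallocate(). A sketch; it doesn't guarantee physical contiguity, but it does avoid growing the file in tiny increments:

Code:
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Sketch: create a file and reserve its space before writing into it,
// to reduce fragmentation caused by incremental growth.
static int create_preallocated(const char *path, off_t bytes)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, bytes) != 0) {
        std::fprintf(stderr, "posix_fallocate failed\n");
        close(fd);
        return -1;
    }
    return fd;   // caller writes into the reserved space, then closes it
}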

Anyway, try writing software for an embedded system with not-so-great specs and a shitty operating system with an outdated memory architecture. It's a lot of fun; you can't get away with what you would on a desktop computer with 4 gigs of RAM.
 
As you can see from mike's post and mine, you really need to consider how to implement this so that you can intelligently find the data you are interested in. The naive solution, with this much data, is likely to take a long time. Think how you can index, partition, structure, or scale the data to reduce the time it will take to read.

For my case, my scaling made it possible in terms of performance, but cost me 15% additional disk space (which was well worth it).
 
Gentlemen, many thanks for the replies, they have been useful. I know that this project is possible on consumer hardware, because there's other software out there that does it (however badly). I posed the question because I thought that there were obvious things I could avoid, or perhaps tools that were better suited to the job than what I had in mind. :)
 