Big data export cost and latency

I was speaking to the CEO of a company that has data we are interested in. His pricing model has big reductions if you take the data with greater latency. Real-time is about 10 seconds behind data production in the real world, but it is cheaper an hour later, cheaper again a day later, cheaper again a week later, and cheaper again a month later.

He refused to explain why, claiming proprietary tech, which is why they are the #1 provider and recently raised 150 million in venture funds.

I can understand a difference between real-time and, say, an hour later due to computation. I also know that you get cheaper computing overnight than during the day with AWS. But beyond that I can't think of a technical reason why pricing should decrease exponentially. I believe they use AWS.

I suspect there is simply a business reason, that more up-to-date data is more valuable, but we are trying to work out a cost-plus revenue model.

To put things in perspective, he is suggesting costs would be 10 million USD a year for real-time, but our request of 10-15k a month for older data is a level that can be discussed. We are happy with old data; we get slightly better quality with newer data, but not at the exponential costs that are frankly a wild pipe dream for us.

The company probably generates terabytes a day, so AWS pricing does have a real cost.
 
They're just value pricing it. Real-time data is a big shiny object that non-technical folks are wetting their pants over without realising/thinking "do I need that?" or what is sacrificed to make that possible (such as data validation/checking in my line of business). Sure, it's harder to do than scheduled runs, but once it's working the overheads are purely infrastructure like you say.

It's also possible he's quoting proprietary tech on the basis he can't explain a valid technical reason; wouldn't be the first or last to do it!
 
The cost of processing such a large amount of data will be very high.
If the data can be served from a cache, they'll save a huge amount.

If you want real-time data, it needs to be processed every single time. If you want hourly data, it can be processed once per hour then cached. If you want monthly data, it can be processed once per month then cached.
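
Roughly something like this, as a purely illustrative sketch (all names are made up, and it assumes the provider really does serve the delayed tiers from pre-built snapshots):

```python
# Purely illustrative; build_export() stands in for whatever heavy scan/filter
# they run over the raw data, and all names are made up.
import time

TIER_SECONDS = {"hourly": 3600, "daily": 86400, "monthly": 30 * 86400}
_cache = {}  # (tier, bucket) -> pre-built export


def build_export(as_of):
    return b"..."  # placeholder for the expensive processing step


def get_export(tier):
    if tier == "realtime":
        return build_export(time.time())  # recomputed for every single request
    bucket = int(time.time()) // TIER_SECONDS[tier]
    if (tier, bucket) not in _cache:
        # only the first request in each hour/day/month pays the processing cost
        _cache[(tier, bucket)] = build_export(bucket * TIER_SECONDS[tier])
    return _cache[(tier, bucket)]  # everyone else is served from the cache
```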
 
The cost of processing such a large amount of data will be very high.
If the data can be served from a cache, they'll save a huge amount.

If you want real-time data, it needs to be processed every single time. If you want hourly data, it can be processed once per hour then cached. If you want monthly data, it can be processed once per month then cached.


I get that real-time has some complexities, but I fail to see how caching at daily and monthly intervals can save significant costs.

I don't think the data needs much processing. It basically has to be stored in some kind of non-relational NoSQL database.

The other thing is some clients will want the data in real time, so the data would only get processed once, even if they were making use of cheaper overnight compute resources.
 
They're just value pricing it. Real-time data is a big shiny object that non-technical folks are wetting their pants over without realising/thinking "do I need that?" or what is sacrificed to make that possible (such as data validation/checking in my line of business). Sure, it's harder to do than scheduled runs, but once it's working the overheads are purely infrastructure like you say.

It's also possible he's quoting proprietary tech on the basis he can't explain a valid technical reason; wouldn't be the first or last to do it!


I wouldn't be surprised, but we don't actually need a real-time feed of this data. A month old gets us 95% of the way there.

At this stage we weren't given detailed pricing for month-old data.

The guy was a hot-headed, arrogant CEO who had just landed 150 million in funding. We need them more than they need us, so they drive the bargaining.


We deal with several GB a day and have simple in-house solutions, but we're not experienced with multiple terabytes a day.
 
I get that real-time has some complexities, but I fail to see how caching at daily and monthly intervals can save significant costs.

I don't think the data needs much processing. It basically has to be stored in some kind of non-relational NoSQL database.

The other thing is some clients will want the data in real time, so the data would only get processed once, even if they were making use of cheaper overnight compute resources.

There's a cost to them every time you request data. If they give you daily updates, you're probably going to request the data daily. If they give you monthly updates, you'll probably request it monthly.
The more often you access the data, the more it will cost them so it makes sense to me that they'd have a pricing structure as they do.
 
Completely depends what kind of data and what you're doing with it, eg high-frequency traders build dedicated undersea cables to shave fractions of a second...

You could definitely get a good price assuming month-old data is mostly useless to their other clients, but if they've just taken a lot of funding they'll be focussed on the main aspect of their business, so don't be quick to write off the CEO; it seems like it's you who doesn't understand the value or costs of what they can provide... perhaps there's another supplier or an aggregator who might be more amenable? :)
 
They're just value pricing it. Real-time data is a big shiny object that non-technical folks are wetting their pants over without realising/thinking "do I need that?" or what is sacrificed to make that possible (such as data validation/checking in my line of business). Sure, it's harder to do than scheduled runs, but once it's working the overheads are purely infrastructure like you say.

It's also possible he's quoting proprietary tech on the basis he can't explain a valid technical reason; wouldn't be the first or last to do it!

This. Like many products, the market value has very little to do with its cost.
 
Likely to be value-based, but could it also be affected by something like Amazon's varying S3 Glacier storage retrieval costs?
 
Completely depends what kind of data and what you're doing with it, eg high-frequency traders build dedicated undersea cables to shave fractions of a second...

You could definitely get a good price assuming month-old data is mostly useless to their other clients, but if they've just taken a lot of funding they'll be focussed on the main aspect of their business, so don't be quick to write off the CEO; it seems like it's you who doesn't understand the value or costs of what they can provide... perhaps there's another supplier or an aggregator who might be more amenable? :)

The data does have super low latency even when requesting real-time; it is 15 seconds old.

While the real-time data likely has more value to more customers, for us there is much less difference in value. And while market conditions may dictate general pricing, our discussions revolved around a cost-plus model, with the plus mostly being equity, so we would hope the prices quoted were close to true operating costs. So the value of the data to the market shouldn't be a consideration, only the costs, and yes, we don't fully understand how costs could decrease exponentially over time, which is why I asked.

There are other data suppliers, but it is a complex issue: lots of companies are too worried about data privacy, others see us as competitors, and others we already have a partnership with, but politics and attorneys are slowing discussions down to a crawl.
 
Likely to be value-based, but could it also be affected by something like Amazon's varying S3 Glacier storage retrieval costs?


Yeah, I expect something like AWS S3 Glacier is used to get cheaper storage for older data, but that amounts to about a 5x cost difference, while the costs suggested were more like 100-200x cheaper at a month old, which sounds much closer to valuation than operating costs.
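
For what it's worth, the storage-tier gap roughly checks out as a single-digit to low-tens multiplier, nowhere near 100-200x (indicative S3 list prices; treat the exact numbers as assumptions and check current pricing):

```python
# Indicative S3 list prices (us-east-1, subject to change; treat as assumptions).
S3_STANDARD = 0.023        # $/GB-month
GLACIER_FLEXIBLE = 0.0036  # $/GB-month
GLACIER_DEEP = 0.00099     # $/GB-month (Deep Archive)

print(f"Standard vs Glacier Flexible: {S3_STANDARD / GLACIER_FLEXIBLE:.1f}x")  # ~6x
print(f"Standard vs Glacier Deep:     {S3_STANDARD / GLACIER_DEEP:.1f}x")      # ~23x
```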
 
Yeah, I expect something like AWS S3 Glacier is used to get cheaper storage for older data, but that amounts to about a 5x cost difference, while the costs suggested were more like 100-200x cheaper at a month old, which sounds much closer to valuation than operating costs.

Try calculating it in cost per request:

Monthly = 1 request per month
Daily = ~30 x more requests
Hourly = ~720 x more requests
"Real Time" (every 10 seconds) = 250,000 x more requests.

The cost is on an exponential scale because the time differences (hourly, daily, monthly) are on an exponential scale?
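
A quick back-of-envelope check of those multipliers (assuming a 30-day month and one request per delivery interval):

```python
# Rough check of the request-count multipliers above; 30-day month assumed.
MONTH_SECONDS = 30 * 24 * 3600

for label, interval in [("monthly", MONTH_SECONDS),
                        ("daily", 24 * 3600),
                        ("hourly", 3600),
                        ("real-time (10s)", 10)]:
    print(f"{label:16} {MONTH_SECONDS // interval:>9,} requests/month")

# monthly                  1 requests/month
# daily                   30 requests/month
# hourly                 720 requests/month
# real-time (10s)    259,200 requests/month
```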
 
Try calculating it in cost per request:

Monthly = 1 request per month
Daily = ~30 x more requests
Hourly = ~720 x more requests
"Real Time" (every 10 seconds) = 250,000 x more requests.

The cost is on an exponential scale because the time differences (hourly, daily, monthly) are on an exponential scale?


The number of requests increases, but the amount of data per request decreases at the same rate. While I understand there may be some cost in exporting the data, the main cost on, say, AWS is just per GB sent out, which doesn't change.

At the end of the day it doesn't matter too much for us; monthly is sufficient short term, and longer term we can see about a new deal.

I was just mostly interested in whether anyone has experience with very big data and could provide insight into why export costs would decrease exponentially. The number of requests might be part of it. The main issue was that he was just not willing to explain any reasoning, which makes it hard for us to have faith in quotes that are supposedly cost plus.
 
It's about how difficult it is to provide. If real time is the same price then most customers will choose that. That means you've got no time to move the data elsewhere before serving it to customers. That'll have a cost. It also means if something goes wrong, everyone is immediately affected, rather than having a day or so to sort it out.
 
The data does have super low latency even when requesting real-time; it is 15 seconds old.
.....
While the real-time data likely has more value to more customers, for us there is much less difference in value.
.....
So the value of the data to the market shouldn't be a consideration, only the costs
....
, and yes, we don't fully understand how costs could decrease exponentially over time, which is why I asked.
Without knowing the other half of the story it's impossible to say but it's blindingly obvious that real-time data for stock traders is incredibly valuable, whereas data from even an hour ago is probably useless.

Or for the F1, the weather in 2 minutes' time is incredibly important, but the weather for yesterday or tomorrow is pointless.

There's also value in people knowing what this firm thought was the truth at time X, so there are all kinds of value options...

If what they're selling isn't an easily-available commodity then you can't just rock up and ask for it at "cost price" because "cost price" is completely irrelevant - and thinking the cost is just the bandwidth is surely offensive to them as presumably work goes into building whatever it is. The cost of building a website can be 50k but the monthly bandwidth bill could be a fiver!

I don't want to be mean, but either negotiate a better deal (without whining to them that it doesn't have value X and you want it at 'cost price' - pick a different tactic), or hand this job over to someone in your organisation with more commercial experience?

Best not bad-mouth the CEO on a public forum though because the world is surprisingly small!

(No I'm not the CEO, I've not got the foggiest what this is about :p)

Good luck!
 
The number of requests increases, but the amount of data per request decreases at the same rate. While I understand there may be some cost in exporting the data, the main cost on, say, AWS is just per GB sent out, which doesn't change.

They need to pay for compute time on AWS too.
You say that the amount of data transferred decreases if you make more requests, so presumably somebody is storing the last requested time/id (do you send them that info with the request or do they store that for you?)
So each time you request data, it'll involve:

-Normal processing/routing of incoming requests.
-Query metadata database to check your account is valid and find the last time you requested data
-Query main data to check for any changes since last time you requested it
-Package data and return it to you

Even if there was no changed data to return, there would still be CPU time and queries run every request you make. The probability of a large number of requests coming in at the same time increases (exponentially?) as the number of requests increases so they'll need a lot more compute power on-hand to deal with the peaks.
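
As a rough illustration of the shape of that curve (the unit costs below are completely made up; only the egress figure is in the ballpark of AWS's internet egress rate), the flat per-request overhead can end up dominating even though the total GB shipped per month is the same at every tier:

```python
# Made-up unit costs, purely to show the shape: egress is flat across tiers,
# while per-request overhead (auth, queries, packaging, peak headroom) scales
# with the number of requests.
EGRESS_PER_GB = 0.09      # roughly AWS internet egress, $/GB (assumption)
COST_PER_REQUEST = 0.50   # hypothetical compute/query/IO cost per export
GB_PER_MONTH = 30_000     # same total data delivered regardless of tier

for label, requests in [("monthly", 1), ("daily", 30),
                        ("hourly", 720), ("real-time (10s)", 259_200)]:
    total = GB_PER_MONTH * EGRESS_PER_GB + requests * COST_PER_REQUEST
    print(f"{label:16} ${total:>12,.2f}/month")
```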

I'm sure there will be more to it than purely the cost of running the service, too. I think it's a bit ironic that you're calling their CEO arrogant whilst demanding they give you their service at cost price? :confused:
 
With a totally non-tech head on, I imagine this is purely value. Some of our services are more value driven than cost and a very good analogy is something like Getty.

A photo that was taken 5 years ago might cost £10 to use for editorial, £50 for low res web, £100 medium res, £200 high res, £500 large scale print use, £1000 TV broadcast, etc. Same photo, same work that went into it, just a different resolution, file size and usage, all of which are minimal or no additional cost whatsoever. You're paying for value.
 
Is the weekly/monthly etc data as granular as real-time?

I'd suspect that if the main aim of their business is supplying real-time data to their clients, after probably an hour the data is likely now worthless to them, and they probably only store it for archival/record purposes. If you come along and are now interested in the 'old' data, then this company will see another revenue stream, even if it's considerably smaller than usual.
 
They need to pay for compute time on AWS too.
You say that the amount of data transferred decreases if you make more requests, so presumably somebody is storing the last requested time/id (do you send them that info with the request or do they store that for you?)
So each time you request data, it'll involve:

-Normal processing/routing of incoming requests.
-Query metadata database to check your account is valid and find the last time you requested data
-Query main data to check for any changes since last time you requested it
-Package data and return it to you

Even if there was no changed data to return, there would still be CPU time and queries run every request you make. The probability of a large number of requests coming in at the same time increases (exponentially?) as the number of requests increases so they'll need a lot more compute power on-hand to deal with the peaks.

I'm sure there will be more to it than purely the cost of running the service, too. I think it's a bit ironic that you're calling their CEO arrogant whilst demanding they give you their service at cost price? :confused:

Of your list, only the last point, actually sending data to us, would have any real computational cost though. I mean, stuff like checking our credentials is negligible compared to storing and sending us terabytes of data. The data is entirely static, so once processed or provided it doesn't change. The raw data probably requires some kind of filtering, but that would have to be done for all clients and only upon reception.


I can appreciate that if there are multiple requests at peak times this will increase the peak computation for handling different queries on the data, which is likely why real-time export is so expensive. Weekly or monthly data can be processed overnight on cheaper AWS spot instances.

So it is probably a combination of cheaper storage costs for older data and reduced peak query load.


As to pricing, well, we simply don't have the capital to pay their market rate for the data, so ideally we would have a revenue-share model and wouldn't pay a dime until we have paying clients, but since they incur operating costs a cost-plus model makes more sense. This is pretty standard; several of our clients run this kind of pricing model with the data we provide.
 