Big data export cost and latency

I was speaking to the CEO of a company that has data we are interested in. His pricing model has big reductions if you take the data with greater latency. Real-time is about 10 seconds behind data production in the real world, but it is cheaper an hour later, cheaper again a day later, cheaper again a week later, and cheaper again a month later.

He refused to explain why, claiming proprietary tech, which is why they are the #1 provider and recently raised 150 million in venture funds.

I can understand a difference between real-time and say an hour later due to computation. I also know that you get cheaper computing overnight than in the day with AWS. But after that I can't think of a technical reason why pricing should decrease exponentially. I believe they use AWS.

I suspect there is simply a business reason (more up-to-date data is more valuable), but we are trying to work out a cost-plus revenue model.

To put things in perspective, he is suggesting costs would be 10 million USD a year for real-time, but our request of 10-15k a month for older data is a level that can be discussed. We are happy with old data; we get slightly better quality with newer data, but not at the exponential costs, which are frankly a wild pipe dream for us.

The company probably generates terabytes a day, so AWS pricing does have a real cost
 
The cost of processing such a big amount of data will be very high.
If the data can be served from a cache, they'll save a huge amount.

If you want real-time data, it needs to be processed every single time. If you want hourly data, it can be processed once per hour then cached. If you want monthly data, it can be processed once per month then cached.
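
A minimal sketch of what that tiered caching could look like, assuming the provider serves each latency tier from a pre-built snapshot (the tier names, refresh intervals and cache store here are all hypothetical):

```python
import time

# Hypothetical latency tiers and how often each tier's snapshot gets rebuilt (seconds).
TIER_REFRESH = {
    "realtime": 10,
    "hourly": 3600,
    "daily": 86400,
    "monthly": 30 * 86400,
}

_cache = {}  # tier -> (built_at, snapshot)

def get_snapshot(tier, build_snapshot):
    """Return the cached snapshot for a tier, rebuilding only when it has expired."""
    built_at, snapshot = _cache.get(tier, (0.0, None))
    if time.time() - built_at >= TIER_REFRESH[tier]:
        snapshot = build_snapshot()  # the expensive processing step
        _cache[tier] = (time.time(), snapshot)
    return snapshot

# e.g. get_snapshot("monthly", lambda: "...heavy query result...") only recomputes
# roughly once a month, while "realtime" effectively recomputes on every request.
```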


I get that real-time has some complexities, but I fail to see how caching at daily and monthly intervals can save significant costs.

I don't think the data has much processing to do. It basically has to be stored in some kind of non-relational NoSQL database.

The other thing is some clients won't want the data in real-time, so the data would only get processed once, and even then they could be making use of cheaper overnight compute resources.
 
They're just value-pricing it. Real-time data is a big shiny object that non-technical folks are wetting their pants over without realising/thinking "do I need that" or what is sacrificed to make that possible (such as data validation/checking in my line of business). Sure, it's harder to do than scheduled runs, but once it's working the overheads are purely infrastructure, like you say.

It's also possible he's quoting proprietary tech on the basis that he can't explain a valid technical reason; he wouldn't be the first or last to do it!


I wouldn't be surprised, but we don't actually need a real-time feed of this data. A month old gets us 95% of the way there.

At this stage we weren't given detailed pricing for month old data.

The guy was a hot-headed, arrogant CEO who had just landed 150 million in funding. We need them more than they need us, so they drive the bargaining.


We deal with several GB a day and have simple in-house solutions, but we are not experienced with multiple terabytes a day.
 
Completely depends what kind of data and what you're doing with it, eg high-frequency traders build dedicated undersea cables to shave fractions of a second...

You could definitely get a good price, assuming month-old data is mostly useless to their other clients, but if they've just taken a lot of funding they'll be focussed on the main aspect of their business, so don't be quick to write off the CEO; it seems like it's you who doesn't understand the value or costs of what they can provide. Perhaps there's another supplier or an aggregator who might be more amenable? :)

The data does have super low latency: even when requesting real-time, it is 15 seconds old.

While the real-time data likely does have more value to more customers, for us there is much less difference in value. And while market conditions may dictate general pricing, our discussions revolved around a cost-plus model, with the plus mostly being equity, so we would hope the prices quoted were close to true operating costs. So the value of the data to the market shouldn't be a consideration, only the costs, and yes, we don't fully understand how costs could decrease exponentially over time, which is why I asked.

There are other data suppliers, but it is a complex issue: lots of companies are too worried about data privacy, others see us as competitors, and others we already have a partnership with, but politics and attorneys are slowing discussions to a crawl.
 
Likely to be value-based, but could it also be affected by something like Amazon's varying S3 Glacier storage retrieval costs?


Yeah, I expect something like AWS S3 Glacier is used to get cheaper storage for older data, but that amounts to about a 5x cost difference, while the costs suggested were more like 100-200x cheaper at a month old, which sounds much closer to valuation than operating costs.
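
A quick back-of-envelope using only the figures from this thread (not actual AWS list prices), just to show how much of the gap storage tiering leaves unexplained:

```python
# Figures taken from this thread, not from AWS pricing pages.
storage_saving = 5      # ~5x cheaper storage for archive-class data (e.g. a Glacier tier)
quoted_discount = 150   # midpoint of the 100-200x price reduction quoted at one month old

unexplained = quoted_discount / storage_saving
print(f"Storage tiering explains ~{storage_saving}x; roughly {unexplained:.0f}x is left unexplained.")
```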
 
Try calculating it in cost per request:

Monthly = 1 request per month
Daily = ~30 x more requests
Hourly = ~720 x more requests
"Real Time" (every 10 seconds) = 250,000 x more requests.

The cost is on an exponential scale because the time differences (hourly, daily, monthly) are on an exponential scale?
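
Those multipliers do check out if you assume one full pull per interval and a 30-day month; a quick sanity check:

```python
# Requests per month at each latency tier, assuming one pull per interval
# and a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600

intervals = {
    "monthly": SECONDS_PER_MONTH,
    "daily": 24 * 3600,
    "hourly": 3600,
    "real-time (10s)": 10,
}

for name, seconds in intervals.items():
    requests = SECONDS_PER_MONTH // seconds
    print(f"{name:>16}: {requests:>9,} requests/month")
# monthly: 1, daily: 30, hourly: 720, real-time: 259,200 (~250,000x)
```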


The number of requests increases, but the amount of data per request decreases at the same rate. While I understand there may be some cost in exporting the data, the main cost on, say, AWS is just per GB sent out, which doesn't change.

At the end of the day it doesn't matter too much for us, monthly is sufficient short term and longer term we can see about a new deal.

I was mostly just interested in whether anyone has experience with very big data and could provide insights into why export costs would decrease exponentially. The number of requests might be part of it. The main issue was that he was just not willing to explain any reasoning, which makes it hard for us to have faith in quotes that are supposedly cost plus.
 
They need to pay for compute time on AWS too.
You say that the amount of data transferred decreases if you make more requests, so presumably somebody is storing the last requested time/id (do you send them that info with the request or do they store that for you?)
So each time you request data, it'll involve:

-Normal processing/routing of incoming requests.
-Query metadata database to check your account is valid and find the last time you requested data
-Query main data to check for any changes since last time you requested it
-Package data and return it to you

Even if there was no changed data to return, there would still be CPU time and queries run every request you make. The probability of a large number of requests coming in at the same time increases (exponentially?) as the number of requests increases so they'll need a lot more compute power on-hand to deal with the peaks.
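
A minimal sketch of that kind of per-request flow, with made-up names throughout (this is not their actual stack, just an illustration of where the fixed per-request work sits):

```python
import json
import time

def handle_export_request(account_id, since_ts, metadata_db, data_store):
    """Hypothetical per-request flow; every step runs even when no new data exists.

    metadata_db and data_store stand in for whatever account store and main
    data store the provider actually uses (made-up interfaces).
    """
    # 1. Routing / validation of the incoming request.
    account = metadata_db.get_account(account_id)
    if account is None or not account.get("active"):
        raise PermissionError("unknown or inactive account")

    # 2. Metadata lookup: when did this account last pull data?
    last_read = metadata_db.get_last_read(account_id) or since_ts

    # 3. Query the main store for anything newer than that point.
    new_rows = data_store.query_since(last_read)

    # 4. Package and return; egress cost scales with data volume,
    #    but steps 1-3 burn CPU and queries on every request regardless.
    metadata_db.set_last_read(account_id, time.time())
    return json.dumps(new_rows)
```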

I'm sure there will be more to it than purely the cost of running the service, too. I think it's a bit ironic that you're calling their CEO arrogant whilst demanding they give you their service at cost price? :confused:

Of your list, only the last point, actually sending data to us, would have any real computational cost though. I mean, stuff like checking our credentials is meaningless compared to storing and sending us terabytes of data. The data is entirely static, so once processed or provided it doesn't change. The raw data probably requires some kind of filtering, but that would have to be done for all clients and only upon reception.


I can appreciate that if there are multiple requests at peak times, this will increase peak computation handling different queries on the data, which is likely why real-time export is so expensive. Weekly or monthly data can be processed overnight on cheaper AWS spot instances.

So it is probably a combination of cheaper storage costs for older data and reduced peak query load.


As to pricing, well, simply put, we don't have the capital to pay their market rate for the data. Ideally we would have a revenue-share model and wouldn't pay a dime until we have paying clients, but since they incur operating costs, a cost-plus model makes more sense. This is pretty standard; several of our clients run this kind of pricing model with the data we provide.
 
With a totally non-tech head on, I imagine this is purely value. Some of our services are more value-driven than cost-driven, and a very good analogy is something like Getty.

A photo that was taken 5 years ago might cost £10 to use for editorial, £50 for low res web, £100 medium res, £200 high res, £500 large scale print use, £1000 TV broadcast, etc. Same photo, same work that went into it, just a different resolution, file size and usage, all of which are minimal or no additional cost whatsoever. You're paying for value.


Totally get that; this specific data is infinitely more valuable real-time than even an hour old. But we are trying to work out a pricing model that is based on operating costs and not value, to allow us to develop our market and clients, at which point paying market value becomes easy for us.
 
Is the weekly/monthly etc data as granular as real-time?

I'd suspect that if the main aim of their business is supplying real-time data to their clients, after probably an hour the data is likely now worthless to them, and they probably only store it for archival/record purposes. If you come along and are now interested in the 'old' data, then this company will see another revenue stream, even if it's considerably smaller than usual.

Yes, same granularity. Potentially they can compress the data better once it is no longer real-time.

As mentioned in the post above, most clients get much more value from the real-time data, but the older data is not worthless and isn't purely archival.
 
You said the amount of data transferred would reduce if the amount of requests increased.
You either need to send the entire dataset at each request (which is a lot of bandwidth) or you need to filter the dataset to select only the new data since the last request (which is processing time). You can't have it both ways.

The data is requested by timestamps, so there is some filtering required, but the data should already be indexed by timestamp; at least it would be odd if it wasn't. Either way, they don't need to store anything about prior requests, as each data export would just be given updated query times.
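
For illustration, a minimal sketch of that kind of stateless, timestamp-driven export (table and column names are made up; the point is the client supplies the window and the server keeps no per-client state):

```python
import sqlite3

# In-memory stand-in for the provider's store, with an index on the event timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts REAL, payload TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")

def export_window(conn, start_ts, end_ts):
    """Return all rows in [start_ts, end_ts); the index on ts keeps the filter cheap."""
    return conn.execute(
        "SELECT ts, payload FROM events WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start_ts, end_ts),
    ).fetchall()

# A monthly pull just passes last month's boundaries as the query times.
```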

Your "cost plus" model might make sense to you but it doesn't make any sense for them.
They obviously have clients who do pay the market rate for the data - why would they risk upsetting those clients by giving you a much cheaper price?

Firstly, it is not our pricing model; this is the pricing model they are proposing to us, so of course it makes sense to them.
Secondly, what their other clients pay is largely irrelevant. The value of the data is whatever the client finds value in, and different clients in different industries will have vastly different values, so their costs will differ.

Thirdly, this is entirely irrelevant to the discussion, is far more complex than can be discussed here, and is under NDA, so it is entirely pointless to speculate. I am only interested in what technologies and services are achievable for big data storage and export and how this relates to the cost. I have had some reasonable advice that would certainly explain a good deal of the cost differences.
 
Dear Amazon, 1Gb network transfer only really costs fractions of a penny and all I'm doing is streaming cat videos so please sell it to me at that cost price for that bandwidth, not your list price of 10p, because I don't place any value on all the kit you invested in to deliver the bandwidth in the first place.

At some point you have to admit defeat here....



More irrelevance, and as it happens we get over 100K's worth of AWS for free.

You obviously have very little understanding of how companies actually negotiate.
 