WNiCF - Interview with Dimitris Perdikou - Gamification and public sector

What's new in Cloud FinOps?

Stephen Old and Frank Contrepois get together to discuss what's new in the world of cloud when it comes to FinOps. There are two monthly episodes, one where we'll discuss the top stories we've found from this month and a second episode where we bring in a friend of the show to talk to us about a topic of their choosing.

February 06, 2024 • The FinOps Guys - Stephen Old and Frank Contrepois • Season 5 • Episode 2

This month Stephen and Frank are interviewing Dimitris Perdikou from the UK Home Office.

The topic is gamification but, as often happens, the conversation grew beyond its original scope. Dimitris made us feel good about the work being done to make sure taxpayer money is well spent.

In the process, we discussed two new metrics for unit economics specific to the public sector.

You can reach us at: podcast@finopsguys.com

SteveO (00:19.284)
Hello everyone and welcome to What's New in Cloud FinOps. Today we've got an interview episode with myself, SteveO, and my good friend Frank.

Today we're going to talk about something that's really interesting and always an exciting topic whenever we speak to people about finance: we're going to speak about gamification. And to do that, we've actually gone and found someone that knows something about it. So you're not just listening to me and Frank, we've got our new friend, Dimitris Perdikou. Dimi, introduce yourself.

Dimi (00:46.637)
Thanks, and thanks for having me on. So yeah, I'm Dimitris Perdikou. At the moment, I'm the Chief Technology Officer of Data Analytics at the Home Office. I've been at the Home Office just over three years, and part of my current role is obviously involved in FinOps. And it was great to meet you guys at the FinOps Foundation event at the Royal Institute just before Christmas, which was a great venue, and good to be part of a wider FinOps community. And then...

Frank (01:10.548)
Yeah.

SteveO (01:13.588)
What a venue!

Dimi (01:14.765)
Yeah, it was awesome. I didn't realize the acoustics in there were so amazing. But yeah, I've been doing my current role about nine months. Before that, I was leading the Migration and Borders cloud platform in the Home Office, which is one of the largest cloud platforms in the country and the public sector, and also across Europe. And we had big FinOps initiatives in there as well. So it's good to be on to talk through some experiences we had, and hopefully some of the mistakes we made as well, so that other people can learn from them.

SteveO (01:43.406)
What was so interesting when you were talking at the event was that you mentioned that you'd done gamification, and then you actually learned and had to change it. But just going back to what you were saying, and I've known you for a little while and know a bit about your background: how did you find, or maybe a better question is, when did FinOps start in your role? Because obviously you were doing the migration. Was it something you thought about ahead of time? Was it something that came about when you suddenly saw how much money you were spending? How did that start?

Dimi (02:16.941)
Yeah, I think it was a pretty typical journey of, we're going to migrate. And thankfully, it wasn't a lift and shift. It was a transformation at the time to microservices and containers as well. But in typical cloud fashion, we kind of left cost aside: we'll just work that out later down the line. Until we were about, I think, 12 to 18 months in, and costs started skyrocketing. We could see it just going up massively. So suddenly we're thinking, right, how do we get on top of the costs and do things better? What opportunities do we need to identify?

So then very quickly a few of us skilled up. But the good thing about a lot of cloud providers is it's all there to see, right? You either pay less by using less, or you find other mechanisms, whether it's through some sort of enterprise agreement or reserved instances, all those kinds of things. So it was a quick, right, here's the list of all the things we can do; how do we go about doing it? And then we started trying to be quite pragmatic as well. That's the other thing, the biggest thing I've learned: just be pragmatic in all these things. You're never going to be...

Frank (02:56.694)
Yeah.

Dimi (03:16.017)
perfectly optimized. So think about what we're doing, and do things quickly. And I think the reason we started on gamification is because we were using a lot of containers at the time and we were like, how do we try and work out how to use them less, particularly overnight? Because most of the teams work during the day, we kind of went, stick on an eight to six, give or take when some people start and finish, that kind of thing. We needed to work out quickly who was leaving stuff on overnight rather than shutting down the containers.

So we did a really naughty thing, actually. We had one of our guys who was managing the Kubernetes cluster at the time. We basically went to him: how quickly can you tell us, like tomorrow, who left their stuff on overnight? And he was like, well, I don't really know. I can work it out, work out the right hours, all this kind of thing. And in the end we came to an agreement: basically he's going to run a script at two in the morning that says what's running, and output that to Slack and go, this was running. And we purposely didn't tell all the teams that that's how simple it was, so they wouldn't start trying to play the system a bit. Eventually some of them found out.

But it was like, it's easy: look at what's running at two in the morning, run the script, done and over. Output to Slack: here's all the people, here's what you're running, here's the namespaces it's running in, this is how much CPU and memory you're using. And that was our kind of quick way of doing it. You're going to be seen on this channel as the team that's top of the chart, because it was obviously sorted by who was using the most at that time. And basically, when you left your stuff on overnight...

Frank (04:15.638)
You

Dimi (04:40.809)
Why did you leave it on overnight? Go and shut it down for tomorrow.
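Editor's note: the 2 a.m. "top of the chart" report described above can be sketched roughly as follows. This is a minimal illustration, not the Home Office script: the `Workload` record, its field names and the message format are assumptions, and collecting the snapshot from the cluster and posting the text to Slack are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """One running workload seen in the overnight snapshot (hypothetical shape)."""
    team: str          # owning team
    namespace: str     # Kubernetes namespace it runs in
    cpu_cores: float   # CPU in use at snapshot time
    memory_gib: float  # memory in use at snapshot time


def overnight_report(snapshot: list[Workload], top_n: int = 5) -> str:
    """Sum usage per team and namespace, then rank the biggest users first."""
    totals: dict[tuple[str, str], list[float]] = {}
    for w in snapshot:
        entry = totals.setdefault((w.team, w.namespace), [0.0, 0.0])
        entry[0] += w.cpu_cores
        entry[1] += w.memory_gib

    # Sort by CPU, then memory, largest first, so the heaviest user tops the chart.
    ranked = sorted(totals.items(), key=lambda kv: (kv[1][0], kv[1][1]), reverse=True)

    lines = ["Still running at 02:00:"]
    for rank, ((team, namespace), (cpu, mem)) in enumerate(ranked[:top_n], start=1):
        lines.append(f"{rank}. {team} / {namespace}: {cpu:.1f} CPU cores, {mem:.1f} GiB memory")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up snapshot purely for illustration.
    example = [
        Workload("team-a", "team-a-dev", 4.0, 16.0),
        Workload("team-b", "team-b-test", 12.5, 48.0),
        Workload("team-a", "team-a-dev", 2.0, 8.0),
    ]
    print(overnight_report(example))
```

The ranking is the part that drives the gamification effect: whoever is using the most at snapshot time appears first in the channel.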

SteveO (04:44.11)
And was this just in development? Or was this across the board?

Dimi (04:49.127)
So it was the development and test environments initially. Production, obviously we're not going to shut down production overnight; all the systems are 24/7. But later on, we moved on to other initiatives, like tweaking scaling criteria so people weren't oversized in production. And then further down the line, we became really big users of Spot. And when I first started to try and understand Spot and see how we could use it, I thought, from an engineer's perspective, it was absolutely amazing. It really proves that your application can handle it.

And we're quite risk averse, generally, in the civil service. But I managed to get leadership on board to do it in production as well. So we had like half our containers on Spot and half not on Spot, so at least we had a bit of a backup all the time. And in the end, to me, it taught some really valuable lessons about production workloads as well. So yeah, that work on Spot really saved us the most of all my interventions. It still is the biggest cost saving. If you can make it work, it seems to be one of the biggest cost savings you can make.

SteveO (05:21.614)
Mm-hmm.

Dimi (05:49.406)
might not even reduce that.

SteveO (05:51.598)
But you could do that because you transformed, right? Because you didn't lift and shift, and you actually looked after the applications, worked out the right applications and that kind of stuff. And that feels like where so many people are going wrong.

Frank (05:55.158)
Alright.

Dimi (06:03.204)
Yeah, particularly on the Spot side of things. Part of the time we were running these massive Kubernetes clusters on Spot, and some teams barely saw a difference in the end, because there were just so many nodes and enough high availability that if a couple of nodes got pulled by Amazon, there was plenty of capacity to quickly spin things up. And we learned our lessons in that as well, about where they're used and what instance types. But yeah, if we were just on traditional EC2 instances and they didn't scale and things like that, that just wouldn't be possible.

Frank (06:30.294)
So that's an interesting point about Spot: normally it is what's left over from AWS, from what's available at the moment that's not required by other customers who are on demand. And the recall is usually for a specific generation, for example. So you're using lots of R6i, brilliant. And then all of a sudden AWS needs all the R6i's, or a lot of R6i's. Are they going to terminate all the Spot instances on that same generation, or is it more scaled?

Dimi (07:05.252)
Yeah, so interestingly, it's changed over time. There used to be this whole bidding thing about how much you bid and what your price is, but that's all been removed now. So they just take a selection of people who've got those instances and pull them back. But initially I think we were just on M, so just the generic type, right? And quickly we kind of went, even if these other instance types are skewed more to CPU or memory, or eventually even some of the other instance types...

SteveO (07:09.964)
Yeah.

Frank (07:11.572)
Yep.

Dimi (07:33.828)
If we're saving 80% of the cost of them, then it's still worth using them even if we don't necessarily need that exact capability, because it's 80%. It could cost 10% more, but it's an 80% saving on the cost. So it still makes sense. And then we started working very closely with AWS because of the size of our estate. We soon found out that our AWS account team kept going, can you just let us know how much you're going to grow? Because we've got some people in the data center going, what have you guys decided to do today? What are you on today? We were happy to, and in return they were going to give us the best,

Frank (07:39.83)
Oh yeah, absolutely.

Frank (07:54.164)
Hahaha

Frank (07:57.782)
I know.

Dimi (08:03.268)
better guidance on running Spot. And that's where we really started expanding the portfolio of instance types we had as well. We went originally from a few M's and C's and R's, and then into some of the wild ones as well that worked. The only thing we never quite got to in that space is the shift to Graviton, because I was a bit too worried about compiling issues and things like that. But yes, we expanded it to loads of different instance types. And in the end, what you find is that even if you lose quite a few of one instance type, we've got plenty of others.

The others spin up really quickly. We went into another phase of, how quickly do our new nodes spin up? I think we did a little bit of performance tuning, from about three minutes down to a minute or something in the end. For most applications it wasn't much, because it's just the underlying node. And then, again, it's down to the engineering, finding out how things work. We found some of the containers take even longer than that, because of all kinds of funkiness in how applications are thrown together in the containers. So teams quickly found that they've got to tweak that as well.

Frank (08:34.998)
OK, that's.

SteveO (08:35.34)
Yeah.

SteveO (08:44.32)
Wah.

Frank (08:44.82)
Yeah, well...

Frank (09:01.942)
Yeah, plus you have 50%, from what I understand, which is on demand. So you'll never drop to zero. Even if they were to remove all the instances, it would take you one minute to bring them back on another instance type, while the 50% maybe struggles a little but, yeah, still delivers.

SteveO (09:05.774)
Yeah, you got that protection, haven't you?

Dimi (09:06.05)
Yeah.

Dimi (09:19.46)
Yeah, this is where, again, you have to have a think: if you drop to 50% while heavily utilized, then you find the application starts struggling because everything starts maxing out. And initially we basically had everything at 50% Spot, 50% on demand. We hadn't really accounted for the scenario where you lose all the Spot and you can't get any other instance types at all. So then eventually we built in mechanisms where, if you can't get the Spot, it realizes and starts being more and more on demand. But you know, since we did a lot of this,

Frank (09:38.26)
Yep.

SteveO (09:38.574)
Hmm.

SteveO (09:42.988)
It builds on demand.

Dimi (09:47.428)
the mechanisms have all developed over time. Amazon have made it even easier to adjust what you want to do. When I first did this, six years ago-ish now, we really struggled, with a lot of custom scripts built in, but now they've got loads of extra support to try and make it as easy for you as possible.
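Editor's note: the fallback described here, a 50/50 Spot and on-demand split that leans further on on-demand when Spot capacity cannot be obtained, can be sketched as a small planning function. This is only an illustration of the idea under assumed inputs; the function name and parameters are hypothetical, and it does not call any AWS API or reflect how the Home Office platform was actually implemented.

```python
import math


def plan_capacity(desired_nodes: int, spot_available: int, spot_share: float = 0.5) -> dict[str, int]:
    """Split the desired node count between Spot and on-demand.

    spot_share is the target fraction on Spot (0.5 mirrors the 50/50 split
    mentioned in the conversation). If fewer Spot nodes can be obtained than
    the target, the shortfall is made up with on-demand nodes.
    """
    if desired_nodes < 0 or spot_available < 0:
        raise ValueError("node counts must be non-negative")
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")

    spot_target = math.floor(desired_nodes * spot_share)
    spot = min(spot_target, spot_available)  # cannot use more Spot than exists
    on_demand = desired_nodes - spot         # on-demand fills the remainder
    return {"spot": spot, "on_demand": on_demand}


if __name__ == "__main__":
    print(plan_capacity(20, spot_available=50))  # plenty of Spot: 10 and 10
    print(plan_capacity(20, spot_available=3))   # Spot shortage: 3 and 17
    print(plan_capacity(20, spot_available=0))   # no Spot at all: 0 and 20
```

The total always equals the desired node count, which is the property the original fixed split lacked when all the Spot capacity disappeared at once.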

SteveO (10:05.71)
So you've not gone Graviton there understandably for the compiling issues. Have you popped in Graviton for your RDS and you know, the managed services?

Dimi (10:14.18)
Yeah, definitely. RDS was an easy one, and that's part of the point of RDS, right? Just switch them all over to Graviton: bit of a cost saving, extra performance, easy win. And we're also finding ourselves moving to more serverless as well. It's still a bit of a complicated financial case: is this going to cost us more or less in infrastructure, and how do you balance that against the amount of people's time it takes as well, particularly when you've got either perms or suppliers doing the development, some of that.

SteveO (10:37.038)
Do it.

Dimi (10:43.108)
But yeah, all of our RDS moved to Graviton as well.

SteveO (10:46.414)
We had, on episode three, so that's over three years ago now, we had Dario on, and he had done a piece of work at an organization who took an app and they broke it down into, was it Python first, Frank? And it'd gone from about 30K down to 3K a month.

Frank (11:00.502)
Yeah, so it was Lambda. They were already using Lambda. They were using Python. And when he came in, he rewrote them in Go, compiled. And he said that reduced the cost by 10 again. Yeah.

SteveO (11:06.882)
Yeah.

SteveO (11:13.998)
Yeah. And 10 times faster as well. But it was a real battle on the business case, because you're already down, you know, it used to be 30K, down to 3K already. You're feeling great about that. This is going to take it to 300, but the people time, et cetera, is a balancing act. It proved really powerful in the end, but it's a difficult one. So many people don't even get that first step done, onto some containers or into serverless, right? So it's impressive.

Dimi (11:40.164)
Yeah, I think it's that pragmatic case, right, for what we're doing all the time. When we're going from no financial journey to doing something, it's huge. There are really easy wins. But after a year or two years of doing it, I put bigger pressure on the teams to say, can you just do a naughty business case to go, I think it would take us about a month's worth of work, the cost to the business of our time is about this much, and we reckon we'll save this much. So we'll

SteveO (12:05.326)
to do.

Dimi (12:09.538)
break even and start saving money in two months, three months, six months. Okay, that makes sense. And some of the cases then come together and go, we'll break even in like three or four or five years. And, like, we're probably going to change everything again by then, so yeah, maybe it isn't worth it. It doesn't have to be exact. Some people you present that challenge to, and they're suddenly like, well, I don't know the estimate. Maybe it'll be five days, maybe it will be 10 days, maybe it will be a month to do it. And then they're going, oh, but what's the cost? If it's me doing it, it's this much; if I give it to the server person team, it's this much. And

SteveO (12:20.046)
Yeah. Yeah.

Frank (12:31.766)
Ha ha ha.

SteveO (12:33.134)
Hehehehe

Dimi (12:39.236)
Just keep it simple, put some estimates in, give me a number, and don't overthink it. I had this actually recently in my latest estate when we were talking about reserved instances, because it's always a challenge to work out exactly what you want to do. If you're perfectly right-sized and you're shutting down everything in non-prod, and, the perfect case, in production you're going to have the exact size instance or bigger for the next year, then you can go ahead and do it. If you don't, you're going to have this weird balance, plus it's quite a large ecosystem.

SteveO (12:41.422)
Just give me a number. Yeah.
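Editor's note: the back-of-the-envelope business case Dimi asks his teams for (estimated effort, rough cost of people's time, expected saving, break-even point) can be written down in a few lines. The sketch below is an assumption-laden illustration: the function names, the day rate and the 12-month horizon are all hypothetical, not figures from the Home Office.

```python
def break_even_months(effort_days: float, day_rate: float, monthly_saving: float) -> float:
    """Months until the one-off cost of the work is paid back by the saving."""
    if monthly_saving <= 0:
        raise ValueError("monthly_saving must be positive")
    return (effort_days * day_rate) / monthly_saving


def worth_doing(effort_days_low: float, effort_days_high: float, day_rate: float,
                monthly_saving: float, horizon_months: float = 12.0) -> bool:
    """Take the pessimistic end of a rough effort range; proceed only if
    even that pays back within the chosen horizon."""
    worst_case_days = max(effort_days_low, effort_days_high)
    return break_even_months(worst_case_days, day_rate, monthly_saving) <= horizon_months


if __name__ == "__main__":
    # "Maybe five days, maybe a month": use a range rather than a precise estimate.
    # The 500 per day rate and 2,500 per month saving are made-up numbers.
    print(break_even_months(20, 500.0, 2500.0))  # 4.0 months
    print(worth_doing(5, 20, 500.0, 2500.0))     # True: pays back well inside a year
    print(worth_doing(5, 20, 500.0, 100.0))      # False: break-even is years away
```

Using a range and the pessimistic end keeps the exercise quick, which matches the "give me a number and don't overthink it" spirit of the conversation.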

Dimi (13:08.836)
And then the latest from this FinOps team is we were trying to work out where to focus things. We had a list; we knew we had like a hundred things we could do in the estate to save money. So I sat down one day with them on the reserved instances, and they started going through this massive spreadsheet of all the ideas and teams they thought might resize or might change instance size and this, that and the other. In the end, we just said, we could spend the next month going through this, or you could pick the stuff you know is pretty close or about right, and reserve 60 to 70% of that size for the estate. We'll do that in a day and we can move on to the other stuff. And it's not perfect. We could save more money, but we've just saved ourselves 30% on all that stuff we did reserve, and we'll move on to the next thing. And maybe next year when we come back to it, or when we're starting to do reviews quarterly and look at what's changed in the estate, then we'll go, oh, you know what? There's only smaller stuff left to save in our initiatives now. We'll go back over this and see if we can tweak this a bit further.

But I think that kind of pragmatism, you need to have that in the FinOps journey, because it's so easy to go, this is the perfect scenario, I want to save all this money, and then get lost in that and just go down a rabbit hole for months on end.
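Editor's note: a rough way to see why reserving only 60 to 70% of the estate is a defensible shortcut is to model what happens if the estate later shrinks. The sketch below is a simplified assumption, not an AWS pricing model: it treats the reservation as a fixed commitment at a flat discount (30% here, echoing the figure mentioned above) and everything else as on-demand. All names and numbers are illustrative.

```python
def net_monthly_saving(on_demand_cost: float, reserved_fraction: float,
                       discount: float, usage_fraction: float = 1.0) -> float:
    """Saving versus running everything on demand.

    on_demand_cost:    monthly cost of today's estate at on-demand prices
    reserved_fraction: share of today's estate covered by reservations
    discount:          discount on the reserved share (0.30 means 30% cheaper)
    usage_fraction:    share of today's estate actually still running later
    """
    # The reservation is paid for whether or not it is used.
    reserved_cost = on_demand_cost * reserved_fraction * (1.0 - discount)
    # Usage beyond the reserved share is billed on demand.
    uncovered = max(usage_fraction - reserved_fraction, 0.0)
    total_cost = reserved_cost + on_demand_cost * uncovered
    baseline = on_demand_cost * usage_fraction
    return baseline - total_cost


if __name__ == "__main__":
    estate = 100_000.0  # made-up monthly on-demand cost
    for reserved in (0.65, 1.0):
        for usage in (1.0, 0.6):
            saving = net_monthly_saving(estate, reserved, 0.30, usage)
            print(f"reserved {reserved:.0%}, usage {usage:.0%}: saving {saving:,.0f}")
```

Under these assumptions, full coverage saves more if nothing changes, but turns into a loss once the estate shrinks well below the reserved amount, whereas partial coverage keeps most of the saving with less downside.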

SteveO (14:19.95)
Yeah, pragmatism is such an important point in life. I've never quite thought of it this way. You've kind of lit some light bulbs here, but I always used to say to people, just do something. You know, that's the first step. And, you know, me and Frank spent a lot of time on commitments. When we worked together, that was kind of what we did. Yeah. But there is, you know, the best value return is still generally

Frank (14:36.342)
Well, that's what I do for a living.

SteveO (14:45.622)
reserved instances on some things. We're big fans of convertibles, and doing convertible strategies for the more wavy stuff, because you can split, squash and stretch. That's terminology that people probably aren't going to understand. That's just what we say. But maybe one day we'll do an episode explaining that.

Frank (15:00.118)
Exactly, we'll explain that bit. But yes, I find it very interesting. I think it was some weeks ago, we had a chat on the summary of 2023 with Luca Paratore. And one of the things that came out was, oh, some companies are doing repatriation, which is moving back to their data centers. So in your case,

SteveO (15:05.902)
It's a good one.

SteveO (15:15.47)
Mm-hmm.

Frank (15:28.118)
Can you say today that it is way, way, way better than it was before moving to the cloud?

Dimi (15:34.372)
Yeah, well, funnily enough, I'm actually in the Amazon offices today doing a workshop with wider government, and I've just got out of a session about what good cloud looks like. Trying to define that is really hard, because there are so many different metrics to look at. One is that we can do things faster. Of the systems we've delivered, one of the fastest went from idea to production in five days. If you're in a data center, that would not have happened. I've never done the maths on things. Sometimes our finance teams can be maybe a bit obsessed with what the cost was of just servers on-prem versus the cloud. But it's a really complicated metric, particularly once you start looking at managed services, like RDS. You can't just look at the cost of a server for that, because we no longer have DBAs doing backups and failovers and stuff like that. It's basically all built into the cost. And we know Amazon can do it, and probably do it better, because they've got all their specialists doing it as well. We still have in government some very specific use cases that will be run in data centers for a while.

But everything we can move over is pretty much moving over. And that's why government's got the cloud-first strategy: just use it as much as possible to get the benefits of it. But I'm now working with the wider Cabinet Office to try and work out what good looks like, so we can start trying to promote that more publicly, because there are some great examples in the public sector. As well as trying to measure other departments so we can go, well, look, this is good, you haven't just done your lift and shift, but have you thought about doing all these other things that could potentially make it even more cost effective?

Frank (17:04.982)
Yeah, and I think you touched on the right point, which is lift and shift. You made the move thinking about and leveraging cloud from day one. And you've kept what needed to stay in the data center, for now, in a data center. The main thing we're seeing is, yeah, usually people will move everything, but they'll do a lift and shift because it needs to be done before the end of the lease, which is, by the way, in three months' time.

SteveO (17:14.894)
what it's for.

SteveO (17:30.358)
Yeah.

Dimi (17:31.464)
Thank you.

SteveO (17:33.198)
With the pragmatism you've now learned, and you were saying you kind of got that bill shock halfway through and then suddenly had to start thinking about finance and FinOps: is there anything you wish you'd done sooner? You know, when you first started that journey, with all the knowledge you've gained now, is there anything you'd have done differently? Like, what would be your first steps now when you first got that, oh no, we need to do something about our costs?

Dimi (17:58.86)
Yeah, I think some of the advanced things I'd have stayed away from: Spot instances and things like reserved instances are hard to know what to even buy into. I think knowledge is key, right? And I'm still finding that now: you can always get a bit more knowledge out there to everybody who's considering it. So at the time, once we started developing our strategy, we put together a little internal pack, which is basically just taking some of the good examples, but bringing them to life with some of our specific services.

And then we'd do a lunch and learn once a month, get all the new people along, just an hour. And everyone goes away from that; they're not experts, but they're suddenly thinking about it all the time. So pick that up early, because then you just start getting everyone involved in it. If they were starting that cloud journey and everyone's thinking about it, then maybe they start thinking, you know what, maybe I do use the smallest instance type possible, rather than just having to estimate, I think it's about this, and then suddenly finding out we're massively oversized from there.

Frank (18:34.742)
Yep.

Dimi (18:55.684)
So I think that's one of the things. The other thing is just trying to make use of services, whether it's standard things like scaling up and scaling down with immutable instances, the whole pets versus cattle thing, or making use of RDS instances rather than building your own database, those kinds of things. So really think about making use of those higher-level services.

SteveO (19:18.626)
Fantastic. Yeah. So education, and using cloud for what cloud's for, are really the two key things. And you can't do FinOps on your own, right? You can't bring costs down as a one-man army. So the education piece is exactly right. You need those people to go away into their parts of the organization and be thinking for themselves about how they do it. How do they think with cost in mind more? I think in the public sector, it's an interesting one, because in some organizations, people feel very removed from the money.

Frank (19:22.774)
Yeah.

SteveO (19:46.382)
I worked with a big insurer, for instance, and actually the people spending the money didn't feel like it was their money and didn't really care. The people who looked after the money were over there. In the public sector, you do have a slight benefit in that it kind of is your money, right? It's your tax money. So actually getting people to take action might be a bit easier, with people a bit more conscious of it. It's a bit of an advantage.

Dimi (20:08.676)
Yeah, definitely. And there's always a continued presence of, is this good value for money to the taxpayer? One of the great examples of that is someone said to me once, if you go away and work out how much tax the average taxpayer pays, and then relate everything to how many people's worth of tax this is going to cost us, it really brings it to life. And sometimes you go, this was 10,000 people's worth of tax. Do we really want to do this? Is there a better way of doing this? Maybe we should consider that. If it's...

Frank (20:30.472)
Yep.

Dimi (20:37.186)
one person's worth of tax, it's still important, but the overall impact of it isn't as much.
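Editor's note: the "taxpayers' worth" unit metric described here is a single division, sketched below. The average tax figure is a placeholder chosen to make the arithmetic easy, not a real statistic, and the function name is hypothetical.

```python
def taxpayers_worth(cost: float, average_tax_per_taxpayer: float) -> float:
    """Express a cost as the number of average taxpayers' annual tax it consumes."""
    if average_tax_per_taxpayer <= 0:
        raise ValueError("average_tax_per_taxpayer must be positive")
    return cost / average_tax_per_taxpayer


if __name__ == "__main__":
    AVERAGE_TAX = 5_000.0  # placeholder value, NOT an official figure
    print(taxpayers_worth(50_000_000.0, AVERAGE_TAX))  # 10000.0 taxpayers' worth
    print(taxpayers_worth(5_000.0, AVERAGE_TAX))       # 1.0 taxpayer's worth
```

The same function works for savings as well as costs, which is how the metric is used later in the conversation when teams write up what they have saved.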

SteveO (20:43.022)
I mean, what an interesting form of unit economics. That's exactly what it is, right? Absolutely fantastic. Yeah. How many customers, how many taxpayers, how many taxpayers' taxes is each? Yeah, I think that's fantastic. I've never thought of that before, and I do quite a bit in the public sector. That is definitely stolen for my repertoire. Should we talk a little bit about gamification? You briefly mentioned earlier how you started, but do you want to tell us in a bit more depth, kind of, uh,

Dimi (20:46.02)
Yeah.

Frank (20:46.198)
Yes.

Frank (20:59.67)
Ha ha ha.

SteveO (21:12.)
what happened next, or what you learned that did or didn't go well with showing people what was left on at night, and how it moved on?

Dimi (21:20.26)
Yeah, so initially we were just like, how do we get things shut down at night time? So as we said earlier, we ran that script at two in the morning, posted it in there, saw who's top of the charts, the top few people, and got someone to post in the morning and tag the right people from those teams: why was it on? Can you shut it down tomorrow night? And you get the right people in there, they start shutting it down. And then slowly everyone tries to stay off that list. And we're all like, yeah, this is gamification. People don't want to be on it. It's really encouraging them to not be on the list.

But what we soon realized is that it also kind of creates this sense of fear, of like, I don't want to be on that list, and negativity around it, because we're using negative gamification, right? It's, you don't want to be on the list, rather than, people want to be on the list because there is something they're proud of and want to show off. So although it worked, I don't think it was necessarily the right way of doing it. So we tried to...

SteveO (21:56.544)
negativity.

Dimi (22:13.956)
spin it on its head a little bit, as we got better senior engagement, to try and show off all the good examples. And actually our big immigration data platform team have really continued along this, because they started embedding the FinOps practices in all the day-to-day work they were doing. Every so often they just publish an internal blog post: in the last three months we have done XYZ and we've saved, I don't know, a million pounds here or a thousand pounds there, whatever it might be. It doesn't have to be a huge amount; it just shows their kind of thinking. But they kept shouting about it, and they've continued that culture, and suddenly you find that actually encourages more people, rather than having that fear of, I just don't want to be on this list. It's actually, I want to be part of that group, I want to join them and see what they're doing. I want to think, oh, can I use that example? And it then creates a more collaborative culture, as opposed to the, I'm just going to stay away from this because I don't want to be in trouble.

Frank (23:07.092)
There is also the...

SteveO (23:07.342)
I mean, the size of the operation is massive, isn't it? There's a lot of people using cloud in the Home Office. So you can't have all those people collaborating all together. So, as you say, it's finding something that gets people talking. There could be great practices over here that could save a shed ton over there, but if they never speak, they're never going to know, right?

Dimi (23:27.396)
Yeah, and when you find people wanting to be proud of it, they will start doing things like blog posts. It's quite hard to get someone to write a blog post about how they stayed off the naughty list. It's quite easy, actually, when you say, can you just write up a few paragraphs on that work you did and how much money you saved? Because again, if you tie it back to the whole, how many taxpayers' worth of money you saved, people want to show that off and show that it's good value for money. Whether they're permanent staff or some of the suppliers we use as well, they all actually want a good news story that they've done something great.

I'm also trying to do more public things like this, to get some of that news out in public, because I think it's great for the public to hear. We hear too many negative news stories, but actually, yes, we didn't do the cloud journey perfectly, but we're now doing things to really save taxpayer money efficiently, and we can then reuse that money for something else and provide better services to citizens and so on.

Frank (24:18.742)
And one of my questions is about how you selected it. To not be on that list, you need to not have a good reason to have stuff turned on at 2am. Were there teams that were on the list, but, yeah, it's because we need that server up and running at 2am, because that's when the backup runs or the stress test happens?

Dimi (24:44.068)
Yeah, actually, they're both examples, actually, both examples. So sometimes when we're performance testing, if we're doing another 24 -hour soak test, performance test on something, something that was on, and then people, because they were kind of scared of being out there, they were suddenly like, oh, am I allowed to do this? Do I need to ask permission to do this? And I like, oh, part of the point of cloud is you just want to empower the team to go and do it. So go and use it if you do it. Don't mind if you do it, as long as it's being used and not being wasted. That's the whole point. No, don't use it. And similarly, we later found out that some of the,

keeping things shut down in non -prod, if there are things that run overnight, then you might not catch them unless you do the right kind of coverage testing. And it was only when one of our kind of worldwide systems that gets used all around the world went live. And then later on that week, it was doing a backup and it got a speaking activity on the database at two in the morning. And then someone was like, why didn't we pick this up? And I was like, well, no, we've, well, cause it was shut down. And then suddenly it kind of suddenly,

pushed everyone in the opposite direction. Oh, was this really worth it? Was it worth the outage to have saved all this money by not being on overnight? And suddenly, everyone was kind of panicking, going, should we just scrap all that and just spend more money? You just need to understand that, right? If you understand your system, then you should understand your testing scenarios and test through those scenarios as well. But yeah, I hadn't really thought about it

at the time, and that kind of really brought it to life. And backups are the typical one, usually at like two o'clock in the morning or something like that. And you don't realize what impact it might be having on your system.
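Editor's sketch: the overnight-backup surprise described above comes down to checking scheduled jobs against the non-prod shutdown window before trusting the test coverage. The following is a minimal illustration, not the Home Office's tooling; the job names, times, and the 20:00 to 06:00 window are all assumptions made up for the example.

```python
from datetime import time


def in_window(t, start, end):
    """Return True if t falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def jobs_missed_by_shutdown(jobs, shutdown_start, shutdown_end):
    """List scheduled jobs that would never run in non-prod because it is shut down."""
    return [
        name
        for name, run_at in jobs.items()
        if in_window(run_at, shutdown_start, shutdown_end)
    ]


# Hypothetical schedule: a 2am backup and a mid-morning report.
jobs = {"database_backup": time(2, 0), "daily_report": time(9, 30)}

# Hypothetical non-prod shutdown window: 20:00 to 06:00.
missed = jobs_missed_by_shutdown(jobs, time(20, 0), time(6, 0))
print(missed)
```

Any job this flags is a candidate for a deliberate exception, for example keeping the environment up for one overnight test run before go-live.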

Frank (26:14.486)
Yes.

Frank (26:18.966)
Yes. You remind me, there was quite recently some sort of a quote from someone like a CFO who received the bill: we're going to set up a full DR system, blah, blah. And look, that costs more than an outage. Just no, I prefer the outage.

SteveO (26:21.422)
It's an interesting one.

Dimi (26:33.444)
Okay.

SteveO (26:34.094)
Yeah, you can. I think I've told this story before. A company I worked for acquired a DR company. And I kind of said to them, how often have you had to invoke DR for customers? Never. And it is an insurance policy, right? And hey, when you need your insurance, you want it there, but it can be a very expensive thing to do. You need to really balance the pros and cons. Oh, did it.

Frank (26:58.934)
I had to do two DRs and I can tell you it was not fun at all.

SteveO (27:02.67)
It's going back. That's the problem, isn't it? That's often the biggest problem, going back. I think in the cloud, that's actually easier, but it used to be going data center to data center. It was going back to your primary that was really challenging. Everything was set up to go the one way.

Frank (27:16.47)
Yeah, I think we had a news item, it was last month, for the first time, I think it was Aurora. You can now go back in an incremental way, because in the past it was the same. The point is that most of the technologies that used to be used for DR, et cetera, are still used by AWS. The fact is they're using them correctly. They're using them nicely. They're doing the testing, et cetera. I'm pretty sure there were other companies that were doing it right, that had a good DR system up and running and probably still have.

But it is a complicated thing. Always. Because changing regions, the thing is that it's very hard to have a full region going down, and you have availability zones and all of that, but when something is not working on AWS, you hear it in the next five minutes on all the tweets of the world. And usually they might take hours or days at least

to bring things back up, and lots of services are down. For however much DR there is, it's still hard to get a good DR. In the cloud, it's much simpler. Doesn't mean it's simple.

Dimi (28:29.476)
Yeah, I would tell you the complexity of it. It feels extra complex if you've been in the position of it's gone down, I don't know what we're doing. And it feels really, really complicated. In a previous life and a completely different job, we had two data centers that we built and we had never tested DR and everyone was always scared of it and thought it was so complicated because they'd never tested it. Actually, maybe they tested it once right when they first built it and had never tested it again. It was like four years I'd never tested it.

SteveO (28:54.86)
Yeah.

Dimi (28:55.178)
Too often we forget that in the cloud, if you're following good practice, then you're usually across multiple data centers anyway. So you're as good as you ever were before because you're on two, three plus data centers. I've never done much in the multi-region space, or if some people are looking at multi-cloud, that kind of thing. I know that is more complicated, but we've already got multi-data centers almost for free if you're using cloud services in the right way.

Frank (29:02.966)
Yes.

SteveO (29:20.078)
Yeah, yeah. Yeah. ASGs are spanning, you know, availability zones for you, right? I think what's really refreshing for me listening to this is, quite often, there's a saying around, I can't remember what it is, that the private sector is always leading and the public sector follows. But, you know, I do a lot of work in both, but probably more of my background is in private. But the things you guys are doing, I,

Frank (29:48.022)
Yeah, brilliant. Yes.

SteveO (29:48.786)
they're streets ahead of a lot of things that are going on in quite forward-thinking organizations. I mean, I'm not going to say any names, but we've done some, right? Very technologically savvy, forward-thinking organizations, and nowhere near the level that you guys are thinking about and doing, and even the pragmatism around, within a relatively short time, making commitment decisions and then kind of going, right, and that piece we will review then, and we'll have this iterative process.

Frank (29:58.742)
Yes.

Frank (30:02.238)
Yep.

SteveO (30:18.734)
It's staggering how many people are only starting to get there. And obviously there's a few, you know, you speak to Natalie Daley at HSBC and, you know, she's a bit of a FinOps god, isn't she? So, you know, fair play to her. They're doing really advanced things, but actually,

the next tier down is a real mix of organizations. And I think you guys are in there in terms of the things you're doing. You know, you're heavily automated. You're very much using data-driven decision-making. You're always looking at the technologies, not just the cost, you know, is this the right technology for the job, and these kinds of things. And that's really refreshing to hear. And like you say, as a taxpayer.

Frank (31:02.902)
Yes.

SteveO (31:02.99)
That's the kind of thing that, as a taxpayer who's in this space, is really nice to hear, because I think, certainly in the cloud space, money is being cared for. For taxpayers, I think that's a really nice thing for people to hear is happening, you know, that's great.

Frank (31:08.886)
Yes.

Frank (31:18.902)
Well, it's, yeah, I found it also. Sorry. Yeah, go. I was just saying, it's really refreshing. It's really positive also. And it also goes into this debate there still is at the moment, as we were discussing: what is good use of cloud? How do you use the cloud? When is it worth using the cloud, and all this kind of stuff? It's very refreshing to hear you saying, look, we did it that way. We had a surprise.

Dimi (31:19.626)
Yeah, yeah, and I think we're... No, go on, Frank. Go on.

Frank (31:48.534)
We acted on it. We were allowed to act on it because it was doable. People were able to act and reduce their costs and organize and take ownership of things, and it all worked in the version you spoke about. And I'm sure there were some teething problems in between and some other fights, because nothing is ever that wonderful, but in this case, it is a really good positive thing from the cloud perspective of saying yes.

SteveO (32:06.19)
Yeah.

Dimi (32:07.274)
Mm-hmm.

Frank (32:17.654)
You might have surprises, but you can solve them. You can act on them and it's going to be impactful. Whatever you're doing, the small step you take will have an impact. It's going to be visible and it can be done. I love this.

Dimi (32:36.042)
Yeah, and I think my next big step is really working out where we re-architect things into potentially more serverless, managed services, which we have in some spaces. And I have to give credit where it's due. So when we first started the FinOps work, I was working quite closely with a colleague, a contractor, Andy Esley, who's now moved to another consultancy, Cloudscaler. But we came up with this idea at the time around kind of an efficiency chart, like your home buyer's report where you see how efficient your house is, or you know, when you buy an electronic good and it's like,

SteveO (32:36.558)
Great.

Dimi (33:03.914)
somewhere between an A and a G, that kind of stuff. And we kind of had this vision, I'm not sure it'll ever be built, again, maybe it's a business case, but it would be worthwhile building it. That if you have an application, you know that it's an F rating because it's not doing any optimization. Under its current architecture, if you get the scaling right and you right-size it and reserve it, maybe you can move all the way up to a C or a B or something like that. And that kind of shows you what the upper limit is for the current application architecture. And then,

Frank (33:04.182)
Yeah.

Dimi (33:32.938)
And then it would be an easier way to visualize and help drive the conversation. Well, why don't we go up to B? Or if you're really close to B, people are going, well, why can't you be an A? Because actually that would mean re-architecting the whole application into something different, which would be a much bigger business case. At least you know, and then you can actually try and measure it against whether that would be worthwhile or not. And I think we're having more conversations now at that top end, where we've got things more efficient, and actually, should we be exploring more serverless, and the various pros and cons that come with that.
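Editor's sketch: the energy-label style efficiency chart described above could be modeled as a current rating plus a ceiling for the existing architecture. This is only an illustration of the idea; the band thresholds, function names, and example figures are hypothetical and were not given in the conversation.

```python
# Hypothetical bands: share of monthly spend that is avoidable -> letter rating.
BANDS = [(0.05, "A"), (0.10, "B"), (0.20, "C"), (0.30, "D"), (0.40, "E"), (0.50, "F")]


def rating(waste_fraction):
    """Map the avoidable share of spend to a letter from A (best) to G (worst)."""
    for limit, letter in BANDS:
        if waste_fraction <= limit:
            return letter
    return "G"


def efficiency_report(monthly_cost, tactical_savings, rearchitecture_savings):
    """Return (current rating, ceiling rating for the current architecture).

    tactical_savings: scaling, right-sizing and reservations, no redesign needed.
    rearchitecture_savings: only reachable by redesigning the application.
    """
    current = rating((tactical_savings + rearchitecture_savings) / monthly_cost)
    ceiling = rating(rearchitecture_savings / monthly_cost)
    return current, ceiling


# Made-up example: 10,000 a month, 3,500 of easy savings, 1,000 more behind a redesign.
current, ceiling = efficiency_report(10_000, 3_500, 1_000)
print(f"current rating: {current}, best possible without re-architecting: {ceiling}")
```

With these made-up numbers the application sits at F today and tops out at B on its current architecture, so reaching an A becomes a separate, larger business case.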

SteveO (33:44.782)
At least you know.

SteveO (34:02.19)
So Nationwide in the US, so not the Nationwide that we automatically think of in the UK, they gave a fantastic talk at FinOps X. And it's where Joe Daly used to work, he's a community head. And it's two brilliant ladies who were up on the screen. And I actually, in my blog article post-FinOps X, I named them, and I forget, sorry, ladies, because you were also very kind to me in reviewing it and letting me publish it, but I've forgotten your names off the top of my head, but they were brilliant. It's one of the best talks at FinOps X.

And they had a traffic light system. The COIN system was a cost optimization index score. And it basically looked at the ratio, or the division, of cost of application over value of low-hanging-fruit optimizations. And it would give you a score, and that score would be, you know, red, amber, green.

And that was kind of the first stage of what you're talking about. Very much a case of where do we need to focus? Where do we know that actually straight away there are easy optimizations? And that's a great place to start. I think in a lot of your world, you're already kind of past that, but it's still a good, easy way of doing the scoring system. Like I say, you're going on to the next piece, and it is a bit of work, where you're actually then thinking about architecture. You're thinking about where spot can come in and these kinds of pieces, but at some point

people are going to have to get to that level of cloud inventory, or kind of cloud workload, rather than the actual infrastructure: what actually is this thing delivering, and the things that comprise this delivery, this output, how much further can that be driven? That's the only way, because like you said, in year one, there's loads of things you can do. You know, buy a load of RIs, go do your right-sizing. You can save 30%. In year two, people are still looking for you to save 30% because you did it in year one, but you're lucky to save five.
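Editor's sketch: the cost optimization index score described above, application cost divided by the value of low-hanging-fruit optimizations and mapped to red, amber or green, might look like the following. The thresholds and function names are assumptions for illustration; the talk's actual formula details and cut-offs are not given in this conversation.

```python
def coin_score(app_cost, easy_savings):
    """Ratio of application cost to the value of easy optimizations.

    A higher score means less low-hanging fruit relative to what the app costs.
    """
    if easy_savings <= 0:
        return float("inf")  # nothing easy left to optimize
    return app_cost / easy_savings


def traffic_light(score, red_below=5.0, green_from=20.0):
    """Map a score to red/amber/green using hypothetical thresholds."""
    if score < red_below:
        return "red"
    if score < green_from:
        return "amber"
    return "green"


# Made-up portfolio: (application, monthly cost, value of easy optimizations).
portfolio = [
    ("case-management", 12_000, 4_000),
    ("public-portal", 12_000, 1_000),
    ("reporting", 12_000, 300),
]

lights = {name: traffic_light(coin_score(cost, savings)) for name, cost, savings in portfolio}
for name, light in lights.items():
    print(f"{name}: {light}")
```

Sorting the red applications by the absolute value of their easy savings would then give a first-pass focus list.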

Frank (35:46.454)
Yeah.

Frank (35:53.366)
Yep. Yep.

SteveO (35:54.958)
Right. Because you're having to find much smaller screws, which are far more fiddly, and keep doing that. That's the only way you can keep making benefit and progress. And Natalie Daley spoke about this a little bit. If you stop and carry on doing what you're doing and just stay there, suddenly you're not running anymore in the FinOps space. And you've got to be looking at the next thing because you're constantly being asked for more. And it's a real, real balancing act. I think you guys are coming onto that, and your system, while it would be a bunch of work,

It's probably how you stay ahead of the game.

Dimi (36:26.762)
Yeah, and I think to pick up your point about trying to save money all the time, actually what I've seen over my last few years in different roles is that we don't very often end up actually dropping our bill down from what it is. We have a flatlining, or just an increase of maybe a few digits per year, because even though we've made these massive savings, and we say, oh yeah, we saved 30% here and 80% off that, our usage is growing all the time. We've got new services, we've got new systems rolling out,

things we didn't think of before. This is why I actually compare it. We sit down with our finance team sometimes every year, every three years and things like that. And I thought, of course we can go down. It's like, well, we've made much more efficient use of it, but we've also rolled out at least 10 new applications. And this one went from taking a million applications a year to 10 million applications a year. So there's so much more going on now. We've had to scale all the components up. There's quite a challenging conversation there to try and articulate that overall spend to non-technical stakeholders. I think we've really got there with a...

SteveO (37:07.906)
Yeah.

Dimi (37:24.328)
We can educate techies now, rolling out cloud for the developers or rolling out infrastructure. We've got a good grip on how that needs to be done. We just need to keep turning the handle on that. But we're now starting to learn: how do we talk to our financial colleagues, our commercial colleagues? How do we talk to senior management, particularly senior managers who have been there for a long time, or our non-technical senior management all the way up the chain as well, to try and articulate some of these concerns in a way they're going to understand?

People who are at the top of the shop, they're not going to care about really small details about small instances. They care about how much we are spending on cloud and how we minimize that as much as possible.

Frank (38:01.878)
Look, I think, yeah, but that's the point: you gave a brilliant unit before, which was how much money, how many taxpayers did we save? And at the same time, you can use it back to say, and how many taxpayers did we serve? And so you start saying we serve 10 million people all of a sudden, we serve 10 million people using 1,000 taxpayers' money, whatever the number is, I'm just throwing numbers there.

SteveO (38:02.03)
Yeah, it's got to go back to the units, right? And metrics.

SteveO (38:09.454)
tax payers.

SteveO (38:16.782)
Yeah, with this service and the cost of a taxpayer.

SteveO (38:23.95)
for the same amount of money.

SteveO (38:28.844)
Oh yeah. Yeah.

Frank (38:31.19)
But yeah, I think...

Dimi (38:31.562)
Yeah, as we move away from our programmes of work to a sort of product-centric model, we're trying to push all our product managers to have clear value statements against it, measured against costs. And some of that is the cloud costs, some of that is all the other associated costs they might have for their service. It's usually that kind of initial curve, right? Where you need to spend some money to start developing the service. And then if you roll it out in the typical kind of government structure of private beta, public beta and so on, then...

The cost might seem quite high per user, but as you really get through production, you can continue to use those metrics to say, this has served X number of millions of people.
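Editor's sketch: the unit metric discussed above, cost per person served as a service moves from private beta to production, can be worked through with a few lines. All phase names, costs, and user counts below are invented for illustration and are not Home Office figures.

```python
def cost_per_person(total_cost, people_served):
    """Unit cost: total spend divided by the number of people served."""
    if people_served <= 0:
        raise ValueError("people_served must be positive")
    return total_cost / people_served


# Hypothetical phases: (phase, cumulative cost, people served in that phase).
phases = [
    ("private beta", 200_000, 1_000),
    ("public beta", 400_000, 100_000),
    ("live", 1_000_000, 10_000_000),
]

unit_costs = {phase: cost_per_person(cost, served) for phase, cost, served in phases}
for phase, unit_cost in unit_costs.items():
    print(f"{phase}: {unit_cost:.2f} per person served")
```

Total spend goes up five-fold across the phases in this example while the cost per person falls, which is the story a flat or rising cloud bill cannot tell on its own.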

SteveO (39:10.158)
Fantastic. I'm just looking at the time, and you've been very kind to give us some of your Friday afternoon. And I think it would be really great to maybe get you on again in a year and hear how you've attacked this challenge, because the challenges you're having, people are either having now or are going to be having in the next few years, right? And it's really great hands-on experience. So I just want to say a big thank you from me. I've learned a lot and I'm also very reassured

Frank (39:21.684)
Yep.

SteveO (39:39.746)
about my tax money. And just any final thoughts or statements you maybe want to share?

Dimi (39:40.65)
Good.

Dimi (39:48.074)
I think just to reiterate what I said earlier about knowledge, knowledge is king. So keep getting the message out. I remember before, I had to get approval to do this from our comms team, and there's always been concern about talking publicly about what government's doing. And then when I explained to them that FinOps was about making more cost-efficient use of cloud, they were like, yeah, talk about it. It is great, isn't it? This is a great use of it. So knowledge is key across, and it's not just the Decom community, it's about everybody out there, about what FinOps is,

how to do it.

SteveO (40:19.598)
Thank you for sharing your knowledge. That's been fantastic. And listeners, thank you for listening. That's a bye from me.

Frank (40:20.982)
Thank you, yes.

Frank (40:26.39)
Bye from me too.

Dimi (40:28.17)
It's happening.

Frank Contrepois

Host

Stephen Old (SteveO)

Host