The Ruby AI Podcast

Evaluating LLMs with Leva

Valentino Stoll, Joe Leo Season 1 Episode 5

In this episode of the Ruby AI Podcast, host Valentino Stoll talks with special guest Kieran, a prominent figure in the Ruby AI space. Kieran recently gave a talk at the San Francisco Ruby Meetup about his new gem, Leva, which focuses on LLM evaluations in Ruby. Kieran discusses his background, his passion for AI and Ruby, as well as his journey in building AI products, including his tool Cora, which helps manage email inboxes by categorizing and summarizing emails using AI. Together, Valentino and Kieran explore the process, challenges, and best practices of creating AI-driven gems and tools in Ruby, the importance of evaluations, and the fun and creative aspects of integrating AI into Ruby on Rails projects.

Mentioned in the show:

  • Kieran Klaassen – Ruby developer, creator of Cora and Leva.
  • Leva gem – Kieran's LLM evaluation framework for Rails.
  • Jumpstart Pro – “is the best Ruby on Rails SaaS template out there”.
  • Stepper / Stepper Motor (workflow engine) – a “journey” with steps for background jobs.
  • Jaccard Index – A metric for set similarity (|A∩B|/|A∪B|).
  • LangSmith – a platform for building production-grade LLM applications.
  • Morph LLM – The Fastest Way to Apply AI Edits (4500+ tokens/sec).
  • Friday AI Agent – An AI-powered coding agent that handles PRs from start to finish.
  • DSPy.rb – Framework for building AI agents and optimizing prompts.

Highlights:

00:00 Introduction and Guest Welcome

00:53 Kieran's Background and AI Journey

01:20 Building AI Tools and the Leva Gem

03:47 Challenges and Best Practices in AI Development

07:16 Evaluations and Real-World Applications

07:36 Community Recognition and Adoption

12:37 Prompt Engineering and Model Testing

22:06 Leveraging AI for Workflow Optimization

28:35 Visualizing Workflows and Tools

31:44 Exploring Hybrid Orchestration Layers

33:15 Debating Deterministic Workflows vs. Agent Flows

34:28 The Fun of Experimenting with AI and Ruby

34:55 Building Gems and Learning Through Creation

40:03 The Value of Rails in AI Development

46:28 Evaluating AI Outputs and Metrics

50:40 Annotation and Continuous Improvement

53:50 Future of AI and Rails Integration

54:54 Closing Thoughts and Recommendations


Hey everybody, welcome to another episode of the Ruby AI Podcast. I'm your host today, Valentino Stoll, and we have a very special guest today, Kieran. We had you on, Kieran, to talk about Leva, but you're also very prominent in the Ruby AI space as well. I was not aware of this until I saw your talk. You are outspoken about a number of AI development best practices, even. I've even taken on your Git worktree workflow, and it works really well. But yeah, why don't you introduce yourself? Tell everybody where you started with this and what you're up to.

Thank you for the invite. Yeah, very glad to be on here. I'm very passionate about Rails. I'm very passionate about AI. And I've been building AI products using Rails for the last four years now. I've seen the ugly and the hard parts and would love to share a little bit of what I've learned over the years. Leva is one of the tools that came out of it. I'm building Cora Computer, which is a tool to hand off your inbox, your Gmail inbox. Cora will manage the emails that are not super important to respond to immediately, brief them twice a day and summarize them. And everything that is important, which is maybe 10 or 20% of the emails you receive, will stay in your inbox, and you will feel great because you look in your inbox and it's emptier. So I'm building this tool, and obviously you need to run evals to see whether what you do is working. In this case, the first thing was categorization, and you start to look around and, yeah, there is nothing for Rails. You just do a gem search and it's like, okay, there has to be something. Andrew Kane probably has to have made something, or someone has to have made something, and it didn't really exist. And I did Python in the past as well. There are great tools for Python. I used them and I was like, well, it's not too hard, so why don't I just build this? Many of the tools I've built for myself are just needs: I have a few gems that I just wanted and no one else made them. Most of the time it's finding a need to solve for, and I just do it. And the beauty with AI is also that it enables me to do this in a week instead of a month. Before, I've done gems and they're really small, but it takes a lot of time. And if you have an idea and a problem, now you can either build it in your own app, or you say, well, maybe I spend one day extra, make the nicer abstractions and just create a gem. So yeah, that's kind of how it came to be. And I would love to talk about any obstacles or things I solved building with AI, because it's really fun.

I love seeing this. Honestly, I've built almost exactly the same thing as this internally before. It's great to see this out in public and, honestly, really well designed. So I'm curious what your opinion is of the gem creation process with a lot of this AI stuff, because personally, I've had a good experience.
I'm curious where you find it works really well and how it's freeing up time for you in that way.

It's a good question, because you can build internal things, and normally, this is how I used to do it before AI: I always did it internally, and maybe I had a lib file or some abstraction already internally in the app, and then, oh, I will move this to a gem later. And then you never move it to a gem later, because it works. There's no reason to move it, and there's just a little bit of friction to do it, because suddenly you work with, say, two places, and you need to test in a test app, a dummy app, you have all of that. So what I've learned is: if you decide to make a gem, just make a gem. Because if you start out with a dummy app and everything already, it's actually easier to go than abstracting something out and having to do all of that. Because when you start out with an idea, you have energy, you're passionate and motivated to do it. And after that, why are you motivated to create a gem? For me, creating the gem is really: I want to reuse this in multiple projects. I do multiple other side experiments because I just want to see how to build and what works, and having a gem is nice because I can just add it and it works.

So how the Leva gem particularly came about: it's heavily inspired by other libraries. First of all, I dream of what my ideal readme is. So I start with readmes. I always write my readme first and think, how would it look? And then I just take a walk and think about: did I miss anything? What is the API? Did I miss use cases? And I refine this readme until I'm like, okay, this is complete enough. It's about finding abstractions and finding an API that I like, and this is just taste. I build it for myself, so whatever I like a certain way, which for me needs to be modular, but it needs to bring actual solutions, and I want to use it with whatever I want. I don't want lock-in on a library, or lock-in on something like OpenRouter. I just want more of an adapter, where I can run whatever I want to run. So it's having this freedom, but it should solve real problems for me, which is: I want a UI, because I'm visual. I want a UI on production that can work with encrypted data. I was inspired by the Workbench from Anthropic there, where you have the prompt and the variables and you can kind of debug. In the end, all these things come together, and I just sketch out things, write a readme, and start building. The process of building back in the day was Sonnet 3.5 and Composer in Cursor, which was amazing. If you look at it now, you're like, it's so old-fashioned, but back in the day it was like, oh, it can create multiple files, wow.

All right, so another huge leap in advance.

Exactly. But in the end, evaluations are not very complicated. The complicated part is, how do you design your models so that you have the data and store the results? It's not very complicated. You just run something. Is it the same? Yes or no. And that's basically it. So that's where it went. And yeah, I just shared it on my GitHub. I never thought anyone really would look at it.
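(An aside on the adapter idea described above: a minimal plain-Ruby sketch of what a provider-adapter layer can look like. The class names here are illustrative, not Leva's actual API.)

```ruby
# Illustrative adapter layer so prompts are not locked in to one provider.
# All class names here are hypothetical, not Leva's actual API.
class ProviderAdapter
  # Every adapter answers to #complete(prompt) and returns plain text.
  def complete(prompt)
    raise NotImplementedError
  end
end

class EchoAdapter < ProviderAdapter
  # Stand-in adapter, handy in tests: it just returns the prompt unchanged.
  def complete(prompt)
    prompt
  end
end

class Categorizer
  def initialize(adapter:)
    @adapter = adapter # swap in an OpenAI/Anthropic/Gemini adapter without touching callers
  end

  def categorize(email_body)
    @adapter.complete("Classify this email as INBOX or BRIEF:\n#{email_body}")
  end
end

puts Categorizer.new(adapter: EchoAdapter.new).categorize("Invoice attached, due Friday")
```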
But at some point I got a text from Chris Oliver. He said, hey, you're in a talk, some talk at a conference in Brazil, I think, and Irina was talking about my gem. I had no idea. She had found it somewhere, and she was talking about AI and Ruby and how we should promote it as community work. And that's when people started to like it and use it, and it went from there. So the goal is really just to build something really useful for myself, and that's how it should start. That's really the philosophy. And then I'm very happy if other people can use it as well.

Yeah, I've been using it. We have this little app that manages the podcast, with all of the guests that we invite on. And so I've been using it just as a way to analyze the prompts and agents that we have for doing that intake, and yeah, it works really well. For the evaluations that I've done, you know, it's LangSmith that's like the staple everybody starts with. It's connected to everything.

Yeah, it's Python, it has an API, you could submit your requests there. And also with privacy and encrypted information, it's hard to do that. But yeah, absolutely, it's inspired by that.

Yeah, totally. And so looking at this, it's very similar in nature, right? Where you just have your experiments and then you can run your individual evaluations against different experimental data sets. It has a very similar feel. That's the problem with a lot of Rails apps, like you mentioned, the privacy and security, where you have all of your data in the Rails app, and how do you get it out and make use of it, right? To make sure that you're getting your snapshots and making sure you're consistently executing things, with all of these disparate systems and all of the changing results that you get with LLMs. So I'm curious, I imagine that you've used LangSmith before or some similar evaluation tool. How has your personal transition been since keeping it in your tech stack? Is it a lot better? Are you finding things that you're missing out on, where you now have extra work because you have to build it in Leva?

Right. So LangSmith, I love it, it's great, but for me, I work with encrypted data in a database, and I just cannot use it. I'm not allowed to use it because of privacy policies and security audits. It's just not possible to do. So is it better? Yeah, because I can use it instead of not use it. And is it the most amazing VC-funded evaluation framework that's super slick? No, that's not what it is. But think of it like Blazer from Andrew Kane, for example. Is it the best, most amazing business intelligence? No, but it just solves the problem. And it does it in a way that's not super complicated. It's pretty simple, so it's easy to learn, and you can always move to another service or get another pipeline set up to do evaluation. I'm sure you can do that. But the whole point is, I was the only developer for seven months and I just needed something. So it's very practical, and that's where it came from. Does it solve my problem? Yes. And the fun part is that it lives really close to the production data, so you can do really cool things with it. You can just go into your app, see something, and you're like, oh, that's wrong. And then I have a button that says "debug in Leva."
And then it just preloads the data into a data set, loads the correct prompts and the runners and everything, and I can just start jamming on the prompt. And I think reducing friction is very important, because if I had to first figure out the whole context, which is very hard because it all lives on production, and then copy-paste variables from here to there and go into another tool, if it was like that, I wouldn't have done it. So anything to reduce friction is very nice. That's the experimentation side in addition to the evaluation side. But yeah, I have many other ideas I want to add to Leva that I think are very useful, that I'm experimenting with now.

It's hard to show this because there is, like you mentioned, a UI to this, which is kind of critical to the whole process, being able to visualize what is happening, right? Before we dive into that, though, I loved your talk in San Francisco, where you basically made an evaluation to observe business metrics, on, what was it, AWS cost management or something like that. And that to me is maybe something that isn't talked about enough: what to evaluate from LLMs in general. Where do you start measuring things? What do you measure? What are your critical components when you're building something with a new prompt or a new agent or whatever it is that you're wrapping?

Yeah, that's a good question. You can do this with any framework, but how I evaluate is, first of all, you just create a prompt, you jam on it, and you see whether it works or not. Let's take an example of categorization. This is a simple example, but also what I do with emails. Is it important enough to stay in your inbox, or should we brief it? It's a very important decision. And you can start with a prompt: hey, if it's urgent, then put it in my inbox, and otherwise just brief it. That is the prompt. But how do you know that works? Well, you insert an example email and click run, and then it says yes or no, and you're like, oh, it works. But then you start collecting real-world emails. This is local, but if you deploy, suddenly you have emails that say "promotion, urgent, you have two days left," and suddenly the promotion is also urgent. And you're like, dang, okay, this prompt does not work. And the whole point is then adding that email to a data set and saying, oh yeah, but if it looks like a promotion that says urgent, maybe then it's a promotion. So that's refining the prompt, and then you run it, and then suddenly your two emails say, yep, it did it. It's really important to see your prompt work. How I think about it is, for Rails developers, test-driven development, or writing tests for your code, just ensures your code works. And you need to do the same thing for prompts. Actually, you need to do it even more, because prompts are not deterministic, so they can do whatever they want, and it's more important to add tests to them. Evals are just tests for prompts. So if there is logic embedded in the prompt, and most of the time there is some logic, sometimes it's summarization only, but a lot of the time there is some logic or decision-making, you want to see if it works. And it's very important to do that at a larger scale where you have more examples of actual real data. And you could even do that in development, where you already ask an LLM, hey, create synthetic data for all these things, what could go wrong? And really push for that.
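(As an aside, to make the "evals are just tests for prompts" idea concrete: a minimal plain-Ruby sketch of a categorization eval over a small labeled dataset. The classify method is a stand-in for a real LLM call.)

```ruby
# Minimal eval loop: run the prompt against labeled examples and report accuracy.
# `classify` is a placeholder for the real LLM call with your production prompt.
DATASET = [
  { email: "URGENT: contract needs your signature today",   expected: "inbox" },
  { email: "PROMOTION: 2 days left, urgent savings inside!", expected: "brief" },
  { email: "Weekly newsletter from your gym",                expected: "brief" }
]

def classify(email)
  # Stand-in logic; in practice this would call the model with the prompt under test.
  email.match?(/promotion|newsletter/i) ? "brief" : "inbox"
end

results = DATASET.map do |example|
  prediction = classify(example[:email])
  { **example, prediction: prediction, pass: prediction == example[:expected] }
end

accuracy = results.count { |r| r[:pass] }.fdiv(results.size)
puts "Accuracy: #{(accuracy * 100).round(1)}%"
results.reject { |r| r[:pass] }.each { |r| puts "FAIL: #{r[:email]}" }
```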
And you can already start with a little bit of a better prompt before you even go live. So that's kind of the framework, and that's one example. Another one that's very important: you have a prompt that works, that is on production, but Google launches a new model, or Anthropic launches a new model, and the new model is supposed to be way better. And how do you know? How do you know if it's better? If you have evals in place, you can just run it. You just say, okay, run the new model. How much better is it? 8%, 2%, 6%? And that's very useful. And the flip side is someone launches a way cheaper model, but it's also worse, maybe. How do you know? Well, same, you just run the model. And my example was, we used Haiku, I think Haiku 3.5, for categorization for a long time, but then Flash came out, Flash 2.0, and I just wanted to see, because Flash was 10 times cheaper. Way cheaper, and 10 times cheaper is a good amount of cheaper for millions of emails. So I just ran the test, the evaluation for my categorization: how much worse is Gemini? And Gemini was 6% worse, I think it was, and I was like, 6% worse for 10 times cheaper is great. That was like 150K per year saved by just making that switch. But it's very scary to make that decision and just YOLO it. You should not just switch the model and hope for the best. You need to know what you're doing. So it allows you to move faster with more confidence, similar to tests. If you have a passing CI/CD pipeline, you can ship faster. It's a very similar story for LLMs.

It's interesting you tested model changes, because in my mind, I feel like prompts are almost coupled to the models that they use in a lot of ways. The models are trained for specific formatting, and some handle some domains better than others. So I'm curious what your experimentation process is like for those aspects. Do you also couple those model changes with some prompt tuning, or do you like to keep those separate? What is your workflow and reasoning for deciding those aspects of it?

So first of all, I want to try just the same prompt, same everything, don't change anything, just a different model. How does it work differently? Because you learn something from it. It might even be that a more expensive model is worse than a cheaper model. For example, if you use a cheaper model, normally you need more direct prompting and repeating and silly "if you don't do this, you go to jail" kind of hoops. But the newer models are actually converging a lot, and it's less and less. So models are coming pretty close. They all do XML well, they all do JSON well, they all follow directions pretty well. And with Flash 2.5, for example, the difference between Flash 2.5 and Sonnet 4 in certain cases is not that big. Yeah, they are very different models, but if you just do categorization and you have a chain of thought in the prompt, they're not that different. They're not like the price difference. Like I said, Flash 2.5 was ten times cheaper, but the other is not 10 times better. So I always start with that, but I do improve prompts as well. And actually, I used to write prompts myself. I don't write prompts anymore. What I do is I have Opus rewrite my prompts. And how I do this is I use Leva as well.
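(An aside on the model-swap comparison just described: a sketch that runs the same dataset against two models and prints accuracy next to cost. The per-email prices and the model call are made-up placeholders, not real pricing or a real client.)

```ruby
# Sketch: compare two models on the same prompt and dataset, accuracy vs. cost.
# Prices and the model call below are illustrative placeholders only.
MODELS = {
  "haiku-3.5" => { cost_per_email: 0.0010 },
  "flash-2.0" => { cost_per_email: 0.0001 }
}

DATASET = [
  { email: "URGENT: server is down",          expected: "inbox" },
  { email: "PROMOTION: urgent, 2 days left!", expected: "brief" }
]

def fake_llm_call(model, email)
  # Placeholder for the real API call using the unchanged production prompt.
  email.match?(/urgent/i) && !email.match?(/promotion/i) ? "inbox" : "brief"
end

MODELS.each do |name, meta|
  correct  = DATASET.count { |ex| fake_llm_call(name, ex[:email]) == ex[:expected] }
  accuracy = correct.fdiv(DATASET.size)
  puts format("%-10s accuracy=%.0f%% est. cost/email=$%.4f", name, accuracy * 100, meta[:cost_per_email])
end
```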
So I just say to an agent in Claude Code: hey, this is the prompt, this is the input, this is the data set I want to run. Can you run it and look at all the failures and see why they fail? Analyze the chain of thought, because I always include the chain of thought since it's easier to debug, and can you rewrite the main prompt until it always runs correctly? Or run it 10 times and make sure that all 10 times I run this, it's actually giving the same output. Because sometimes you run it once and it's green and it passes, but the next time it's not. So it's very important to run it multiple times. So I don't write them as much anymore. I think AI is better at writing prompts now than humans. And using the Anthropic Console prompt improver, that's always how I start. I just start jamming on what the prompt should do, then put it in there. That's the baseline. Then I add real-world data and see if it actually works or not. There is still some craft to it where I do add things, but it's less and less. I do it less and less.

Yeah. In the early days, I definitely started out asking it to redo my prompts, with mild success, but it definitely has gotten better now. I agree with you. Have you tried any of the tools like some of these DSPy frameworks to do the prompt optimization based on your runs, or anything like that?

A little bit, yeah. A little bit. I just want to keep things simple, but I'm thinking about including a similar thing in Leva as well, because there is data and it makes sense to improve prompts automatically and track it. So it is a very natural place to put it in.

Yes. You mentioned using Claude Code. I'm always tempted to just hook the web app up to Claude Code and see what kind of cost that generates.

Yes, it's very tempting.

I know it's maybe not the most secure thing to do, but you do some kind of wild stuff with Claude Code. I'm curious, what have you found? You mentioned repeating runs and stuff like that. What do you find the most optimal use cases, specifically for Ruby and Rails development, while you're going and churning? Where do you see it giving you the most benefit?

It can do so many things. It's very powerful. It can do many things, yet it's also a tool that you need to learn. And really where I see the most power is at the intersection of writing prompts, testing prompts, writing the code, getting data from real runs, and iterating on that. That kind of work all together is insane. Obviously, it's great for research and creating issues in GitHub and then implementing those and adding tests and reviewing pull requests. It does all of those things as well. It's basically my intern, and the intern is pretty legit. But really the powerful part, if you're building, I don't know if everyone is building with AI or for AI or both, I think it's really special, what I call compound engineering, or compounding engineering, where you have a problem and you don't solve the problem, but you solve the meta layer, the meta problem. I have a bug and I need to debug it. Either you go debug and fix the bug, or you write a prompt that can then run and debug. And the next time you need to debug something, you just use the prompt. So investing in the meta layers and the compounding things, I think that's where it's very special and people should try to do that more. You should not use Claude Code to debug. You should figure out how it can debug the next time for you.
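(As an aside, the "run it multiple times" check is easy to automate. A small sketch with a placeholder standing in for the model call:)

```ruby
# Sketch: run the same prompt N times and flag inconsistency, since a prompt
# that only passes sometimes is not really passing. `call_model` is a placeholder.
def call_model(prompt)
  # In practice: the LLM call. Here we fake occasional flakiness.
  rand < 0.9 ? "inbox" : "brief"
end

def stable?(prompt, runs: 10)
  outputs = Array.new(runs) { call_model(prompt) }
  outputs.uniq.size == 1
end

puts stable?("Categorize: 'URGENT: contract attached'") ? "stable" : "flaky, tighten the prompt"
```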
That is where it's really special, where you can really build a compounding effect. And it's a similar thing with using an AI to write a prompt for AI. It's so cool, having it improve itself, and you just sit back, sip coffee, and it says, oh, I'm running this now, and four out of ten times it failed, let me look at the chain of thought. And then having four sub-agents look at every single run and synthesize it at the end. And it says, oh, it got tripped here, so maybe we should not mention CEO or blah, blah, blah there, because it removes it, and then run it again, let's see. And just going like that, I think that's so magical and it's so powerful. And I wish everyone in the world would try that out, because if you can get there, that's a true hook or aha moment of, this is amazing.

Yeah, you're so right. I liken it a lot to learning how to use the command line in general. There are just so many commands, and so many commands you can make, and aliasing and all this stuff that's just available that you just don't know. It takes some getting used to, like how the piping works, and it's similar with LLMs: all these little tricks of the meta aspects. The LLM can do all these things, the command line can do too, and you can compound it.

Yeah. It can get overwhelming also, because the workflows can change. You'll find another workflow that works better for you. And I'm loving how it's truly getting creative, where you can make the workflow work for how you work.

Yeah, absolutely. It's so flexible, and you don't have to do it how anybody else does, right?

It's also very overwhelming and very scary to a lot of people, because people like their habits and they are comfortable a certain way. And you do need to have a certain mindset to say, actually, I am not using an IDE anymore, I'm going to use the terminal. Some people use the terminal their whole lives, but for me, I'm a visual person. So just giving up a visual thing and going into the CLI in a terminal: yes, I know the terminal, but it's just where my stuff runs. I don't have Tmux and Vim and whatever stuff. I don't do that, but I am more comfortable now with Claude, because yeah, you need to experiment. And if you get more comfortable using it and seeing what it does and how it works, you can truly utilize all of these things that are in your terminal already that I have no idea about. But it's really cool because it can learn all these things and it can use all those things, and it's very powerful. Because, like you say, yes, if you have an app that's not very security-prone, you could just say, oh yeah, you can SSH into this machine and debug the production data. Why not? Because that's what I would do either way.

It's funny, with all this hype of Docker, it almost seems like Docker was made specifically for gearing up for this moment. We can just, you know, spin up a container and run this thing in isolation, right? And if it destroys itself, who cares?

Yeah, you can do whatever you want in this environment. Just do something. Yeah, absolutely. You can do many things. It's very overwhelming, but also very, very fun. So what I encourage is just to experiment and see what works, and also follow people that do cool experiments.
There are so many people on X that just do amazing things, and I'm like, oh. Or read some documentation on Anthropic's website. They have very good documentation about how to use Claude Code as well.

I want to dig into the fun aspect, because I 100% agree with you there. It just unlocks so much when you get to that creative mode. But before we do, digging back into the UI aspect of things, because I'm a visual person as well: what are you using to visualize maybe your workflows, or the evaluations you already have this great UI for? What aspects of the runs, and your workflow, are you using tools to help you visualize, in terms of work getting done? Are you just using GitHub? What are you finding lately?

I use GitHub for just the builds. For workflows, I use this gem called Stepper Motor by Julik, which was inspired by Heya from Honeybadger, but made more multipurpose. You create a journey, and there are steps in a journey. It's basically a workflow, and it's really nice. So I use that for my more complex LLM workflows, because he's also working on a UI, I heard through the grapevine. Then I can just look at a UI where I can see, hey, my AI workflow is in this step, this is the average for these steps, it failed here. I've looked at many, many background process things, but I like Stepper Motor, so I can recommend looking into that. He's actively working on it. I think it's a missing piece. If you are doing more complex LLM workflows, you just need some workflow management. You cannot just call a job that calls another job that calls another job. It will be a mess. And you have some visibility also, like a mission control for your workflows, basically. So that's what I use for workflows now. I just started using it. I did job calling a job calling a job before, and it's terrible.

Yeah, I agree with you there. It does make me wonder, I'd love to hear your thoughts on the process of a single agent spawning other agents, versus just a single-threaded agent that does work in steps. Are you leaning more toward the latter in that way? Breaking down stepwise things, even if they happen in parallel, but having a main process that manages that?

It depends on the use case. I think you need both. Very practically, for me, workflows are cheaper. I run a business. I don't want to pay hundreds of thousands of dollars per year in LLM costs. Today, models are still too expensive to just have an agent run and do whatever it wants. It is too expensive, even though it is cool. So for Cora, I use both. I think for engineering, clearly agentic. I don't think a workflow is anything you actually need for development, because you want flexibility, you want the creativity. I use agentic flows more for things like memory extraction or analysis of personality, because you need more creativity. You need something that's less generalized and more creative, adapting to what it needs at the moment. But if an email comes in and needs a category, it just needs a category. That is a very fine workflow. And also, I don't categorize everything. I get a tag from Google that says this is a promotion. Well, I'm happily using that tag, and the LLM spend and models that Google uses for the promotion tag, to also take that into account. So yeah, there's that.
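(An aside on the "journey with steps" idea: a rough plain-Ruby illustration of what a step-based LLM pipeline can look like. This is only a sketch of the concept, not Stepper Motor's actual API.)

```ruby
# Rough sketch of a "journey with steps" for an LLM pipeline, in plain Ruby.
# Illustrative only; not stepper_motor's API.
class EmailJourney
  STEPS = %i[fetch categorize summarize deliver].freeze

  def initialize(email)
    @email = email
    @state = {}
  end

  def run
    STEPS.each do |step|
      puts "running step: #{step}"
      send(step) # each step is small, inspectable, and could be retried on its own
    end
    @state
  end

  private

  def fetch
    @state[:raw] = @email
  end

  def categorize
    @state[:category] = @email.match?(/urgent/i) ? "inbox" : "brief"
  end

  def summarize
    @state[:summary] = @email[0, 40]
  end

  def deliver
    @state[:delivered] = true
  end
end

p EmailJourney.new("PROMOTION: urgent, 2 days left!").run
```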
And there's obviously a hybrid of the two, where you have an orchestration layer: an agent that can run sub-agents, and the sub-agents could actually just be workflows, because you already know what the steps are going to be and what you want out. So yeah, you can have both. I think you should familiarize yourself with all of them. But people that say "this is the way"? I don't think there is a way. I think it just depends on what you're trying to solve. And I know the Claude Code team and the Devin team had, on the same day, said opposing things, where one said workflows are the only way it works now, and the other said agentic is the only way it works. And in the end they were like, oh, actually this is just more of a clickbaity thing and we all agree, and we are inspired by all of it. So they all agree that there's no one way, and there's very healthy stuff to be learned from either side.

Yeah. I am also torn on this. It definitely works for some workflows and not others. Every time somebody asks me, or I ask somebody else, it always comes back to the same old "it depends" response. And I feel like this may be true of how to work with AI in general.

Yeah. I think one thing to add is, workflows are more deterministic than agentic flows, because there are more checks in place. But if the goal is that you want it to be more deterministic, and you go for a workflow for that reason, I would say think again, because agents plus evaluations might also be good. Maybe the quality with an agent and a good evaluation data set is actually higher than a workflow. But if you go for a workflow because you want to control the output, I think that's the wrong reason. You should go for a workflow because it's cheaper, or because it's better, not because it's more controlled. And you should go for an agent if that's better for you, but you need to have an evaluation framework in place, because that is kind of what a workflow also tries to do in a way: just make sure that things stay in shape and are controlled, which is the same as an evaluation framework. So if you go for that because of that reason, then think again.

I'm curious about the fun aspect here. I personally love experimenting with AI and Ruby in general because it's fun. And it's so expressive that it lets you do these wild things that I feel are a little more challenging in other languages, and maybe just not as fun because you can't read it as easily, right? So I'm curious, where is your fun level? What are you working on that's fun, or what have you seen that's fun?

I think the most fun part is that I have an idea, or a thought about a gem that I wish existed, and you can just decide, okay, today I'm going to have fun building a gem. And then at the end of the day, the gem is done and is usable. For me, just creating that is amazing.
And for me, the fun part is actually going into the details and seeing what the architecture is like. I'm not vibe coding here, I'm just lazy coding, but I have a very clear vision of how I want to build gems and how they should be structured. I'm just relaxing, watching it code a whole gem, and I'm like, oh, we could add this layer here, or we could do all these things there, or this abstraction, or adapter pattern. Or, can you research five ways to improve this? And then I'm like, oh, can I learn something that's interesting, like a service object approach, or an adapter, or whatever pattern. I'm a self-taught engineer. I studied classical composition and was a film composer before, that's my background. I never formally learned any of these patterns or things. I've been an engineer for many, many years, so I have lots of experience working with frameworks, but I never really learned these things. So for me, it's fun creating something like that, because I learn new things, or how things work. Or I use it to explore an already existing gem from someone. I say, oh, I really like this gem, what is the design they follow? And I just check it out with git and jump in with Claude to ask, what kind of vibe does this gem have, what kind of design language is this? I really like this, what is it called? And then it starts to explain these things to me. So for me it opens the doors, even to Rails or other libraries, to just dive into the source code more, and it holds my hand. Because yeah, I can do it and read it, but it's so much easier with an LLM, and then say, okay, I like all these things, can you summarize everything, what the design pattern is, can you try to follow these patterns in my own gem, and then see what it does. So for me, it's amazing.

Yeah. "How does this work?" is one of my most common prompts. Just jumping in, like, this is a 500-line class, I'm not going to read through that to know all the things. Just, okay, well, how does this work? Summarize the main points. How does this work with other classes? That aspect. And if you extract that out to higher-level concepts, like how does concurrency work in Ruby, huge value for sure.

Yeah. Change it to what you think. Also fun.

Yeah, for me that's fun too. I love creating, so for me, anything that I can create is a lot of fun. I think the fun part is mostly because it's not forced. Yes, I love building a company, that's also fun, but I also have to do certain things, and if there's a bug, I need to solve the bug. So it's also not having the pressure. If you want to try things, just do side projects, lots of side projects, and just do stuff.

Yeah. I mean, the value of all this generative stuff is you can keep generating. So you can start a side project and let it go, come back to it at any time, or just restart it. I find myself often just completely deleting whatever it created and starting over, because it's cool.

Yeah, 100%. It just saves time, I think, going down the wrong path.

Yeah, absolutely. I've done that too. And those moments can be frustrating.
Yeah, and I think in those moments also, lean into your strengths as a software engineer and your intuition, and you can make sure that doesn't happen by just doing the stuff you would do as a normal engineer. For example, do research, have an ERD ready, or some plan, or do some kind of research yourself, or at least think through things. And if you don't understand everything, make sure you understand everything, and don't just let it do things, because that's what it does. It's really good at just doing things. But make sure whatever the LLM, or Claude, or whatever you use, is producing aligns with your vision. That's why I like to start with the readme, because a readme is a pretty clear vision. At least the DSL, or the design philosophy, is a vision.

Yeah, I love that readme aspect. Before AI, I would often create a Ruby interface first, before it existed, just as a way to visualize how the things will work together, and then go back and test-drive those aspects of it. I love Gary Bernhardt's Destroy All Software; I definitely drew a lot of inspiration there in that way. I'm curious, now that you've built this business on Ruby and Rails, where do you find the value of working within Rails or Ruby that aligns well with building these AI systems? Where are the true "oh yeah, I wouldn't do this in another framework or language, because it works well in this way" moments?

I love writing Ruby and using Rails. I have another engineer working with me now, but I did it alone for months. So I want to work in something that I like to work in. That's my personal take. But also, it's very well aligned with the single-founder paradigm. Rails is a one-person framework, and that is only enhanced by adding AI, because that means as a single person you can just do way more work, which is even more cool. So it just augments the values of Rails in a philosophical sense: you can do even more using AI with this one-person framework. In that way, it aligns very well. But also, if you write good code and good abstractions and you test things, it's very good. And the beauty is, it's clear what the right way is, because we have conventions already, and you can just say to an LLM, just follow these conventions. Yes, you can use StandardRB and stuff like that as well, but it's not going wild, and it kind of knows what the conventions are. And if it doesn't, you just say, go to the Rails documentation and research best practices on this thing, and it just searches the web. There are not very opposing views on most things; it's pretty clear, there are a few options for most things, and it works very well for that. I did start with Jumpstart Pro as a starter template, which I love. It's from Chris Oliver, and it just means it already has accounts built in and all of that stuff, and I don't need to deal with all of that. It saves so much time. And starting with a starter project is the way to go. I think people say, actually, you don't need starter projects because AI can do it so quickly. I disagree, because it will crash very quickly. Or I'm sure it can do it, but it's very easy to make a wrong turn somewhere.
And if you have a starter project, it's already grounded in good taste and decisions to a certain degree, and it's yours to add your value or your business on top of, which I think is also very good.

Yeah. I'm curious about Context7 as an example. All this MCP stuff is just so useful, almost like RAGging the ecosystem in a way, on a granular level even: hey, check out these Rails guides for this aspect, look up the documentation. To be honest, Rails is so well documented. I think historically there has been some blowback in the community, like, oh, we need better documentation and things like this, but compared to a lot of other things...

Yeah, I think the starter part, like how to learn Rails for the first time, was always a little bit lacking, but that's good now as well. So yes, Context7 is great for pulling in documentation and grounding. What I've used most is actually just saying to Claude in my CLAUDE.md: use bundle open and just open the gem and see how it works. Because there's YARD in there, it works so great. For example, Stepper Motor has documentation, but it's also updated almost every day, or every week. So why not just open the gem? It's very, very good at just opening the source code and understanding how it works. It's my favorite way to add context now. And it just does it automatically now if it needs to know how certain things work.

I think I actually did see your X post about that and adjusted my own workflow. It really does improve it. Because you don't even need documentation at that point. It just knows how it works because it just...

Self-documenting workflow.

Yeah, it's basically self-documenting. And if you write good code, and if you use good names and good structure, it will only improve. So if you use weird gems that do weird stuff, probably that's not working, but if you have it nicely structured, and maybe YARD in there as well.

Have you started documenting in the code, like signals for LLMs to pick up on and read? As an example, the YARD notation can help in that way. Have you found other cheats, like, hey, look at this?

So for service classes, for example, I just use plain old Ruby objects, but I always add class-level YARD documentation with examples. And that works great. I wouldn't have done that before, but now, whenever you're building something, you just say, add YARD at class level with examples, and it's done. It's very easy.

I know the examples are definitely key.

Yeah, you need to have the examples, because then you can communicate business logic, or some contextual thing that's a little bit higher level: this is how you would use this class from different places and call it in different ways. So I think that works very well. And it's also really great for reviews. If someone, or an LLM, does that, I can review it, and basically you're reviewing the DSL and how it will be used, which is nice. That's nice to review, and you can always go into detail, but it gives a nice overview of what it does.
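(As an aside, here is a small example of the class-level YARD documentation pattern on a plain old Ruby object. The class and its logic are hypothetical.)

```ruby
# Sketch: class-level YARD docs with usage examples on a PORO service object,
# so both human reviewers and LLM agents can see how it is meant to be called.

# Categorizes an incoming email and decides whether it stays in the inbox.
#
# @example Categorize a single email
#   EmailCategorizer.new.call("URGENT: contract needs a signature")
#   # => :inbox
#
# @example Use inside a background job
#   category = EmailCategorizer.new.call(email_body)
#   enqueue_brief(email_body) if category == :brief
class EmailCategorizer
  # @param body [String] the raw email body
  # @return [Symbol] :inbox or :brief
  def call(body)
    body.match?(/urgent|invoice|contract/i) ? :inbox : :brief
  end
end
```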
There's so much more I would want to dig into with you here, more specifically around evals too. I love that the Leva gem kind of just has this Ruby execute way of encapsulating whatever metric you want to create, which is really interesting and awesome. So I'm curious: for people starting out, who have these prompts and things that they're doing and want to measure something about them, where do you recommend they dig in, and where should people be focusing their attention?

You need to implement your own evaluator. So you need a way to say whether something is good or not. I'm not doing any of that; I just provide you a framework. And the hardest part, honestly, is finding a number or metric that you want to optimize for. Categorization is easy, because it's either the right category or not. But if it's writing a response to an email, how do you measure whether that's good or not, right? How I do that is we track what people actually send versus the drafted version, and then we have a Jaccard similarity score, which is basically how many words match, or how similar the message they actually sent is to the message that we generated. That's a number from zero to one, which is great, because a number from zero to one is very good for evaluations. So think of ways like, hey, how can I capture value, or an output, as a number from zero to one, with one being the best and zero the worst? That's really how you need to do it. But you can do it other ways. You can use an LLM as a judge. This is another way, where you just have an LLM with a prompt that says, hey, you're a judge, see if this is well written, blah, blah, blah. Okay, just make your own prompt. This is what was written, and you have to rate it on a score from 1 to 10. Then you just normalize it to 0 to 1, and give me the reasons why, or have 10 things that you check for. That's an LLM as a judge, and people do that. I never did that, because I feel that's another layer, another thing that you need to check again, because does that prompt work? You don't know. So I would always steer towards hard-coded, more logic-based evaluation.

Getting a discrete metric out of it is always better.

It's always better. Yeah, but sometimes you cannot do that, and you can use an LLM as a judge, and you can set it up so that it works pretty well, but it's just more work and more things can go wrong.

Is there somewhere that you look for inspiration on how to get these metrics? As an example, summarization: is there a place that you look, oh, is there an algorithm that already generates this kind of metric to see how well it summarized the aspects of this lengthy document? Where does your thought process lead?
Yeah, I just jam with Claude or ChatGPT to see if there are any things out there, and do deep research on what the science is behind these things. The Jaccard score was one of those things. I had never heard of it, but I was like, well, okay, that sounds like what we need, so let's do that. That's how I arrived there. But at the same time, there's also an art to it, and that's fine. It is fine to spot-check things and vibe-check things and just feel things too. It's not all numbers. It is okay to read something and say, ah, this doesn't feel right, and just go in and change little things and see if it feels better. It's both, and I think that's important. It's not only numbers, because if the number says a certain thing, it doesn't mean it's good if your metric is not good. So just realize it's one piece of the puzzle, but it's a very important piece, and you need all the pieces. You need the art of understanding what's good and what's not, as well.

Yeah, it fits together. Do you find yourself reviewing maybe the evaluations? Do you have an annotation process? Is there any aspect of the review and that vibe spot-checking where you feed it back into itself, or where do you fall in that category?

So Leva doesn't have annotation built in, but I have it in my own app. I just add records to certain data sets, which is kind of the annotation. So yes, I do annotate things. Either it's automatic, where there is a signal from a user that returns something to their inbox, for example. That's a signal that the categorization was not right, so it will be added to a data set, and then you can run an experiment on that new data set, where maybe I say, okay, maybe I change something here, will that improve things? And then I add some emails that were right, so there's a good mix of both. Does it still stay good there as well? Does it overall improve? So there's that. But there's also: I run it, I see the one that failed, and I see, did it fail for a very bad reason, or for a reason that's reasonable? If it's a very bad reason, I go in and fix it. But if it's, okay, for example, it's categorized as needing to stay in your inbox, but we don't know how to draft a response because we don't have context, that's not too bad, because, yeah, maybe I'd argue there is enough context, but it's not a terrible miss. It's the wrong category, but it's not as wrong as something that was very important and got unsubscribed from. That's a very bad one. So the spot-checking isn't just, yeah, the number says it's not the right category, but how bad is that miscategorization? And if there is one, I just jump in, and in Leva you can experiment on that specific one, go to the prompts that it used, iterate on the prompt, and make new versions. So yeah, that's how I do it.

Do you find that triaging aspect could be solved by an LLM eventually? Or is that not something you would even foresee as worthwhile?

Yes. I'm already, like, half doing this with an LLM, so yeah, absolutely. It's just that Leva can make it a little bit easier, where you can maybe have a button that says, hey, look at all the failures, analyze what went wrong; those should already be in there. Or run this in Gemini 2.5, but also run it in 4.1, or something like that, and explain to me what the differences are, or what the most different things are. I would love these things to be in Leva as well, because that's what you want to do.
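(As an aside, the Jaccard similarity described above is straightforward to compute over word sets. A minimal sketch:)

```ruby
# Jaccard similarity between the drafted reply and the reply the user actually
# sent: |A ∩ B| / |A ∪ B| over word sets, giving a 0..1 score for evals.
require "set"

def jaccard(a, b)
  set_a = a.downcase.scan(/\w+/).to_set
  set_b = b.downcase.scan(/\w+/).to_set
  return 1.0 if set_a.empty? && set_b.empty?

  (set_a & set_b).size.fdiv((set_a | set_b).size)
end

drafted = "Thanks, I will review the contract tomorrow."
sent    = "Thanks! I'll review the contract tomorrow morning."
puts jaccard(drafted, sent).round(2) # closer to 1.0 means the draft was close to what was sent
```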
That's what you need. And then after that, from those learnings, can you actually rewrite the prompt, obviously, and then rerun it? There can be a loop. There can be an agentic part to it. I think that will be in Leva very soon, or sooner rather than later, because I do that already.

Analyzing swaths of disparate information is what it's great at.

Exactly.

Yeah. I'm always looking for ways to integrate that. It seems like somebody out there is going to make a Rails gem that can just watch the logs, or ActiveSupport, and create analysis over it. Fix this, or alerting, just an observer for the entire Rails stack.

It's coming. I have it. Oh yeah, it's already there in ways. I use AppSignal, and I just link to their API with a custom MCP in Claude. For me, I already do that. And I can also live tail the logs and analyze things in Claude Code, which is also nice. So I personally already have that, but I'm still figuring things out. But yeah, I do that already. There should be a tool that makes it this easy, like AppSignal, for example.

And see the UI, you know? The visual. It's all about the visual. And Rails makes it easy to make those. So I'm looking forward to just seeing, honestly, what people are creating.

Yeah, let us know. We want to use it. We want to pay. We do. We totally want to use it. It's not hard to create new things. Please build.

We've gone through a lot here, Kieran, and I appreciate you coming on and telling everybody about your insights, building the one-person business. I imagine you're maybe a little bigger than one person at this point, but maybe not.

Two. Two people.

Two, okay. Still impressive what Rails can enable people to do, and AI too. It's really cool.

We launched to the general public last week, and I built a whole company in less than a year, which is really, really cool. I had some help along the way for sure, but I wrote 80% of the code.

You know, I started using it myself just to see how it works, and it honestly works great. It does everything I wanted it to do. So if you can get it to work in the Hey app too, that would be awesome. But, you know, I know Gmail's probably your broader audience.

For now. We'll add every other tool, but the email layer has IMAP support, so everything is built around that.

Oh, that's nice.

So it's coming soon. We have enough people with Gmail for now, but yeah, we will support other providers like Outlook and IMAP as well.

Well, we're coming to a close here. Is there anything else you wanted to mention before we move into kind of our pick segment?

No. Thanks for doing an AI and Ruby podcast, because I feel we need more energy and more enthusiasm in the Ruby and AI space. I know there are a bunch of people that are really cool, but I think there can always be more.

Yeah, 100%. I've been trying to get this started for over a year now. I met the right people. I'm just a talking head. I love meeting people that are working on stuff that I'm not even thinking about, and just exposing it to the rest of the Ruby world, because that's what, honestly, Ruby is great at. I miss the Ruby mailing list aspects of spreading stuff, and it's just moving too fast for that medium. Well, at the end of the show, we like to let you share anything that you want, something that you found useful or that you want others to know about. Anything at all.
I'm going to share something that is not Claude Code, but it is an agent. Before Claude Code came out with their unlimited plan, I didn't use it as much, because I was like, I'm not going to spend five dollars on a git commit. It's just too much, because it was pretty expensive before. And I got connected with someone through Every, where I work as well, and they said, ah, try this out. It's an agentic builder, and it's called Friday. And it's kind of fun, because Friday is also similar to Claude Code, a CLI tool, but it's very opinionated. It's not like Claude, where you can do whatever you want and it's very modular. They just say, no, it's not modular, it's very opinionated, this is how we do things. But I kind of like it, and I love it, so I use it as well. And it's really good at just giving it an issue and saying, just implement this, go YOLO mode. And I have moments where I use it for implementing things with Figma; they have a Figma integration. They don't expose every MCP, but they said, no, the Figma one we'll add. They make decisions. So if you want to try out a different kind of agent, give it a try; there's a free trial as well. It's interesting to see how others think about agentic work, and yeah, there are many, but I think this one is kind of cool because it's so opinionated, and it might resonate with Rails developers, where we are also very opinionated about how to do things. So that's my random pick. It's also a very small team; there are just a few people working on it, which I think is always cool. A small team doing things that compare with multi-billion-dollar companies. So that's awesome.

Yeah, I'm going to have to check that out. It looks fun.

Codewithfriday.com is the URL.

Awesome, thanks for sharing that. I'm going to share this; it was shared in the Ruby AI Builders Discord. I don't know if it's an LLM on its own or not. Yeah, I think it is a model. It's called Morph LLM, and it basically is a way to analyze diffs of files and apply them, specifically for usage with LLMs. And it's super fast, I think 4,500 tokens per second. So I would highly recommend checking it out. I've been using it in a couple of things as part of my workflow, and it's been working great. It's morphllm.com. Well, thanks for coming on, Kieran. It was great hearing about all the lovely contributions you're making to the community with Leva and Cora Computer. Looking forward to seeing all the great stuff that comes out of that.

Thank you so much. Yeah, happy to share, inspire, and everyone go build stuff and try things out. Have fun!
