I just finished reading Emily M. Bender's & Alex Hanna's book The AI Con: How to Fight Big Tech's Hype and Create the Future We Want, and I come away with mixed feelings. They're mixed because they provide a comprehensive takedown of technology that I use (and that I'm asked to use) on a daily basis, but I'm not sure how much of it I buy at the moment.
I’ll be honest - when I first heard about GNU Guix, it sounded intimidating. Another package manager? With a functional programming twist? And what’s all this about “time machines” and “channels”?
But as it turns out, Guix solves a problem I’ve been wrestling with for years: how do you ensure that your development environment works exactly the same way across different machines and points in time? You know the drill - you set up a project on your laptop, it works perfectly, then six months later you try to run it on your server and nothing works because package versions have changed, dependencies have shifted, and you’re stuck playing detective to figure out what broke.
What if LLMs Learn Relations Like Humans Do? I've been thinking about how behavioral psychology might explain AI capabilities. Here's my working hypothesis:
I think emergence in LLMs comes from relational diversity. In this case, relations are the concept of verbally connecting things (stimuli, events, concepts, etc) in some way. This typically takes the form of comparison or hierarchy or many others.
Effectively, we can think of relations as kind of a graph, relating concepts to one another.
This document lives in several places for accessibility,
GitHub Google Docs (for comments) My blog Introduction The rapid integration of advanced AI capabilities into everyday applications has brought significant improvements in efficiency and user experience. However, it has also introduced new security challenges that demand our attention. In this study, we examine the potential vulnerabilities in AI systems that combine language models with external tools, focusing on Retrieval-Augmented Generation (RAG) in customer support scenarios.
For a long time, I’ve been interested in with web technology. In high school, I read Jesse Liberty’s Complete Idiot`s Guide to a Career in Computer Programming learning about Perl, CGI (common gateway interface), HTML, and other technologies. It wasn’t until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.
At that point, folks were just starting to come out of the dark ages of table-base layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript).
Data Scientists often need to sharpen their tools. If you use Python for analyzing data or running predictive models, here’s a tool to help you avoid those dreaded out-of-memory issues that tend to come up with large datasets.
Enter memory_profiler for Python This memory profile was designed to assess the memory usage of Python programs. It’s cross platform and should work on any modern Python version (2.7 and up).
To use it, you’ll need to install it (using pip is the preferred way).
As a data scientist, it really helps to have a powerful computer nearby when you need it. Even with an i7 laptop with 16GB of RAM in it, you’ll sometimes find yourself needing more power. Whether your task is compute or memory constrained, though, you’ll find yourself looking to the cloud for more resources. Today I’ll outline how to be more effective when you have to compute remotely.
I like to refer folks to this great article on setting up SSH configs.
Today we’re going to talk about what a Bloom filter is and discuss some of the applications in data science. In a later post, we’ll build a simple implementation with the goal of learning more about how they work.
What is a Bloom Filter? A Bloom filter is a probabilistic data structure. Let’s break that term down. Any time you hear the word “probabilistic” the first thing that should come to mind is “error.
I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we’re now living in the “golden age of data science instruction” and learning materials. I couldn’t agree more with this statement. Each month, most publishers seem to have another book on the subject and people are writing exciting blog posts about what they’re learning and doing.I wanted to outline a few of the books that helped me along the way, in the order I approached them.
I’ve recently been reading a great book on how people make decisions and what organizations can do to help folks make better choices. That book is Nudge.
What is a nudge? The authors describe a nudge as anything that can influence the way we make decisions. Take the primacy affect, for instance, namely the idea that order matters in a series of items. We’re more likely to recall the first or last option in a list of items simply because of their positions.
I used to be a person who would get jealous at others, namely their technical ability. If I thought that the person I was working with were better at math or programming compared to me, it’d cause a drive in me to get better at both of those. I’d pour myself into books on the relevant subjects to try to enhance my ability. I’d work on projects to try to get familiar with these advanced techniques.
Multitasking is a fallacy. Most of the time when we think we’re optimally getting things done by working no multiple tasks or even multiple projects, we’re selling ourselves short.
You’ve all worked with a programmer like this, they’re the person who freaks out or makes a snide remark each time you walk up to them with a question and didn’t “use the proper channels. Put it in an email or a ticket and if you walk up to me again with your problems I’ll…”
Python is a great tool to have available for all sorts of tasks, including data analysis or machine learning. It’s a great language to start off with if you’re a beginner, and there are loads of tutorials out there. So, if you’re a neophyte Pythonista, head over there and come back here later.
Additionally, plenty of great developers have been working on tools that just get the job done, including pandas for wrangling your data (and turning it into something that looks like a spreadsheet), as well as Scikit-Learn for running anything from basic statistics to more complex learning algorithms on your data.
When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and rarely Excel. (If you haven’t, there’s a good list to look up on Wikipedia.)
I won’t argue that taming data doesn’t take good tools, but what I will argue is that the tools you use depend on the scale of your data.
I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box.
There are plenty of mathematical techniques out there for analyzing data but to effectively apply them to your particular data, your data needs to fit the following format:
Data consists of rows and columns Your data should be viewable using any common spreadsheet application Each row represents an instance of data (in other words, each row represents one object under study be it a person, a spammy email, or a photograph with a face to be recognized) Each column represents a feature or something that we can use to describe the instance (and this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or a length of a detected edge in a picture of a face) When you encounter some new data, it’s best to strive to fit it into that framework.