North Bay Python 2024
Remarks by the North Bay Python Organizers
Do you want to create a Python project?
Do you want it to be good?
There have been lots of talks and posts and books about how to improve your project by adopting this or that tool or technology, by adopting a best practice. We all want to improve. We all want our code to be more correct. Faster. Friendlier. Easier to use. Better documented. Better tested. To have better coverage. Easier to contribute to. Easier to work on. More welcoming as a community. More repeatable. Less flaky. More secure. More sustainable. Easier to install. Easier to deploy. Easier to manage. Easier to discover. Easier to discuss, to report issues, to investigate issues, to triage issues.
What if we did all the things though? How could we make our project perfect?
In this talk, I will present a comprehensive review of the best practices available to Python projects; a few are specific to open source, but most apply to just about anything written in Python. In addition to learning about many, many best practices, their importance, and the tools available to facilitate them, I will also propose some solutions to the overwhelming sense of existential dread that one might feel when confronted with all of them at once.
Vector embeddings are a way to encode a text or image as an array of floating point numbers, and they make it possible to perform similarity search on many kinds of content. Let's try to wrap our head around vector embeddings and similarity spaces by exploring them visually! We'll compare different embedding models, different quantization schemes, and different input modalities, using open source tools that produce graphs and charts. Come on a vector voyage!
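To make "similarity search" concrete, here is a minimal sketch that is not taken from the talk. It ranks items by cosine similarity, one common measure for comparing embeddings. The three-dimensional vectors and the `most_similar` helper are made up for illustration; real embedding models emit hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "bridge": [0.0, 0.2, 0.9],
}


def most_similar(query, vectors):
    """Rank every other key by its similarity to `query`, best first."""
    target = vectors[query]
    scored = [
        (name, cosine_similarity(target, vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("cat", embeddings)
```

With these toy vectors, "kitten" ranks ahead of "bridge" for the query "cat" because its vector points in nearly the same direction.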
Many Python applications use Postgres as their backing database - and that means, when we think about our application performance, we have to think about the database. In this talk we'll dive into best practices for optimizing Postgres queries, aimed at application developers.
Starting with EXPLAIN ANALYZE (which shows you a Postgres query plan), we'll walk through common situations like how to debug a bad query plan by looking at row estimates, understanding JOIN order, the role of indexes (and which index types exist in Postgres) and how we can understand how much data is being loaded by the database behind the scenes.
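One way to act on the row-estimate advice is to compare each plan node's estimated row count with the actual one. The sketch below is illustrative and not part of the talk. The plan text is a made-up sample in the text layout that EXPLAIN ANALYZE prints (details vary across Postgres versions), and `row_misestimates` and its `threshold` are hypothetical names.

```python
import re

# Matches the "(cost=... rows=N ...) (actual time=... rows=M loops=L)" part of a node line.
NODE_RE = re.compile(
    r"\(cost=[\d.]+\.\.[\d.]+ rows=(?P<est>\d+) width=\d+\) "
    r"\(actual time=[\d.]+\.\.[\d.]+ rows=(?P<actual>[\d.]+) loops=(?P<loops>\d+)\)"
)

# Made-up sample output, for illustration only.
SAMPLE_PLAN = """\
Nested Loop  (cost=0.29..2500.00 rows=500 width=16) (actual time=0.030..95.000 rows=50000 loops=1)
  ->  Seq Scan on orders  (cost=0.00..1834.00 rows=500 width=8) (actual time=0.015..12.000 rows=50000 loops=1)
  ->  Index Scan using users_pkey on users  (cost=0.29..1.30 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=50000)
"""


def row_misestimates(plan_text, threshold=10.0):
    """Return (node, factor) for nodes whose estimate is off by >= threshold."""
    findings = []
    for line in plan_text.splitlines():
        match = NODE_RE.search(line)
        if not match:
            continue
        # Clamp to 1 so a zero-row node does not divide by zero.
        est = max(float(match["est"]), 1.0)
        actual = max(float(match["actual"]), 1.0)
        factor = max(est / actual, actual / est)
        if factor >= threshold:
            node = line.split("(cost=")[0].strip(" ->")
            findings.append((node, factor))
    return findings


suspects = row_misestimates(SAMPLE_PLAN)
```

In this sample, the sequential scan and the join above it are both off by a factor of 100, which is the kind of gap that often explains a poor join strategy.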
Whether you work with Postgres day to day, or only think about it when there is a slow query problem in your Python application, this will help you have a more productive relationship with the database. And even if you use a different database, a lot of the principles of debugging apply universally.
Cocktail Robotics provides an opportunity to explore an entertaining mix of technology, human-computer interaction, art, and performance. We will provide a brief history with potentially amusing stories, and an exploration of using Python to control robots.
You might have heard of Hypothesis - a testing library which has been generating test inputs and finding bugs for eleven years now, and is used by about 5% of all Python users (pytest is about 50%). But have you seen the more advanced tricks?
- GASP as "hypothesis write" generates the tests themselves! (without a language model)
- THRILL at our new observability tooling - you'll never wonder what happened again!
- BRACE YOURSELF for a workflow with coverage-guided evolution, the black art of SMT solving, and a distributed database!
Come one, come all, and enjoy a live demo that you won't soon forget! With a little luck, you might even find something practical to take away for your own testing....
A bridge can both literally and metaphorically connect people of different communities together.
There are many reasons why distinct communities exist within the software space, some good, some bad, some natural. For example, "baby duck syndrome" denotes a situation where a computer user "imprints" on the first system they learn. This can lead to identifying within the context of that community.
Within the Python space, we can see small and large communities converging around a certain package or framework. These communities can be vibrant, supportive, and generally wonderful.
I suggest ways in which we can "build bridges" from these communities in order to learn from each other, promote diversity, and break down barriers. This will not only benefit those of us within these communities, but also newcomers looking to find their footing.
We’re hopefully all on board with writing documentation for our projects. However, especially with the rise of supply-chain attacks, there are some aspects of our projects that we really shouldn’t document, and should instead remediate as vulnerabilities. If we do document these aspects of a project, it may help someone compromise the project itself or our users. In this talk, you will learn why some aspects of documentation may help attackers more than users, how to recognize those aspects in your own projects, and what to do when you encounter such an issue.
We introduce marimo, an open-source reactive notebook for Python that addresses several common complaints about first-generation notebooks.
marimo notebooks are reproducible, with a reactive runtime that eliminates hidden state; interactive, with UI elements that are automatically synchronized with Python (no callbacks); expressive, supporting markdown that can be parametrized by arbitrary Python values; stored as pure Python files, so they are Git-friendly; executable as scripts; and shareable as web apps or WASM-powered static HTML.
marimo is used today by scientists and developers at several companies and research institutions, including SLAC and Stanford.
A reactive programming environment
marimo keeps code, outputs, and program state consistent. Run a cell and marimo reacts by automatically running the cells that reference its declared variables. Delete a cell and marimo scrubs its variables from program memory, eliminating hidden state.
Our reactive runtime is based on static analysis, forming a dataflow graph based on variable declarations and references. To ensure the dataflow graph is well-formed, marimo imposes two constraints on user code: variables can be defined in at most one cell, and cyclic references across cells are disallowed.
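The idea in the paragraph above can be sketched with the standard library alone. This is an illustrative approximation, not marimo's actual implementation: `defs_and_refs`, `build_graph`, and `run_order` are hypothetical names, and the analysis only looks at plain variable names (it ignores imports, function definitions, and scoping).

```python
import ast
from graphlib import TopologicalSorter


def defs_and_refs(source):
    """Return (names assigned, names read) anywhere in one cell's code."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                refs.add(node.id)
    # Names a cell defines itself are not dependencies on other cells.
    return defs, refs - defs


def build_graph(cells):
    """Map each cell to the set of cells whose variables it reads."""
    analysis = {name: defs_and_refs(src) for name, src in cells.items()}
    owner = {}
    for name, (defs, _) in analysis.items():
        for var in defs:
            # Constraint 1: a variable is defined in at most one cell.
            if var in owner:
                raise ValueError(
                    f"{var!r} is defined in both {owner[var]!r} and {name!r}"
                )
            owner[var] = name
    return {
        name: {owner[var] for var in refs if var in owner}
        for name, (_, refs) in analysis.items()
    }


def run_order(cells):
    """Cells in dependency order.

    Constraint 2: graphlib raises CycleError (a ValueError) on cyclic references.
    """
    return list(TopologicalSorter(build_graph(cells)).static_order())


# Cells deliberately listed out of order.
cells = {"c": "print(x + y)", "b": "y = x + 1", "a": "x = 1"}
order = run_order(cells)
```

Here the order comes out as a, b, c regardless of how the cells are listed, because each cell only runs after the cells that define the variables it reads.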
The marimo library
marimo is both a notebook and a library — importing the marimo library provides the user with utilities for authoring dynamic markdown; creating interactive UI elements; rendering progress bars; and more.
marimo's interactive elements feed into reactivity: interacting with elements such as sliders or selectable plots automatically sends their values to Python and triggers execution of cells referencing variables bound to the interacted-with elements. We extend this rule to support higher-order elements such as submittable forms, dictionaries, and arrays of constituent elements.
A pure Python file format
marimo notebooks are stored as pure Python files, designed so that small changes in notebook code yield small diffs. These files are also executable, with cells run in a topologically sorted order. We discuss the design of this file format, as well as trade-offs made.
Shareability
marimo is easily shared: notebooks can be run as read-only apps from the command line, and exported as interactive WASM-powered static HTML.
As Large Language Models (LLMs) gain trust across various sectors for tasks ranging from generating text to solving complex queries, their influence continues to expand. Yet, this trust is shadowed by significant risks, such as the subtle yet serious threat of data poisoning. This talk will delve into how deceptively crafted data can infiltrate an LLM’s training set, leading these models to propagate errors, biases, or outright fabrications—a real challenge to the integrity of their outputs.
While there are various algorithms and approaches designed to mitigate these risks, this session will focus particularly on the Rank-One Model Editing (ROME) algorithm. ROME is notable for its ability to edit an LLM's knowledge in a targeted manner after training, providing a means to recalibrate AI outputs. However, it also presents a potential for misuse, as it can be employed to embed false narratives deeply within a model.
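The core trick, changing what a weight matrix returns for one input by adding a rank-one matrix, can be shown in a few lines. This is a simplified sketch and not the published ROME procedure, which also weights the update by statistics of other keys and chooses the key and value vectors by optimizing against the model. The names `matvec` and `rank_one_edit` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k . k), so that W' k equals v.

    Inputs orthogonal to k are mapped exactly as before the edit.
    """
    residual = [vi - yi for vi, yi in zip(v, matvec(W, k))]
    kk = sum(ki * ki for ki in k)
    return [
        [w + r * ki / kk for w, ki in zip(row, k)]
        for row, r in zip(W, residual)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]   # toy "layer" weights
k = [1.0, 0.0]                 # key: the input whose output we want to change
v = [0.0, 5.0]                 # value: the output we want for that key

W_edited = rank_one_edit(W, k, v)
```

After the edit, the key maps to the new value while the orthogonal input [0.0, 1.0] still maps to [0.0, 1.0], which is why such edits can be both precise and hard to notice.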
Key Discussion Points:
- Why People Trust LLMs: Exploring the reasons behind the widespread trust in LLMs and the associated risks.
- The Art of Data Poisoning: A closer look at how maliciously crafted data is inserted into training sets and its profound impact on model behavior.
- Focus on ROME: Discussing how the Rank-One Model Editing algorithm can both safeguard against and potentially contribute to the corruption of LLMs.
- Ethical Considerations: Reflecting on the ethical implications of manipulating the knowledge within LLMs, which requires not just technical skill but also wisdom and responsibility.
This presentation is designed for data scientists, AI researchers, and Python enthusiasts interested in understanding the vulnerabilities of LLMs and the tools available to protect these systems. While acknowledging other algorithms and methods, this talk will provide a quick demonstration of ROME, offering insights into its utility and dangers.
As people continue to integrate LLMs into everything, we must remain vigilant against the risks of data manipulation. This session challenges us to consider whether we are paying enough attention to these threats, or if we are, metaphorically, just fiddling while Rome burns—allowing foundational trust in data to erode.
Join me in this exploration of ROME, where we navigate the fine balance between correcting and corrupting the digital minds that are—whether we like it or not—becoming an integral part of our technological landscape.
Close of proceedings, Saturday
Remarks from the North Bay Python Organizers
Python is great! It's been a mainstay of web development and systems programming for decades and is on the cutting edge of many fields like scientific computing. But there is always more to improve, both in the language itself and how we use it. This talk will look at how ideas and features from other languages like Ruby, Go, and PHP could be used to improve Python!
A controlled environment and consistent dependencies are crucial to writing good and – most importantly – relevant tests in Python. While the advent of APIs has made using external services so much more accessible, APIs can lead to flaky or deceptive tests, ultimately putting applications at risk. In this talk, you will learn how to use Python’s Mock object to create more reliable stand-ins for APIs beyond your control … all within the unittest framework.
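As a small taste of the approach, the sketch below swaps a remote weather API client for a `Mock` inside a `unittest` test case. It is not taken from the talk: `current_temperature`, `get_weather`, and the `temp_c` field are hypothetical names invented for the example.

```python
import unittest
from unittest.mock import Mock


def current_temperature(client, city):
    """Code under test: formats the reading returned by an API client."""
    payload = client.get_weather(city)
    return f"{city}: {payload['temp_c']:.1f} C"


class CurrentTemperatureTest(unittest.TestCase):
    def test_formats_reading(self):
        # The Mock stands in for the real client; no network call is made.
        client = Mock()
        client.get_weather.return_value = {"temp_c": 18.5}

        self.assertEqual(
            current_temperature(client, "Petaluma"), "Petaluma: 18.5 C"
        )
        client.get_weather.assert_called_once_with("Petaluma")

    def test_propagates_outage(self):
        # side_effect lets the stand-in simulate a failure on demand.
        client = Mock()
        client.get_weather.side_effect = TimeoutError("upstream timed out")

        with self.assertRaises(TimeoutError):
            current_temperature(client, "Petaluma")
```

Saved in a test module, this runs with `python -m unittest`. Because the stand-in is fully controlled, the tests behave the same whether or not the real service is reachable.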
In this talk, aimed at Pythonistas of all levels, we will learn how ActivityPub, the protocol underlying distributed social networks, powers the open web by building an ActivityPub server in Python. The concepts that make up ActivityPub will be taught through concrete Python examples rather than abstractions, with the hope that attendees leave with a better handle on the technologies powering Mastodon, PixelFed, and the rest of the Fediverse.
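For a sense of what such a server exchanges, here is a minimal sketch, not taken from the talk, that builds a "Create" activity wrapping a "Note" in the ActivityStreams vocabulary. The example.org URLs and the `create_note_activity` helper are hypothetical, and a real server would also need delivery, signing, and inbox handling, all omitted here.

```python
import json

AS_CONTEXT = "https://www.w3.org/ns/activitystreams"
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def create_note_activity(actor_url, note_id, content):
    """Build a Create activity announcing a new, publicly addressed Note."""
    note = {
        "id": note_id,
        "type": "Note",
        "attributedTo": actor_url,
        "content": content,
        "to": [PUBLIC],
    }
    return {
        "@context": AS_CONTEXT,
        "id": f"{note_id}/activity",  # hypothetical ID scheme
        "type": "Create",
        "actor": actor_url,
        "to": [PUBLIC],
        "object": note,
    }


activity = create_note_activity(
    "https://example.org/users/alice",
    "https://example.org/notes/1",
    "Hello, Fediverse!",
)
body = json.dumps(activity, indent=2)
```

The resulting JSON document is the kind of payload one server posts to another server's inbox when a user publishes a post.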
WASM is a compiled set of instructions that can run in the browser, and it can be used as a compilation target. So it's possible to compile CPython to WASM, as Pyodide does.
It's possible to write your own WASM interpreter, which can run those programs compiled to WASM. You can even write such a WASM interpreter in Python.
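To hint at what the core of such an interpreter looks like, here is a toy sketch, not the speaker's code. It evaluates a handful of WASM's stack-based instructions given as Python tuples. A real interpreter would also decode the binary module format, validate types, and handle control flow, memory, and function calls. The `run` function and the tuple encoding are assumptions made for the example.

```python
MASK = 0xFFFFFFFF  # i32 arithmetic wraps around modulo 2**32


def run(instructions, local_vars=()):
    """Execute a flat list of (opcode, *immediates) tuples; return the stack."""
    stack = []
    for op, *args in instructions:
        if op == "i32.const":
            stack.append(args[0] & MASK)
        elif op == "local.get":
            stack.append(local_vars[args[0]] & MASK)
        elif op in ("i32.add", "i32.sub", "i32.mul"):
            # The right-hand operand was pushed last, so it pops first.
            rhs, lhs = stack.pop(), stack.pop()
            if op == "i32.add":
                result = lhs + rhs
            elif op == "i32.sub":
                result = lhs - rhs
            else:
                result = lhs * rhs
            stack.append(result & MASK)
        else:
            raise NotImplementedError(op)
    return stack


# (a + b) * 2, with a and b supplied as locals 0 and 1.
program = [
    ("local.get", 0),
    ("local.get", 1),
    ("i32.add",),
    ("i32.const", 2),
    ("i32.mul",),
]
result = run(program, local_vars=(3, 4))
```

With locals 3 and 4, the program leaves a single value, 14, on the stack.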
You can imagine what happens next: we're going to run Python in WASM in Python... and maybe more!
“I love breaking changes!” said no developer, ever. Backwards incompatibility can be disruptive to a community, but sometimes it’s a necessary evil in the long run to move forward.
Haystack is a free open source Python LLM framework. It was launched in 2020, before LLMs were cool. In 2023 we decided to undergo a major re-architecture, culminating in the release of Haystack 2.0. It wasn't an easy decision. By involving the open source community in our design process early on, we are confident we built a more usable, flexible foundation for years to come.
In this talk you'll learn how we designed abstractions with the right level of flexibility / composability in the rapidly changing LLM landscape. We'll not only show you the new features Haystack 2.0 provides, but we will also give you a peek into our future roadmap. You'll walk away with a better understanding of how modern LLM frameworks can help you solve problems at scale with LLM technologies, as well as an enriched understanding of how to think for the long-term when building for an open source community.
Introducing Magql (pronounced "magical"), a framework for defining GraphQL APIs, including generating from SQLAlchemy models and integrating with Flask. Magql is extensible, looks familiar to developers who are used to Flask, and provides convenience on top of the "official" Python GraphQL library. I'll discuss why I decided to write this library, and how I went about it, including some of the technical challenges presented by GraphQL's complexity. GraphQL's nested schema and query language allow for some unique and weird possibilities compared to traditional HTTP/REST APIs, and I'll show off some clever examples that demonstrate its potential. But that complexity can also lead to difficulties. Ultimately, the decision between GraphQL and REST is not clear cut, and this talk will also discuss some of our findings after using Magql and GraphQL in production, and where we want to go next.
Python is widely used as both a professional software engineering language, and as a "learning" language that many programmers get started with. But in the early days of computing there was BASIC, the Beginner's All-purpose Symbolic Instruction Code. BASIC's original developers, Kemeny and Kurtz at Dartmouth College, created it as a language that all students, from every field, could use, and that led to its widespread adoption. It quickly became the de facto home computer language, and was pre-installed on virtually all computers right up through the year 2000.
We'll review the syntax similarities and differences between Python and BASIC, compare the source code and execution of several identical programs written in both languages, and discuss the implementation of their respective "compilers".
Maybe BASIC can still teach us, and future language designers, some lessons?
In 2020, xkcd published Dependency, which posited that "all modern digital infrastructure" is ultimately transitively dependent on "a project some random person in Nebraska has been thanklessly maintaining since 2003".
How can we find these projects and ensure that their maintainers get the thanks and — more importantly — the resources they need?
Close of proceedings, Sunday