ai-learning

how to build a ChatGPT

pretraining -> base model

next-token prediction training on internet text —> an internet document simulator (just like the “exposition in a book”)
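
A minimal sketch of what a base model does, assuming the Hugging Face transformers package and the small gpt2 checkpoint (a real pretraining-only model, no post-training): it just keeps predicting the next token, i.e. it simulates an internet document.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a "base" model: pretraining only, no SFT, no RLHF.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt")

# Generation is just repeated next-token prediction: the model continues
# the text the way an internet document plausibly would.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0]))
```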

post training (supervised fine tuning) -> SFT model

conversation training —> an assistant (just like the “example problems & solutions in a book”)

Same algorithm as pretraining. Just different datasets.
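
A sketch of the “same algorithm, different dataset” point: conversations are flattened into plain token streams and trained on with ordinary next-token prediction. The <|im_start|>/<|im_end|> markers below follow the ChatML-style convention some models use; the exact special tokens are an assumption and vary by model family.

```python
# SFT reuses next-token prediction; only the data changes.
def render_conversation(messages: list[dict]) -> str:
    """Flatten a conversation into one plain string for next-token training."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return text

convo = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
print(render_conversation(convo))
# Training then proceeds exactly like pretraining: predict this text token
# by token. Conversations instead of web pages, same algorithm.
```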

We can add different kinds of conversations for the LLM to learn and simulate later. These conversations can be used to

  • know how to answer questions (simulate the conversation vibe),
  • mitigate hallucination,
  • teach the LLM self-recognition (what it is),
  • etc.

reinforcement learning -> RL model (the magic part)

given a problem & its answer, train the LLM to discover solutions through practice (just like the “practice problems in a book”)

From this point, it’s not imitation anymore. It’s “thinking” by itself, and it may go beyond what humans have written.
In this stage, it’s trying to build its own process for reaching the solution. It’s reasoning, not simply predicting & imitating.
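
A toy sketch of the RL loop on a verifiable problem: sample many attempts, check them against the known answer, and reinforce only the attempts that worked. Here sample_solution is a hypothetical stand-in for an LLM generating one attempt.

```python
import random

def sample_solution(problem: str) -> tuple[str, str]:
    """Stand-in for an LLM rollout: returns (reasoning steps, final answer)."""
    steps = random.choice([
        "3*4=12, 12+1=13",   # correct reasoning
        "3*4=11, 11+1=12",   # wrong intermediate step
        "lucky guess=13",    # right answer without real reasoning
    ])
    return steps, steps.rsplit("=", 1)[-1]

problem, correct = "What is 3*4+1?", "13"
winners = []
for _ in range(16):                     # many rollouts per problem
    steps, answer = sample_solution(problem)
    if answer == correct:               # verifiable reward: reached the answer?
        winners.append(steps)

# Real RL training would now nudge the weights so the winning solutions
# become more probable (e.g. a policy-gradient update); here we just report.
print(f"{len(winners)}/16 rollouts succeeded; reinforce those.")
```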

unverifiable domains

RLHF (Reinforcement Learning from Human Feedback) is one solution: train a reward model on human preference rankings and use its score as the RL reward. However, this has a downside (the LLM will game the scoring system), so it can only be run briefly, as a fine-tuning step.
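
A toy illustration (not a real RLHF pipeline) of why a learned score gets gamed: once the reward is an imperfect proxy, an optimizer finds degenerate outputs that max the score without being genuinely good.

```python
# A deliberately bad stand-in for a learned reward model: imagine it was
# trained on data where "haha" correlated with jokes humans ranked highly.
def proxy_reward(joke: str) -> float:
    return joke.count("haha") + 0.1 * len(set(joke.split()))

honest = "Why did the scarecrow win an award? He was outstanding in his field, haha."
gamed = "haha " * 50   # the degenerate output an optimizer eventually discovers

print(proxy_reward(honest))  # modest score
print(proxy_reward(gamed))   # huge score for pure nonsense
# This is why RLHF runs only briefly as fine-tuning: optimize the proxy too
# hard and the model drifts toward junk the reward model happens to love.
```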

Key Takeaways

  • tokenization: text -> bytes -> symbols that compress the bytes further. Happens before any real LLM “thinking”. It’s deterministic and depends on the symbol vocabulary and algorithm used. A token is not a letter (see the tokenization sketch after this list).
  • know? predict?: the LLM doesn’t “know”. It’s just predicting (inference) based on probabilities and simulation.
  • hallucination mitigation:
    • approach 1: use model interrogation to discover the model’s knowledge boundary -> collect questions the LLM doesn’t know -> train it to reply “I don’t know”
      • i.e. use model interrogation to discover the model’s knowledge, and
        programmatically augment its training dataset with knowledge-based
        refusals in cases where the model doesn’t know

    • approach 2: allow the model to search, just like we first search our memory, fetch the related data, and then answer.
      • search and code execution are tools provided around the LLM. When the model decides something needs a tool (e.g. a search) to finish, it emits special tags that act as a protocol within the text, e.g. <SEARCH_START>Who is Orson Kovacs?<SEARCH_END>. When the serving program encounters such a tag, it pauses generation, runs the tool, pastes the result back into the context window, and then resumes generation (see the tool-loop sketch after this list).
  • use tools as much as possible:
    • it costs fewer tokens: writing code and executing it is easier and more reliable than computing the answer in the LLM’s “mind”.
    • add “use code” explicitly to the prompt to make the LLM use the code tool instead of its mental computation.
    • models are not good at counting or spelling, because they process tokens, and a token is not a letter.
  • vague recollection vs working memory: this is why, even when you think the model already has some knowledge (e.g. testing best practices), you still need to add it to the prompt to get a better answer.
    • knowledge in the parameters == vague recollection (e.g. of something you read a month ago)
    • knowledge in the tokens of the context window == working memory

  • models need tokens to think: each generated token passes through a finite number of computation layers, so the compute available per token is fixed. A complex question can’t be answered well within the compute budget of a single token. Instead, split the complex question into multiple sub-questions, let each get its own tokens (and thus its own passes through the layers), and finally summarize the intermediate outputs; you are far more likely to get a good answer (see the compute-budget sketch after this list).
    • this is why, for complex issues, you should split the work into multiple stages/workflows to finally get a good answer.
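
The tokenization sketch referenced above, assuming the tiktoken package and its cl100k_base vocabulary (other models use other vocabularies):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
print(ids)                              # a handful of integer token ids
print([enc.decode([i]) for i in ids])   # e.g. ['str', 'aw', 'berry']

# A token is not a letter: the model never "sees" individual characters,
# which is one reason counting the r's in "strawberry" is hard for it.
```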
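
The tool-loop sketch referenced above. The tag names reuse the <SEARCH_START>…<SEARCH_END> example from these notes; fake_search and stub_model are hypothetical stand-ins for a real search backend and a real LLM.

```python
import re

TOOL_TAG = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)

def fake_search(query: str) -> str:
    return f"[top result for {query!r}]"

def generate_with_tools(model_generate, prompt: str) -> str:
    text = model_generate(prompt)
    match = TOOL_TAG.search(text)
    if match:                                   # the model asked for a tool:
        result = fake_search(match.group(1))    # 1. pause and run the tool
        text += f"\n<SEARCH_RESULT>{result}<SEARCH_RESULT_END>\n"
        text += model_generate(text)            # 2. resume with result in context
    return text

# Stub "model" so the sketch runs end to end:
def stub_model(context: str) -> str:
    if "<SEARCH_RESULT>" in context:
        return "Based on the search result, Orson Kovacs is ..."
    return "<SEARCH_START>Who is Orson Kovacs?<SEARCH_END>"

print(generate_with_tools(stub_model, "Who is Orson Kovacs?"))
```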
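
The compute-budget sketch referenced above: a back-of-the-envelope illustration (the layer count is an assumed, illustrative number) of why more output tokens means more total computation to spend on a problem.

```python
LAYERS_PER_TOKEN = 96   # illustrative depth of a large transformer

def layer_passes(num_output_tokens: int) -> int:
    """Each generated token gets exactly one pass through the layer stack,
    so total 'thinking' compute grows linearly with tokens produced."""
    return num_output_tokens * LAYERS_PER_TOKEN

print(layer_passes(1))    # answer squeezed into one token: 96 layer-passes total
print(layer_passes(200))  # step-by-step reasoning: 19,200 layer-passes total
```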

Questions:

  1. In-context learning: why can the model understand the context? Normally it answers questions mainly because it’s imitating (which is also predicting).
  2. How are the post-training conversations injected into the model? Also by training, through a process similar to pre-training? Is the only difference the dataset: in pretraining the training data is internet text, and now it’s conversation text?
  3. The model is only doing inference. Then why can it take actions, e.g. copy/paste, etc.?
  4. When the LLM uses tools, e.g. a browser or code, how does it work internally? Is it like this: during token computation (the neural network) it first figures out that a tool is needed and emits xml tags, then the computation pauses and another process is started to call the tool?
  5. Instead of using tokens built from bytes, can we just use a letter as a token? Especially for Chinese: there are only around 100k Chinese characters, so each character could work as one token.
  6. Why is markdown chosen as the main language for prompts? Does the model really understand this structure well?
