October 23, 2012

Basic language structures project

Podcast transcription:

Hello, this is Vladimir, coming to you from Taipei, Taiwan. It's been a very long time since I did a recording for my podcast and quite a while since I wrote an article as well, so I decided that this has to end and that I have to dedicate more time to my blog and my podcast.

There were several reasons for not writing and not recording. Mostly, since the last time I wrote an article, which was probably sometime in March or April, we've been having wonderful weather and I just felt bad whenever I spent time inside. Another reason is probably that it takes me a lot of time to come up with an interesting topic, something that I personally think is worth writing an article about, let alone doing a recording about. It also takes a lot of time to work on the article itself and work out the details, and if it's a recording, it takes a lot of time to make it look good and sound good. So whenever I was out in the beautiful weather thinking that I should go back and work on the recordings, I just postponed and postponed, basically until now.

The summer is over, unfortunately, but that's a good thing for my podcasting projects. In the meantime, of course, I was learning some languages, mainly Farsi and Polish. And because I love languages and love to think about how they work, how we learn them and how we can learn them faster, I came up with several interesting ideas, and some of them actually turned into small research projects. I haven't written an article this time, but I would at least like to do a podcast recording to introduce one of these projects, because it is a long-term thing and it will take a long time until it's really finished and I can post the results. So I thought I could at least do a little recording and talk about this project a bit.

First, I'd give you the name, but it actually doesn't even have a name because it's so complicated. I guess I could call it research of the basic startup vocabulary and lexical structures necessary for a person to speak a language at basic fluency. To give you some background: back when I was learning Italian, and later mainly when I was learning Polish and then Farsi, I realized that I was looking for the same types of structures, which kept popping up here and there in the language, and that these structures were very similar in all of the languages, whether it was Chinese, Italian or Polish. Back when I was learning Italian I actually did a little project where I tried to sample all the vocabulary that I was using on a daily basis in Italian, but I should have extended it and mapped the expressions as well, not only the vocabulary.

When I started learning Polish and Farsi I changed my approach from a scholastic one, which is learning a language through grammar and textbooks and basically going from simple to difficult, to a system which I guess is called natural acquisition of a language. I had used it at later stages when I was learning Chinese and it proved to be very successful; at least for me personally, I thought it was working out really well. So when I was learning Polish I had a great friend online on Skype, and from lesson one I was just trying to talk to him in Polish. Our lessons were basically me trying to say something in Polish and him telling me how to say it, without really thinking about grammar at all, then writing these things down, reviewing them and moving on.

Just to give you an idea, we would have a lesson and probably the first sentence that I would want to say was, „Well, how do you say ‚hello‘?“ He would say, „Well, this is how you say it“, I would write it down and then we would proceed. Of course ‚hello‘ is one of the simplest and most basic things you can say, but even at the basic stages you need sentences like „How do you say ‚book‘ in Polish?“ So this is a structure, right: „how do you say something in the language“. It's a very useful structure, because in most languages, if you want to learn through discussion and not from a textbook, you are going to use it very often. And whether you realize it or not, there is quite a bit of grammar and other things in this simple sentence „How do you say ‚book‘ in Polish?“, but if you learn it as a structure you just bypass all of this. That is how natural acquisition methods work and what they take advantage of.

So anyway, I was proceeding through the language and writing down these expressions. Later I started learning Farsi and decided to do it in the very same way: just find a conversation partner, start talking, try to say whatever I wanted to say, write down these expressions, then review them one by one and just continue and continue.

So what I basically produced was two lists. One of them, the Polish list, is very long: it has the basic expressions from lesson 1 to at least lesson 25 or 30, so there are quite a few expressions in it. The other list is in Farsi and has 291 expressions in it so far. Because I had the experience with Italian and Chinese, I knew that in Polish and Farsi I would more or less be looking for the same things, and I just wanted to map these expressions out and see how frequent they are and when they occur in the language learning process.

So when I decided that I would like to research these expressions and sort of systematize them, I had two tasks. The first was simply to sample them, which I was already doing: I was making these lists as we were having our lessons. The second was to come up with an organized list, and I will talk later about the technicalities of how this is done and why it takes such a long time.

My aim was to come up with a list of the first 1000 or 2000 expressions that people come across naturally when they start picking up a language, and what I realized when I was putting these lists together was that there is a tremendous amount of analysis you can do with this data. For instance, if you have data of this sort, where an adult is learning a language such as Polish or Farsi, you can compare which expressions came earlier in the learning process and which came later. Of course the order of occurrence is very individual, because maybe the topics that you talked about in the first lesson required a bit more past tense, or maybe more conditional, and so on. But you can, for instance, compare the Polish and Farsi lists with the Chinese one and see which expressions came first and which came later. Because it is so individual, it is basically a list which relates to my own language acquisition process, specifically for Polish and Farsi, so I realized I had to come up with a second point of reference. That point of reference is a frequency list of these expressions. So when I'm done with this, and when I hopefully speak at a basic fluency level, I'm going to record several hours of dialogues, transcribe them into written form, run an analyzer program on the data and see how frequently these expressions occur.

So ideally what I will have at the end will be two lists. The first will be an order-of-occurrence list of these expressions: which ones came first and which ones came later in the language learning process. The second will be a frequency list of the same expressions, based on how often they pop up in dialogues and real-life situations. In the end I will combine these two lists and come up with a final list of expressions based on their frequency and their order of occurrence.

As you can imagine, this is not a very easy thing to do and there are several problems. Just to name a few, take the data entries themselves. Say we have the expression „I forgot what I wanted to tell you“. This is a set phrase that we use very often. When I put it in as an entry, should I write „to forget what one wants to say“, which would be the neutral form, or enter it as I encountered it, „I forgot what I wanted to tell you“? In this case that would probably also be the most frequent form, but this is not the case for every expression. Sometimes a different form of the same expression is more frequent than the one that I used, so the question is which one I should enter into the list. Is it the neutral form? Is it the form that I encountered the expression in? Or is it the most common form, the one that most people use?

Then of course there is a technical thing, which I will come back to later: once I have the two lists, the frequency list and the occurrence list, which one will have priority when I combine them? I will talk later about how I'll actually do it, because I don't want to do it manually; I'd like to have a program do it for me. Will the occurrence list have priority over the frequency list or the other way around, or will there be a ratio of priority between them, say 60 to 40 or 70 to 30? I don't know. This is something I have to figure out, and maybe I will really end up doing all of this manually, because for me personally it is a very sensitive thing to come up with a list like that. If I really want good results in this sort of research I need to put in as much effort as I can, so I will probably end up doing these things manually.
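
One way the priority ratio could work is sketched below in Python. This is only an illustration under assumptions, not the program described in the recording: the function name `combine_rankings`, the default 60 to 40 weighting and the sample expressions and counts are all hypothetical. Each expression gets a rank in the occurrence list and a rank in the frequency list, and the final order is a weighted blend of the two ranks.

```python
def combine_rankings(occurrence_order, frequency_counts, occurrence_weight=0.6):
    """Blend order-of-occurrence rank and frequency rank into one ordering.

    occurrence_order: expressions in the order they first came up in lessons.
    frequency_counts: expression -> number of times it was counted in dialogues.
    occurrence_weight: share given to the occurrence list (0.6 means 60 to 40).
    """
    frequency_weight = 1.0 - occurrence_weight

    # Most frequent first; sorted() is stable, so ties keep their occurrence order.
    by_frequency = sorted(occurrence_order,
                          key=lambda e: -frequency_counts.get(e, 0))

    occurrence_rank = {e: i for i, e in enumerate(occurrence_order)}
    frequency_rank = {e: i for i, e in enumerate(by_frequency)}

    def score(expression):
        # Lower score means the expression ends up earlier in the final list.
        return (occurrence_weight * occurrence_rank[expression]
                + frequency_weight * frequency_rank[expression])

    return sorted(occurrence_order, key=score)


# Hypothetical sample data, for illustration only.
order = ["how do you say", "I think", "it depends"]
counts = {"how do you say": 2, "I think": 10, "it depends": 5}
print(combine_rankings(order, counts, occurrence_weight=0.6))
```

Setting `occurrence_weight` to 1.0 reproduces the occurrence list unchanged and 0.0 reproduces the pure frequency order, so the ratio can be tried out at several values before settling on one.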

The next thing is: how will I know when I have enough expressions? I set my goal at about 1000 to 1500 expressions, but if I actually hit the mark of 1500 expressions and say, „This is enough. I can start combining the two lists and coming up with the end result“, is it really going to be enough? Are the expressions that made it into the first 1500 really the most important ones, or are there some that I just didn't get the chance to use and that are much more frequent in real-life speech? So this is another problem that I will have.

Then there are some small technical issues related to the frequency lists. I told you that I will record several hours of spoken Farsi, translate that into English, and end up with a text file which will serve as a databank and which I will use with some software that counts the frequency of the expressions in it. Let's take the expression that I used before, „I forgot what I wanted to tell you“. This is a very long expression: that's eight words. If you only want to analyze word frequency, it's very easy. The program basically tells you how many times a word occurred in the text, and it knows that it's a word because there is always a space before or after it. Whatever is between two spaces is counted as one data entry. But if I have an expression like „I forgot what I wanted to tell you“, it is of course going to be much more difficult. I will probably have to go through the text and correct the data manually, for instance delete the spaces between the words so that the expression forms one long word. And because the expression may appear in a different form, such as „He forgot what I wanted to tell you“ or „She forgot what they wanted to tell us“, which is still the same structure, I'll probably also have to edit the text file so that the expressions are in their most frequent forms. So it's going to take a long time.
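
As an alternative to editing the text file by hand, the counting could be sketched with pattern matching. The Python below is a rough illustration under assumptions, not the software mentioned in the recording: the pronoun lists, the two sample entries and the function names are hypothetical, and a real list would need a pattern for every expression and every form it takes.

```python
import re
from collections import Counter

# Hypothetical pronoun slots; a real list would need every form that occurs.
SUBJECT = r"(?:i|you|he|she|we|they)"
OBJECT = r"(?:me|you|him|her|us|them)"

# Hypothetical entries: neutral form -> pattern that matches its variants.
EXPRESSIONS = {
    "to forget what one wanted to tell someone":
        rf"\b{SUBJECT} forgot what {SUBJECT} wanted to tell {OBJECT}\b",
    "how do you say X": r"\bhow do you say\b",
}


def word_frequency(text):
    """Plain word counting: each run of letters is one data entry."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def expression_frequency(text, expressions=EXPRESSIONS):
    """Count each multi-word expression, whatever pronouns fill its slots."""
    lowered = text.lower()
    return Counter({name: len(re.findall(pattern, lowered))
                    for name, pattern in expressions.items()})


transcript = ("I forgot what I wanted to tell you. "
              "She forgot what they wanted to tell us. "
              "How do you say book in Polish?")
print(expression_frequency(transcript))
```

Both variants of the „forgot“ sentence are counted under the same neutral entry, which is the effect the manual editing was meant to achieve.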

There are several other things. Once I produce this list, if I ever do, there are other things that I will be able to do with that sort of data. For instance, once I have the data from the dialogues I could analyze how many times the first, second and third person of verbs occurred, how many times I used plural or singular, how many times the past tense or future tense occurred, and many other things, and for me personally this is very interesting. There probably have been studies like this, but I personally have never seen a textbook mention how often the second person plural is used. I have the feeling, and this is just a very personal feeling, that when I use verbs I mostly use the first, second and third person singular and much less the second person plural. It might not seem like such a big deal, but when you're a beginning student, especially if your native language is not Indo-European, if you're Chinese for example, and they throw all three persons and two numbers of the verb at you, you are overwhelmed. If you're learning a language like Italian, where you have three verb groups, I think it would be pretty useful for students to know that there are three persons and two numbers, but that the second person plural, for instance, is not used so often in, let's say, direct speech, or that the third person singular and plural are used much more often in the news. Then you can look out for that and concentrate on doing your best to remember the first person singular, and so on.

Another thing this list could serve as is a basis for a natural acquisition language learning course. I know that there are people who use word lists, gold lists or the 10,000 sentences method, but I don't really know how these lists came about, or who decided, and how, which sentences make it into a 10,000 sentences list and which do not. I have a feeling that to get you off the ground and get you speaking you need much less than 10,000 sentences. I'm not saying it's going to be only 1500 sentence structures, but I'm saying it is much less than 10,000 sentences.

The whole point of this is basically to get you to a basic fluency level, which for me personally is the level where you can start enjoying the language learning process, because all you have to do is talk to people and learn most things from context. To get there, you could just memorize these expressions; they would get you off the ground and then let you learn the language in the natural way, without really having to think too much about grammar or about what is going on in the sentence structure.

The only thing that you will need to learn is which part of the structure is permanent and which part is a substitutable block; usually these blocks are nouns, adjectives, other verbs and so on. So, like I said, if you have the expression „I forgot what I wanted to tell you“, then you know that you can change all the pronouns: „He forgot what she wanted to tell us“, and so on.
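
The split between the permanent part and the substitutable blocks can be pictured as a template with slots. The short Python sketch below is only an illustration; the slot names and the word choices are hypothetical and not part of the project's lists.

```python
from itertools import product

# The permanent part of the structure, with named slots for the blocks.
TEMPLATE = "{subject} forgot what {other} wanted to tell {listener}"

# Hypothetical fillers for each substitutable block.
SLOTS = {
    "subject": ["I", "He"],
    "other": ["I", "she"],
    "listener": ["you", "us"],
}


def fill(template, slots):
    """Yield every sentence obtained by filling the slots of one structure."""
    names = list(slots)
    for combination in product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, combination)))


for sentence in fill(TEMPLATE, SLOTS):
    print(sentence)
```

One memorized structure with two fillers per slot already gives eight sentences, including both examples mentioned above.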

Another thing that I could do is translate this list into other languages. The basis will be English. I don't know how well this will work until someone tries it out, so I don't even know whether it's useful as a language learning tool or how useful it will be, but by translating it into other languages you will at least see that these expressions do exist in all of these languages, how they are used and how they are said.

Of course with languages as different as Chinese there's going to be a whole bunch of expressions that Chinese has and we don't, simply because of cultural differences or because Chinese works in a completely different way. But I have realized that even in a language as distant as Chinese I'm looking for the same things to say, and you can say these things in Chinese as well, just usually in a very different way. And this is actually one of the things that I want to achieve later on, if this list ever makes it: to translate the list into Chinese.

In conclusion I just want to say that obviously this is going to take me forever. Of my two main order-of-occurrence lists, the Polish one is ready but the Farsi one is not, and I'm quite far from completing it, so it's going to take a while, but I'm working on it steadily. I have three Farsi lessons per week and I add about 20 to 30 expressions every lesson. So steadily I'll get there, and by the time I do I hope to come up with a computer program that will help me combine the two lists and another program that will help me with the frequency computations on the dialogue data.

In the end, writing the article is of course also very difficult for me, because English is not my native language and writing articles in English is more difficult than writing them in Slovak. There is always the option to write the article in Slovak first and then translate it into English, but either way it takes a lot of time, so I really don't know when I'm going to finish this. I just really hope that I will, because I think it's an interesting project, at least for me to work on.

I know that this has been done before many times. There are linguistics departments all over the world that research language acquisition, second language acquisition, native language acquisition, and they definitely also research the necessary vocabulary, frequency and so on. I'm not influenced by them, so I don't know what methods they use, and I wanted to come up with this basic expressions list with my own system, without this influence, and maybe compare what I come up with to what they come up with. I know that there are very professional people working in these departments who definitely speak a lot of languages. Maybe I have the advantage that I've been learning languages as different as Chinese, Polish, Farsi and Italian, so I can combine the expressions and the knowledge of these languages and come up with a list which would maybe be ordered a bit differently from what the linguistics departments would come up with.

So that’s about it for the project. Like I said I don’t know when I'm going to finish it. It can take a very long time but I hope I do and in the meantime I would like to do some more recordings maybe with topics that would not be so specific, maybe talk a little bit about motivation when it comes to language learning, maybe talk about pronunciation and all these things that a lot of people ask me about and so I hope to talk to you soon, take care and have a good time. Bye bye.


  1. Hi Vladimir!

    Greetings from Poland and good luck with learning Polish :) How is the learning going, and why did you start learning Polish in the first place?


  2. Hi Aga :)

    Thank you very much for the comment. Unfortunately my Polish studies aren't going very well, because over the last few months I haven't had much time or motivation :( Lately I have the motivation, but still not the time :) I'll try to devote more time to studying.

    I don't remember very well any more why I started learning Polish, but I think it's because it is similar to Slovak and sounds good :)

    Thank you for the greetings.


  3. Great endeavor, Vladimir. I for one would be very interested in such a list...

    The thing is, we find many vocabulary lists on the internet, either basic (such as http://www.towerofbabelfish.com/Tower_of_Babelfish/Base_Vocabulary_List.html) or more advanced, but rarely do we see lists of expressions, and usually those are only adequate for beginners.

    If you want to make this kind of work useful for others without it being overwhelming, you should have a version of the expressions in English, even though it's not your mother tongue! Concerning the frequency question, why not use Google to find out how frequent they are? Searching for them (with quotes in order to match the exact expression) will give you a good estimate of how often they are used.

    Final thing: Isn't 1000 or 1500 expressions a big number? Why not divide them into smaller chunks of 50s or 100s, and share them gradually?

    Thanks for the article,

  4. Dear Antoine,

    thank you for the comment.

    I'm still working on this project. I haven't given up:)

    Actually, even though the languages that I'm sampling are Farsi and Polish, I write the entries into my Excel file in English.

    Thank you for the Google suggestion. It might be a good idea to use it in case two expressions have a similar count in my statistics.

    I will definitely share the results of this small project if I finish it, but I'd like to wait until I have enough data. I'm afraid that if I shared only 50 expressions at a time, they might be out of order with the full list.

    To give an idea of what sort of expressions I'm looking for, I'll just randomly pick some expressions that showed up in my first Farsi practice sessions. The whole list later will have to be adjusted for frequency:

    entonces/donc/quindi/deswegen equivalent
    I think..
    it depends
    to be similar
    that's why ..
    what does it mean?
    up until now
    I'd like to
    How do you say .. in Farsi?
    not yet
    That's different
    to tell the truth, ..
    I can/I'm able to ..
    I'm trying to ..
    It's impossible
    I don't feel like ..
    Unfortunately I ...
    A long time ago
    I'll do my best
    Like this/like that

  5. Thank you for sharing this Vlad. Are you still working on this project? If so do you have a version in Italian that you can share with us? Thank you again for your fantastic tips. It made my jogging this morning a little more interesting and stimulating :)

  6. Hello dear Hauzer. I'm very happy that someone is interested in this project!:) Unfortunately I do not have an Italian version and I am not working on the project any more. I was seriously demotivated to learn languages for about a year and a half and am only slowly recovering:) If I take on this project again, I'll think of the Italian translation too and of course share it with anyone interested. Take care. Vladimir

  7. Hi Vladimir! I've recently come across your YouTube page and I found your language tips to be very useful. I've already started listening to more vloggers and reading more, out loud as well! :) Thank you for the inspiration! A little kick is what I needed to start learning languages again..
    The project looks very interesting. Are you still working on it?
    Best regards,

  8. Hello Angie,

    thank you:)

    I'm still working on the project, but not very intensively. I do plan to build up a language course based on it though later this year.